Friday 29 August 2014

Pig UDFs in Jython and Python

In this post we will see how to write Pig UDFs in Jython and Python, using a running example: an n-grams generator.

Problem Definition:-
Given a concatenated string with some delimiter, generate n-grams.

For example:-
concatenatedString = 'a_b_c_d'
1-grams:- a, b, c, d
2-grams:- a b, b c, c d
3-grams:- a b c, b c d
4-grams:- a b c d
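
As a rough illustration of the problem definition, the n-grams can be produced with a sliding window over the split tokens. This is only a minimal sketch in plain Python, not the Pig UDF itself; the function name `ngrams` is made up for this example, and it joins tokens with a space to match the listing above.

```python
def ngrams(text, delimiter, n):
    """Return all n-grams of the delimited tokens in `text`, joined by spaces.

    `ngrams` is a hypothetical helper used only for illustration.
    """
    tokens = text.split(delimiter)
    # A window starting at index i is complete only while i + n <= len(tokens),
    # so the last valid start index is len(tokens) - n.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Print the 1-grams through 4-grams of the example string.
for n in range(1, 5):
    print(n, ngrams("a_b_c_d", "_", n))
```

If n is larger than the number of tokens, the range is empty and the function returns an empty list.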

Jython UDF to do the above computation:-

@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    listOfNGrams = []
    # Split the input into its individual tokens.
    allCategories = str(concatenatedString).split(delimiter)
    for i in range(len(allCategories)):
        # Keep only full-length windows; slices near the end are shorter.
        if len(allCategories[i:nGramValue + i]) == nGramValue:
            tVariable = ''
            for eE in allCategories[i:nGramValue + i]:
                tVariable += '\t' + eE
            # strip() drops the leading tab added by the loop above.
            listOfNGrams.append(tVariable.strip())
    return listOfNGrams
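
Before registering the script with Pig, it can be handy to try the function in a plain Python interpreter. The sketch below assumes that the `outputSchema` decorator is only available when Pig runs the script, so it defines a no-op stand-in when the name is missing. The stand-in is an assumption made for local testing and is not Pig's own implementation.

```python
try:
    outputSchema  # expected to be defined when Pig runs the script
except NameError:
    def outputSchema(schema):
        """Hypothetical stand-in: records the schema and returns the function."""
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    listOfNGrams = []
    allCategories = str(concatenatedString).split(delimiter)
    for i in range(len(allCategories)):
        # Keep only full-length windows.
        if len(allCategories[i:nGramValue + i]) == nGramValue:
            tVariable = ''
            for eE in allCategories[i:nGramValue + i]:
                tVariable += '\t' + eE
            listOfNGrams.append(tVariable.strip())
    return listOfNGrams


# Tokens inside each n-gram are tab-separated, so repr() makes them visible.
print(repr(nGramsOnConcatenatedText('a_b_c_d', '_', 2)))
```

Note that the UDF joins the tokens of each n-gram with a tab, so the 2-grams of 'a_b_c_d' come back as 'a\tb', 'b\tc' and 'c\td' rather than space-separated.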
     
Content of the file "test.txt":-
a_b_c_d
1_2_3_4
a_b_c_d