In this post we are going to see how to write Pig UDFs in Jython and Python, using a running example: an n-gram generator.
Problem Definition:-
Given a concatenated string with some delimiter, generate n-grams.
For example:-
concatenatedString = 'a_b_c_d'
1-grams:- a, b, c, d
2-grams:- a b, b c, c d
3-grams:- a b c, b c d
4-grams:- a b c d
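The listing above can be reproduced with a sliding window over the split tokens. What follows is a minimal plain-Python sketch, not the UDF itself: the helper name `ngrams` is made up for illustration, and it joins tokens with a space only to match the listing.

```python
def ngrams(text, delimiter, n):
    """Return every run of n consecutive tokens, joined by a space."""
    tokens = text.split(delimiter)
    # There are len(tokens) - n + 1 full windows; range() is empty when n is too large.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Print the 1-grams through 4-grams of the example string.
for n in range(1, 5):
    print(ngrams('a_b_c_d', '_', n))
```

Asking for more tokens than the string contains (for example n = 5 here) simply yields an empty list.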
Jython UDF to do the above computation:-
@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    # Split the input into tokens, e.g. 'a_b_c_d' with '_' gives ['a', 'b', 'c', 'd'].
    allCategories = str(concatenatedString).split(delimiter)
    listOfNGrams = []
    for i in range(len(allCategories)):
        window = allCategories[i:nGramValue + i]
        # Skip the short windows at the end of the list.
        if len(window) == nGramValue:
            # Pig bags hold tuples, so wrap each tab-joined n-gram in a 1-tuple.
            listOfNGrams.append(('\t'.join(window),))
    return listOfNGrams
Content of the file "test.txt":-
a_b_c_d
1_2_3_4
a_b_c_d
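Before registering the UDF with Pig, the logic can be tried on these sample rows with an ordinary Python interpreter. The sketch below rests on assumptions that are not part of the original post: a no-op stand-in for `outputSchema` (Pig supplies the real decorator when it loads a Jython script), an in-memory list `sampleLines` in place of reading test.txt, and a condensed variant of the UDF that returns each n-gram as a one-element tuple, on the assumption that Pig wants tuples as bag members.

```python
try:
    outputSchema  # provided by Pig when the script is registered as a Jython UDF
except NameError:
    def outputSchema(schema):
        # Stand-in for local runs only: hands the function back unchanged.
        return lambda func: func


@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    allCategories = str(concatenatedString).split(delimiter)
    windows = [allCategories[i:nGramValue + i] for i in range(len(allCategories))]
    # Keep only full-length windows and join their tokens with a tab.
    return [('\t'.join(w),) for w in windows if len(w) == nGramValue]


sampleLines = ['a_b_c_d', '1_2_3_4', 'a_b_c_d']  # same rows as test.txt
for line in sampleLines:
    print(nGramsOnConcatenatedText(line, '_', 2))
```

Running the function locally like this only checks the Python logic; the schema string is not validated until Pig loads the script.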