Friday, 12 December 2014

Four Short Links: Dec 12, 2014

1. Tab (A Linux Shell Utility) :-
   A modern text processing language that's similar to awk in spirit.
    - Designed for concise one-liner aggregation and manipulation of tabular text data.

    - Makes no compromises on performance; aims to be no slower than traditional old-school UNIX shell utilities whenever possible.

    - Feature-rich enough to support even very complex queries. (Also includes a good set of mathematical operations.)

    - Statically typed, type-inferred, declarative.

   Note:- By the end of this weekend, I will write a blog post, "Tutorial on Tab:- A Linux Shell Utility".


2. Deep Learning Tutorial :- From Perceptron to Deep Networks :- 
    - In this tutorial, the author introduces the reader to the key concepts and algorithms behind deep learning, beginning with the simplest unit of composition and building up to the concepts of machine learning in Java.

3. Faster Apache Pig with Apache Tez
    - Apache Pig 0.14.0 was released on Nov 20th, 2014, and the good news is that Tez is now one of the execution engines.

4. 10 Data Science Newsletters to subscribe to 

Friday, 29 August 2014

Pig UDFs in Jython and Python

In this post we are going to see how to write Pig UDFs in Jython and Python. We will do this using a running example: an n-grams generator.

Problem Definition:-
Given a concatenated string with some delimiter, generate n-grams.

For example:-
concatenatedString = 'a_b_c_d'
1-grams:- a, b, c, d
2-grams:- a b, b c, c d
3-grams:- a b c, b c d
4-grams:- a b c d
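
The example above can be reproduced with a sliding window over the split tokens. The following is a minimal sketch in plain Python, not the UDF itself; the function name ngrams is hypothetical, and the tokens are joined with a space here only to match the listing above.

```python
def ngrams(text, delimiter, n):
    """Return every run of n consecutive tokens, joined by a space."""
    tokens = text.split(delimiter)
    # A window starting at index i is complete only while i + n <= len(tokens),
    # so the last valid start is len(tokens) - n.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Print the 1-grams through 4-grams of the example string.
for n in range(1, 5):
    print(n, ngrams("a_b_c_d", "_", n))
```

If n is larger than the number of tokens, the range is empty and the function returns an empty list.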

Jython UDF to do the above computation:-

@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    # Collect every run of nGramValue consecutive tokens.
    listOfNGrams = []
    allCategories = str(concatenatedString).split(delimiter)
    for i in range(len(allCategories)):
        # Skip windows near the end that are shorter than nGramValue.
        if len(allCategories[i:nGramValue + i]) == nGramValue:
            tVariable = ''
            for eE in allCategories[i:nGramValue + i]:
                tVariable += '\t' + eE
            # strip() drops the leading tab added by the loop above.
            listOfNGrams.append(tVariable.strip())
    return listOfNGrams
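
To try the function outside Pig, one option is to supply a stand-in for the outputSchema decorator, since plain Python does not define that name. The sketch below assumes this approach. The no-op decorator is hypothetical scaffolding for local testing, not Pig's own implementation, and the UDF body is repeated only so the snippet runs on its own.

```python
def outputSchema(schema):
    # Hypothetical stand-in for local testing: it records the schema string
    # on the function and returns the function unchanged.
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    listOfNGrams = []
    allCategories = str(concatenatedString).split(delimiter)
    for i in range(len(allCategories)):
        # Keep only full-length windows.
        if len(allCategories[i:nGramValue + i]) == nGramValue:
            tVariable = ''
            for eE in allCategories[i:nGramValue + i]:
                tVariable += '\t' + eE
            listOfNGrams.append(tVariable.strip())
    return listOfNGrams


# The UDF separates the tokens of each n-gram with a tab character.
print(nGramsOnConcatenatedText('a_b_c_d', '_', 2))
```

Note that the UDF joins tokens with tabs, so the 2-grams come back as 'a\tb', 'b\tc' and 'c\td' rather than with spaces.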
     
Content of the file "test.txt":-
a_b_c_d
1_2_3_4
a_b_c_d