At SequenceIQ we process data in both batch and streaming modes – for both we use Scala as our preferred language, and for batch processing in particular we use Scalding to build our jobs and data pipelines.
Scalding is a powerful tool and a great choice for simplifying and abstracting the writing of MapReduce jobs – it is an open source project originally developed at Twitter and now maintained by the community. In the following detailed example we’d like to show you how to write and test Scalding jobs running on Hadoop.
Writing a Pearson correlation job
In this example, we’d like to calculate the Pearson product-moment correlation coefficient on two columns of a given input. This is a simple computation and one of the easiest ways to measure the linear dependency between two datasets. First of all we need all the parameters for the formula. In Scala the code would look like this:
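As a minimal sketch of the arithmetic the job performs, here is the formula in plain Scala (the `Pearson` object and its `correlation` helper are illustrative names, not part of the original project; the Scalding job computes the same aggregates in a distributed fashion):

```scala
// Pearson's r = (n·Σxy − Σx·Σy) / sqrt((n·Σx² − (Σx)²)(n·Σy² − (Σy)²))
object Pearson {
  def correlation(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "inputs must be equal-length and non-empty")
    val n     = xs.size.toDouble
    val sumX  = xs.sum
    val sumY  = ys.sum
    val sumXY = xs.zip(ys).map { case (x, y) => x * y }.sum
    val sumX2 = xs.map(x => x * x).sum
    val sumY2 = ys.map(y => y * y).sum
    val num   = n * sumXY - sumX * sumY
    val den   = math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY))
    num / den
  }
}
```

The parameters computed here – the element count, the two sums, the sum of products and the two sums of squares – are exactly what the MapReduce job needs to aggregate.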
In this example we compute all the required parameters for the correlation formula using Scalding’s Fields API. First we obtain the input/output paths and the two columns to compare from command line arguments (usage: --key value) and provide the schema for the CSV input. After the input is read, we map the two selected fields to their products and squares; with that information we are able to produce the required aggregate parameters (the grouping part). At the end we just apply the formula to the given fields (the second map) and write the results into a TSV file.
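The steps above can be sketched as a Scalding job in the Fields API. This is a hypothetical reconstruction: the class name `PearsonJob`, the CSV schema and the field names are assumptions, not the original code.

```scala
import com.twitter.scalding._

class PearsonJob(args: Args) extends Job(args) {
  // the two columns to correlate, e.g. --column1 price --column2 demand
  val col1 = Symbol(args("column1"))
  val col2 = Symbol(args("column2"))

  Csv(args("input"), skipHeader = true, fields = ('id, 'price, 'demand))
    .project(col1, col2)
    // first map: products and squares of the two selected fields
    .map((col1, col2) -> ('prod, 'sqX, 'sqY)) { pair: (Double, Double) =>
      val (x, y) = pair
      (x * y, x * x, y * y)
    }
    // grouping part: aggregate every parameter of the formula
    .groupAll {
      _.size('n)
        .sum[Double](col1 -> 'sumX)
        .sum[Double](col2 -> 'sumY)
        .sum[Double]('prod -> 'sumXY)
        .sum[Double]('sqX -> 'sumX2)
        .sum[Double]('sqY -> 'sumY2)
    }
    // second map: apply the Pearson formula to the aggregates
    .map(('n, 'sumX, 'sumY, 'sumXY, 'sumX2, 'sumY2) -> 'correlation) {
      t: (Long, Double, Double, Double, Double, Double) =>
        val (n, sumX, sumY, sumXY, sumX2, sumY2) = t
        (n * sumXY - sumX * sumY) /
          math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY))
    }
    .project('correlation)
    .write(Tsv(args("output")))
}
```

Note that `groupAll` funnels everything through a single reducer, which is fine here because we are collapsing the whole input into a handful of scalar aggregates.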
To run the example, execute the following command (you can use --hdfs instead of --local):
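A possible invocation, assuming a job class named `PearsonJob` and an assembled jar named `scalding-correlation.jar` (both hypothetical names):

```shell
# --local runs in Cascading's local mode; swap in --hdfs to run on the cluster
hadoop jar scalding-correlation.jar com.twitter.scalding.Tool PearsonJob \
  --local \
  --input data/input.csv --output data/correlation.tsv \
  --column1 price --column2 demand
```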
Testing Scalding jobs
In order to test that your data transformations are correct, you can use the JobTest class for unit testing.
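A hypothetical test might look like the following, assuming a job class named `PearsonJob` that takes --column1/--column2 arguments (test-framework wiring, e.g. ScalaTest or Specs2, is omitted for brevity):

```scala
import com.twitter.scalding._

object PearsonJobSpec extends App {
  JobTest(new PearsonJob(_))
    .arg("input", "input.csv")
    .arg("output", "output.tsv")
    .arg("column1", "price")
    .arg("column2", "demand")
    // feed the CSV source with in-memory test rows instead of a real file
    .source(Csv("input.csv", skipHeader = true, fields = ('id, 'price, 'demand)),
      List((1, 10.0, 20.0), (2, 20.0, 40.0), (3, 30.0, 60.0)))
    // perfectly linear test data must yield a correlation of 1.0
    .sink[Double](Tsv("output.tsv")) { buffer =>
      assert(math.abs(buffer.head - 1.0) < 1e-6)
    }
    .run
    .finish
}
```

`JobTest` replaces every source and sink with in-memory collections, so the transformation logic can be verified without touching HDFS.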
Writing results to HBase
In case we’d like to store our data in a database (at SequenceIQ we use HBase), we can use a special Cascading Tap for it. In this example we used SpyGlass to store the correlation results in HBase.
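A sketch of persisting the result through SpyGlass’s `HBaseSource` might look like this; the table name, column family, field names and ZooKeeper quorum are placeholders, and the exact `HBaseSource` constructor arguments may differ between SpyGlass versions:

```scala
import com.twitter.scalding._
import parallelai.spyglass.hbase.{HBaseSource, HBasePipeConversions}

class PearsonToHBaseJob(args: Args) extends Job(args) with HBasePipeConversions {
  Tsv(args("input"), ('key, 'correlation))
    .read
    // SpyGlass expects the row key and values as ImmutableBytesWritable
    .toBytesWritable(List('key, 'correlation))
    .write(new HBaseSource(
      "correlation",            // HBase table
      args("quorum"),           // ZooKeeper quorum, e.g. localhost:2181
      'key,                     // row key field
      List("data"),             // column families
      List(List('correlation))  // value fields per family
    ))
}
```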
Build the application
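Assuming the project is built with sbt and the sbt-assembly plugin (an assumption; the original project may use a different build tool), a fat jar containing the job and its Scalding dependencies can be produced with:

```shell
# build a single runnable jar with all dependencies bundled
sbt clean assembly
```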
Running the example and persisting to HBase
To run the example and persist the results, execute the following command (you can use --hdfs instead of --local):
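A possible invocation, again assuming the hypothetical class and jar names used earlier in this post:

```shell
# --local for local mode, --hdfs to run on the cluster; --quorum points at ZooKeeper
hadoop jar scalding-correlation.jar com.twitter.scalding.Tool PearsonToHBaseJob \
  --local \
  --input data/correlation.tsv --quorum localhost:2181
```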
We hope this correlation example and introduction to Scalding was useful – you can get the example project from our GitHub repository.