In one of our previous posts we showed how easy it is to extend the Kite Morphlines framework with custom commands. In this post we are going to use that extension to remove columns from a dataset, demonstrating how Morphlines can be embedded in MapReduce jobs. Download the MovieLens + IMDb/Rotten Tomatoes dataset from GroupLens and extract it; it contains a file called user_ratedmovies.dat. This is a simple tab-separated file, and we are going to use the column names given in its first line (the header): userID, movieID, rating, date_day, date_month, date_year, date_hour, date_minute and date_second.
Let’s pretend that we don’t need all the data from the file and remove the last three columns (date_hour, date_minute, date_second). We can achieve this with a chain of just two commands: readCSV to parse each line into a record, followed by a command that removes the unwanted fields.
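The original listing is not reproduced here, but a morphline configuration along these lines would do the job. It is a sketch: the stock removeFields command stands in for the custom command from our previous post, and the id and file layout are illustrative, not verbatim.

```
morphlines : [
  {
    id : removeColumns
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        readCSV {
          separator : "\t"
          columns : [userID, movieID, rating, date_day, date_month, date_year,
                     date_hour, date_minute, date_second]
          ignoreFirstLine : true
          charset : UTF-8
        }
      }
      {
        # drop the three columns we do not need
        removeFields {
          blacklist : ["literal:date_hour", "literal:date_minute", "literal:date_second"]
        }
      }
    ]
  }
]
```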
Now let’s create our map-only job to process the data. What we need to do is build the Morphlines command chain by parsing the configuration file, typically in the mapper’s setup method, and then pass each input line through the chain as a Morphline record in the map method.
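Putting this together, the mapper might look roughly like the following sketch. It assumes the Kite Morphlines API (org.kitesdk.morphline.*); the file name morphline.conf, the morphline id removeColumns and the class names are illustrative, not taken from the original listing.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;

public class MorphlineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private Command morphline;

  @Override
  protected void setup(Context context) throws IOException {
    MorphlineContext morphlineContext = new MorphlineContext.Builder().build();
    // compile the command chain; the RecordEmitter is passed as finalChild
    // so the transformed records are not silently dropped
    morphline = new Compiler().compile(
        new File("morphline.conf"), "removeColumns",
        morphlineContext, new RecordEmitter(context));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // wrap the raw input line in a record, as readCSV expects it
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, value.toString().getBytes(StandardCharsets.UTF_8));
    if (!morphline.process(record)) {
      // in this sketch a failed record is simply logged and skipped
      System.err.println("Morphline failed to process record: " + record);
    }
  }
}
```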
Notice that the compile method takes an important parameter called finalChild, which in our example is the RecordEmitter. The returned command chain feeds its output records into finalChild; if this parameter is not provided, a DropRecord command is assigned automatically and the transformed records are silently discarded. In Apache Flume, for example, a Collector command is used as the final child to avoid losing any transformed record.
The only thing left is to emit the processed records and write the results to HDFS; the RecordEmitter, a small Command implementation used as the final child, serves this purpose.
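Since the original listing is not shown here, the following sketch illustrates what such a RecordEmitter can look like. The tab-separated output format and the hard-coded column names are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineRuntimeException;
import org.kitesdk.morphline.api.Record;

// Final child of the morphline chain: writes each transformed record to HDFS
// through the Hadoop mapper context (a sketch, not the original code).
final class RecordEmitter implements Command {

  // the columns that survive the morphline transformation (assumed order)
  private static final List<String> COLUMNS = Arrays.asList(
      "userID", "movieID", "rating", "date_day", "date_month", "date_year");

  private final Mapper<LongWritable, Text, Text, NullWritable>.Context context;

  RecordEmitter(Mapper<LongWritable, Text, Text, NullWritable>.Context context) {
    this.context = context;
  }

  @Override
  public void notify(Record notification) {
    // lifecycle notifications are not needed here
  }

  @Override
  public Command getParent() {
    return null; // this is the last command in the chain
  }

  @Override
  public boolean process(Record record) {
    // join the remaining field values back into a tab-separated line
    StringBuilder line = new StringBuilder();
    for (String column : COLUMNS) {
      if (line.length() > 0) {
        line.append('\t');
      }
      line.append(record.getFirstValue(column));
    }
    try {
      context.write(new Text(line.toString()), NullWritable.get());
    } catch (Exception e) {
      throw new MorphlineRuntimeException(e);
    }
    return true;
  }
}
```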
By default the readCSV command extracts the fields of the ATTACHMENT_BODY into record fields named after the ids given in its columns parameter, so each transformed record contains only the remaining six fields (userID, movieID, rating, date_day, date_month, date_year); the last three columns were dropped.
The source code is available in our samples repository on GitHub. It is just a simple example, but you can go further: download a much bigger dataset with 10 million lines and process it on multiple nodes to see how it scales.