Reading an lzo compressed file with Spark

Had to jump through some serious hoops to get this one – of course, in hindsight everything is obvious, so I am documenting what I did here. With some luck, my future experience will be in hindsight (and I don’t care how paradoxical what I just said is, so drop it :-)).

Our RAM files are compressed with LZO and now live in hdfs:///source/ram. The configuration for Hadoop to read LZO files was set up through Cloudera Manager, in the HDFS section. I had read somewhere that to read LZO-compressed files you had to invoke some variation of this incantation:

import com.hadoop.mapreduce._ // for LzoTextInputFormat
import org.apache.hadoop.io._ // for LongWritable and Text
val fileName = "/source/ram/deduped-2013-05-31.csv.lzo"
sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text]).count
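
For reference, newAPIHadoopFile hands you an RDD of (LongWritable, Text) pairs, i.e. (byte offset, line), so to work with the actual CSV lines you still have to pull the value out and copy it to a String (Hadoop reuses the Text objects under the hood). A minimal sketch, with variable names of my own choosing:

val records = sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
// The value is a reused Text object, so convert it to an immutable String before doing anything else
val lines = records.map(_._2.toString)
lines.take(5).foreach(println)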

So I got to work with my unfounded optimism: first I ran spark-shell (spark-shell --master local) and went for it. However, this failed with an error: “Exception failure on localhost: RuntimeException: native-lzo library not available”. Turns out that our little comedian hadoop-lzo “seems to require its native code component, unlike Hadoop which can run non-native if it can’t find native. So you’ll need to add hadoop-lzo’s native component to the library path too.” AH. So how about we do this:

spark-shell --master local --jars /opt/hadoopgpl/lib/hadoop-lzo.jar --driver-library-path /opt/hadoopgpl/native/Linux-amd64-64/
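
(An aside: here is a quick sanity check that can be run from inside that shell to see whether both the jar and the native bits are actually visible. I am reciting the hadoop-lzo class names from memory, so treat this as a sketch:)

// If the first line throws ClassNotFoundException, the jar is not on the classpath;
// if the second returns false, the native lzo library did not load.
Class.forName("com.hadoop.compression.lzo.LzoCodec")
com.hadoop.compression.lzo.GPLNativeCodeLoader.isNativeCodeLoaded()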

This time things work, and after 35 seconds (yeah, I know, looks bad for my future), I get the answer: there are 42,591,706 records for that day.

UPDATE: several days later I tried to run spark-shell with the same parameters BUT without restricting it to the local machine (i.e., I got rid of the --master local). Turns out it doesn’t work. I have no clue why – I checked that Cloudera Manager properly installed the packages on all data nodes, and it did, so the config should hold for all of them. The workaround I found for this problem is to run spark-shell locally, export the tables I needed to HDFS, and then run spark-shell on the cluster for processing.
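
For the record, the export step of that workaround looks roughly like this (a sketch only, reusing the imports from the first snippet above; the output path is made up for illustration):

// Run in the *local* spark-shell, started with the hadoop-lzo flags shown earlier
val fileName = "/source/ram/deduped-2013-05-31.csv.lzo"
val records = sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
// Save back to HDFS as plain, uncompressed text so the cluster spark-shell can later
// read it with a simple sc.textFile, without needing the native lzo library at all
records.map(_._2.toString).saveAsTextFile("/source/ram/deduped-2013-05-31-txt")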

 

cluster-lab: the beginnings

I am starting to interact with the newly set up cluster for data science here at YellowPages. Being a complete neophyte to all of these installation adventures, I will regularly post here a bunch of stuff that I am prone to forget – hopefully in a structured way.

Here is the address for Cloudera Manager, already installed: http://10.32.0.32:7180/cmf/home. The cluster had CDH 5.0.3 installed; since we want to use Spark SQL, which ships with CDH 5.1+ (see this message from Databricks or this notification from Cloudera), the first thing I did was to upgrade the CDH version. This was easy… once I discovered how (basically, make use of the parcels in Cloudera Manager: look for the notifications in your status bar, and install/deploy the parcel for CDH 5.1); if you don’t have Cloudera Manager, you can do it by hand, this way.

Now, I have to confess I don’t know how to find out, from inside Cloudera Manager, the version of each of the clients installed on the data nodes. For example, I wanted to know what version of Spark was deployed, and I couldn’t find out how… I am sure it is possible, though! Anyway, what I did was to log onto one of the data nodes (look for their addresses here), run spark-shell, and look at the version information it prints. I saw a 1.0 going on, so we are in business (since the Spark SQL alpha comes with Spark 1.0). Just to be sure, I ran a Spark SQL example, and everything went like clockwork. Good stuff.
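
The example was along the lines of the stock snippet from the Spark 1.0 SQL programming guide (reproduced here from memory, with a hypothetical /tmp/people.txt of “name,age” lines; not necessarily the exact thing I typed):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit conversion from an RDD of case classes to a SchemaRDD

case class Person(name: String, age: Int)

// /tmp/people.txt is assumed to contain lines like "Michael,29"
val people = sc.textFile("/tmp/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)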