Reading an lzo compressed file with Spark

I had to jump through some serious hoops to get this one working – of course, in hindsight everything is obvious, so I am documenting what I did here. With some luck, my future self will get to enjoy that hindsight (and I don’t care how paradoxical that sounds, so drop it :-)).

Our RAM files are compressed with lzo and now live in hdfs:///source/ram. The Hadoop configuration for reading LZO files was set up through Cloudera Manager, in the HDFS section. I had read somewhere that to read LZO compressed files you had to invoke some variation of this incantation:

import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo
import org.apache.hadoop.io.{LongWritable, Text}

val fileName = "/source/ram/deduped-2013-05-31.csv.lzo"
sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text]).count
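
By the way, newAPIHadoopFile gives you back pairs of (LongWritable, Text) – the byte offset and the line – so if you want the actual lines rather than just a count, you can map the values to plain strings. A minimal sketch along those lines:

// Sketch: pull the line contents out of the (LongWritable, Text) pairs;
// the key (byte offset) is usually not interesting, so we drop it.
val lines = sc.newAPIHadoopFile(fileName,
    classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) => text.toString }

lines.take(5).foreach(println)   // peek at a few records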

So I got to work with my unfounded optimism: first I ran spark-shell (spark-shell --master local) and went for it. However, this failed with the error “Exception failure on localhost: RuntimeException: native-lzo library not available”. Turns out our friend hadoop-lzo “seems to require its native code component, unlike Hadoop which can run non-native if it can’t find native. So you’ll need to add hadoop-lzo’s native component to the library path too.” Ah. So what if we do this:

spark-shell --master local --jars /opt/hadoopgpl/lib/hadoop-lzo.jar --driver-library-path /opt/hadoopgpl/native/Linux-amd64-64/
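
If you don’t want to type those flags every time, the equivalent settings should also be expressible in spark-defaults.conf – a sketch, assuming a Spark version that supports these properties and the same install paths as above:

# Sketch of spark-defaults.conf entries (same paths as the command above)
spark.jars                        /opt/hadoopgpl/lib/hadoop-lzo.jar
spark.driver.extraLibraryPath     /opt/hadoopgpl/native/Linux-amd64-64/
spark.executor.extraLibraryPath   /opt/hadoopgpl/native/Linux-amd64-64/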

This time things worked, and after 35 seconds (yeah, I know, that doesn’t look great for my future) I got the answer: there are 42,591,706 records for that day.

UPDATE: several days later I tried to run spark-shell with the same parameters BUT without restricting it to the local machine (i.e., I got rid of the --master local). Turns out it doesn’t work. I have no clue why – I checked that Cloudera Manager properly installed the packages on all the data nodes, and it did, so the config should hold for all of them. The workaround I found was to run spark-shell locally, export the tables I needed to HDFS, and then run spark-shell on the cluster for the processing.
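
For the record, the export step looks roughly like this – a sketch, with a made-up output path: decompress locally, where the native library is on the path, write plain text back to HDFS, and let the cluster jobs read that instead:

// Sketch of the local export step (output path invented for illustration).
// Run this in the local spark-shell started with the flags above.
val raw = sc.newAPIHadoopFile(fileName,
    classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
raw.map { case (_, text) => text.toString }
   .saveAsTextFile("/source/ram/exported/deduped-2013-05-31")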

 
