Had to jump through some serious hoops to get this one – of course, in hindsight everything is obvious, so I am documenting what I did here. With some luck, my future self will read this and it will all already be in hindsight (and I don’t care how paradoxical what I just said is, so drop it :-)).
Our RAM files are compressed with LZO and now live in hdfs:///source/ram. Configuration for Hadoop to read LZO files was set up through Cloudera Manager, in the HDFS section. I had read somewhere that to read LZO-compressed files you had to invoke some variation of this incantation:
```scala
import com.hadoop.mapreduce._ // for LzoTextInputFormat
import org.apache.hadoop.io._ // for LongWritable and Text

val fileName = "/source/ram/deduped-2013-05-31.csv.lzo"
sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat],
  classOf[LongWritable], classOf[Text]).count
```
So I got to work with my unfounded optimism: first I ran spark-shell (spark-shell --master local) and went for it. However, this failed with an error: “Exception failure on localhost: RuntimeException: native-lzo library not available”. Turns out that our little friend hadoop-lzo “seems to require its native code component, unlike Hadoop which can run non-native if it can’t find native. So you’ll need to add hadoop-lzo’s native component to the library path too.” AH. So what about we do this:
```shell
spark-shell --master local --jars /opt/hadoopgpl/lib/hadoop-lzo.jar --driver-library-path /opt/hadoopgpl/native/Linux-amd64-64/
```
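Incidentally, rather than typing those flags on every invocation, the same two paths should be settable once in spark-defaults.conf. A sketch, assuming a Spark version that understands these standard config keys, and reusing the CDH install paths from the command above:

```
spark.driver.extraClassPath      /opt/hadoopgpl/lib/hadoop-lzo.jar
spark.driver.extraLibraryPath    /opt/hadoopgpl/native/Linux-amd64-64/
spark.executor.extraClassPath    /opt/hadoopgpl/lib/hadoop-lzo.jar
spark.executor.extraLibraryPath  /opt/hadoopgpl/native/Linux-amd64-64/
```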
This time things work, and after 35 seconds (yeah, I know, looks bad for my future) I get the answer: there are 42,591,706 records for that day.
UPDATE: several days later I tried to run spark-shell with the same parameters BUT without restricting it to the local machine (i.e., I got rid of the <code>--master local</code> part). Turns out it doesn’t work. I have no clue why – I checked that Cloudera Manager properly installed the packages on all data nodes, and it did, so the config should hold for all of them. The workaround I found for this problem is to run spark-shell locally, export the tables I needed to HDFS, and then run spark-shell on the cluster for the processing.
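For the record, the export step of that workaround is just a decompress-and-rewrite. A sketch of what I mean, run in the local spark-shell where the native codec works (the input path is the one from above; the output directory name is my own invention):

```scala
import com.hadoop.mapreduce._ // for LzoTextInputFormat, from hadoop-lzo
import org.apache.hadoop.io._ // for LongWritable and Text

// Read the LZO file with the native codec, then write it back to HDFS
// as plain text, so the cluster-wide spark-shell can read it without
// needing the native-lzo library on the executors.
val in  = "/source/ram/deduped-2013-05-31.csv.lzo"
val out = "/source/ram/deduped-2013-05-31.txt" // hypothetical output dir
sc.newAPIHadoopFile(in, classOf[LzoTextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map(_._2.toString) // keep only the line text, drop the byte offset
  .saveAsTextFile(out)
```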
Some stuff I found useful to solve this problem: hadoop-gpl package (https://code.google.com/p/hadoop-gpl-packing/downloads/detail?name=hadoop-gpl-packaging-0.6.1-1.x86_64.rpm&can=2&q=), hadoop-lzo on GitHub (https://github.com/kevinweil/hadoop-lzo) and this comment (http://grokbase.com/t/cloudera/cdh-user/144bv47zb0/cdh5-0-spark-shell-cannot-work-when-enable-lzo-in-core-site-xml)
Somehow related: Geraud just pointed me to these documentation resources: https://wiki.ypg.com/pages/viewpage.action?title=Advertiser+Analytics+ETL+-+Design&spaceKey=CAA and in particular https://wiki.ypg.com/display/CAA/RawActivityLzoIndexDailyJob