Reading an lzo compressed file with Spark

Had to jump through some serious hoops to get this one – of course, in hindsight everything is obvious, so I am documenting what I did here. With some luck, my future experience will be in hindsight (and I don’t care how paradoxical what I just said is, so drop it :-)).

Our RAM files are compressed with LZO and now live in hdfs:///source/ram. The configuration for Hadoop to read LZO files was set up through Cloudera Manager, in the HDFS section. I had read somewhere that to read LZO-compressed files you had to invoke some variation of this incantation:

import com.hadoop.mapreduce._ // for LzoTextInputFormat
import org.apache.hadoop.io._ // for LongWritable and Text
val fileName = "/source/ram/deduped-2013-05-31.csv.lzo"
sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text]).count
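
For reference, newAPIHadoopFile hands you an RDD of (LongWritable, Text) pairs, i.e. (byte offset, line), so to work with the actual CSV lines you still have to pull the value out and copy it to a String (Hadoop reuses the Text objects under the hood). A minimal sketch, with variable names of my own choosing:

val records = sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
// The value is a reused Text object, so convert it to an immutable String before doing anything else
val lines = records.map(_._2.toString)
lines.take(5).foreach(println)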

So I got to work with my unfounded optimism: first I ran spark-shell (spark-shell --master local) and went for it. However, this failed with an error: “Exception failure on localhost: RuntimeException: native-lzo library not available”. Turns out that our little comedian hadoop-lzo “seems to require its native code component, unlike Hadoop which can run non-native if it can’t find native. So you’ll need to add hadoop-lzo’s native component to the library path too.” AH. So how about we do this:

spark-shell --master local --jars /opt/hadoopgpl/lib/hadoop-lzo.jar --driver-library-path /opt/hadoopgpl/native/Linux-amd64-64/
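
(An aside: here is a quick sanity check that can be run from inside that shell to see whether both the jar and the native bits are actually visible. I am reciting the hadoop-lzo class names from memory, so treat this as a sketch:)

// If the first line throws ClassNotFoundException, the jar is not on the classpath;
// if the second returns false, the native lzo library did not load.
Class.forName("com.hadoop.compression.lzo.LzoCodec")
com.hadoop.compression.lzo.GPLNativeCodeLoader.isNativeCodeLoaded()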

This time things work, and after 35 seconds (yeah, I know, looks bad for my future), I get the answer: there are 42,591,706 records for that day.

UPDATE: several days later I tried to run spark-shell with the same parameters BUT without restricting it to the local machine (i.e., I got rid of the --master local). Turns out it doesn’t work. I have no clue why – I checked that Cloudera Manager properly installed the packages on all data nodes, and it did, so the config should hold for all of them. The workaround I found for this problem is to run spark-shell locally, export the tables I needed to HDFS, and then run spark-shell on the cluster for processing.
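
For the record, the export step of that workaround looks roughly like this (a sketch only, reusing the imports from the first snippet above; the output path is made up for illustration):

// Run in the *local* spark-shell, started with the hadoop-lzo flags shown earlier
val fileName = "/source/ram/deduped-2013-05-31.csv.lzo"
val records = sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
// Save back to HDFS as plain, uncompressed text so the cluster spark-shell can later
// read it with a simple sc.textFile, without needing the native lzo library at all
records.map(_._2.toString).saveAsTextFile("/source/ram/deduped-2013-05-31-txt")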

 

cluster-lab: the beginnings

I am starting to interact with the newly set up cluster for data science here at YellowPages. Being a complete neophyte to all of these installation adventures, I will regularly post here a bunch of stuff that I am prone to forget – hopefully in a structured way.

Here is the address for Cloudera Manager, already installed: http://10.32.0.32:7180/cmf/home. The cluster had CDH 5.0.3 installed; since we want to use Spark SQL, which ships with CDH 5.1+ (see this message from Databricks or this notification from Cloudera), the first thing I did was to upgrade the CDH version. This was easy… once I discovered how (basically, make use of the parcels in Cloudera Manager: look for the notifications in your status bar, and install/deploy the parcel for CDH 5.1); if you don’t have Cloudera Manager, you can do it by hand, this way.

Now, I have to confess I don’t know how to find out, from inside Cloudera Manager, the version of each of the clients installed on the data nodes. For example, I wanted to know what version of Spark was deployed, and I couldn’t find out how… I am sure it is possible, though! Anyway, what I did was to log onto one of the data nodes (look for their addresses here), run spark-shell, and look at the version information it prints. I saw a 1.0 going on, so we are in business (since the Spark SQL alpha comes with Spark 1.0). Just to be sure, I ran a Spark SQL example, and everything went like clockwork. Good stuff.
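
The example was along the lines of the stock snippet from the Spark 1.0 SQL programming guide (reproduced here from memory, with a hypothetical /tmp/people.txt of “name,age” lines; not necessarily the exact thing I typed):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit conversion from an RDD of case classes to a SchemaRDD

case class Person(name: String, age: Int)

// /tmp/people.txt is assumed to contain lines like "Michael,29"
val people = sc.textFile("/tmp/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)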