Reading an LZO compressed file with Spark

Had to jump through some serious hoops to get this one working – of course, in hindsight everything is obvious, so I am documenting what I did here. With some luck, my future experience will be in hindsight (and I don't care how paradoxical what I just said is, so drop it :-)).

Our RAM files are compressed with LZO, and now live in hdfs:///source/ram. The configuration for Hadoop to read LZO files was set up through Cloudera Manager, in the HDFS section. I had read somewhere that to read LZO-compressed files you had to invoke some variation of this incantation:

import com.hadoop.mapreduce._ // for LzoTextInputFormat (from hadoop-lzo)
import org.apache.hadoop.io._ // for LongWritable and Text
val fileName = "/source/ram/deduped-2013-05-31.csv.lzo"
sc.newAPIHadoopFile(fileName, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text]).count

So I got to work with my unfounded optimism: first I ran spark-shell (spark-shell --master local) and went for it. However, this failed with an error: "Exception failure on localhost: RuntimeException: native-lzo library not available". Turns out that hadoop-lzo, our little comedian, "seems to require its native code component, unlike Hadoop which can run non-native if it can't find native. So you'll need to add hadoop-lzo's native component to the library path too." Ah. So what if we do this:

spark-shell --master local --jars /opt/hadoopgpl/lib/hadoop-lzo.jar --driver-library-path /opt/hadoopgpl/native/Linux-amd64-64/

This time things work, and after 35 seconds (yeah, I know, looks bad for my future), I get the answer: there are 42,591,706 records for that day.

UPDATE: several days later I tried to run spark-shell with the same parameters BUT without restricting it to the local machine (i.e., I got rid of the --master local part). Turns out it doesn't work. I have no clue why – I checked that Cloudera Manager properly installed the packages on all data nodes, and it did, so the config should hold for all of them. The workaround I found for this problem is to run spark-shell locally, export the tables I needed to HDFS, and then run spark-shell on the cluster for processing.
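For the record, here is a minimal sketch of that workaround (the output path is hypothetical); run it in the local spark-shell started with the hadoop-lzo jar and native library path as above:

import com.hadoop.mapreduce._ // LzoTextInputFormat, from hadoop-lzo
import org.apache.hadoop.io._ // LongWritable, Text

val in = "/source/ram/deduped-2013-05-31.csv.lzo"    // the LZO source
val out = "/source/ram/deduped-2013-05-31.csv-plain" // hypothetical output directory

// Read the LZO file once, locally (where the native libraries work), and
// write plain text back to HDFS, so the cluster-wide shell can read it
// later without needing native-lzo.
sc.newAPIHadoopFile(in, classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
  .map(_._2.toString)
  .saveAsTextFile(out)

After that, a plain sc.textFile(out) works from any node, native libraries or not.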

 

Starting a Data Science team

I started a month ago here, at Yellow Pages, because there is a clear intention to start doing interesting projects with all the data we are collecting from our clients and advertisers. A lot can be said about that idea (and I probably will, later on), but for now I'd like to mention that at least 2 challenges are obvious in our case:

  1. What are the specific problems we will try to solve? This is an important challenge, as it defines the way the different layers in the company talk to each other.
  2. Who is involved in these efforts? In other words, is there some kind of data science team defined?

In this post I will start the conversation concerning point #2 – and I am actually wondering if the question shouldn't be rephrased as "what would I, Luis, need in order to start a data science team?"

People interested/trained in getting their hands dirty in "data exploration"

This is a pretty broad profile, but I'd say: software programmers that can speak the lingo (i.e., have, at least, heard of):

Definition of some commonalities in environment(s) to use

  1. Hardware + basic layers of development. Personal preference: Hadoop cluster + BDAS. We have one such development cluster in the earliest stages of installation at CAA’s.
  2. Programming language. Because I prefer BDAS, Scala. But we should for sure be able to explore data in R, and use Plotly to plot/share our stuff. And others!
  3. Code sharing environment. What about GitHub? I am here!

Etc.:

  • Some kind of chat system. I've used Grove.io; it works OK. Anything works, though! The important thing here is to have a way for members of the team to communicate with each other.
  • Some kind of way to share papers and longer thoughts. I've heard of Crowdbase. I've used simple stuff, like Google Docs. Again, anything works, as long as it is used consistently by the members of the team.

 

Projects to work on!!

Kind of important, this one 😉

“So, what does a Data Scientist do?”

I get asked this question fairly often (probably because somebody said something about data scientists' JOBS being sexy), which in turn forces me to think about what I do every day in front of my computer. I have lately stumbled upon 2 people who have gotten serious about becoming data scientists, and who have mapped out their next moves. Here I re-post their initiatives, and I will be sure to check them out for myself. Enjoy!

But beware: it might be harder than you think

cluster-lab: the beginnings

I am starting to interact with the newly set up cluster for data science here at YellowPages. Being a complete neophyte to all of these installation adventures, I will regularly post here a bunch of stuff that I am prone to forget – hopefully in a structured way.

Here is the address for Cloudera Manager, already installed: http://10.32.0.32:7180/cmf/home. The cluster had CDH 5.0.3 installed; since we want to use Spark SQL, which ships with CDH 5.1+ (see this message from Databricks or this notification from Cloudera), the first thing I did was to upgrade the CDH version. This was easy… once I discovered how (basically, make use of the parcels in Cloudera Manager: look for the notifications in your status bar, and install/deploy the parcel for CDH 5.1); if you don't have Cloudera Manager, you can do it by hand, this way.

Now, I have to confess I don't know how to find out, from inside Cloudera Manager, the version of each of the components installed on the data nodes. For example, I wanted to know which version of Spark was deployed, and I couldn't find how… I am sure it is possible, though! Anyway, what I did was to go onto one of the data nodes (look for their addresses here), run spark-shell, and look for the version information (it is printed in the startup banner). I saw a 1.0 going on, so we are in business (since the Spark SQL alpha comes with Spark 1.0). Just to be sure, I ran a Spark SQL example, and everything went like clockwork. Good stuff.
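For reference, here is a minimal sketch of the kind of smoke test I mean, modeled on the Spark SQL example in the Spark 1.0 docs; the toy dataset and table name are made up:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion (Spark 1.0)

// A made-up toy dataset, just to check that SQL queries run end to end.
val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 12)))
people.registerAsTable("people") // Spark 1.0 name; renamed in later versions

val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)

If that prints [Alice] and nothing blows up, Spark SQL is wired up correctly.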