I’ve been reading through several pages trying to figure out how to set up my spark-ec2 cluster to read LZO-compressed files from S3:
- http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E
- http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E
- https://github.com/twitter/hadoop-lzo
- http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

It seems that several things may have changed since those pages were put together, so getting this working is more involved than I expected. Is there a simple, current set of instructions one can follow to get a Spark EC2 cluster reading LZO-compressed input files correctly?

Nick

On Sun, Jul 6, 2014 at 10:55 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> Ah, indeed it looks like I need to install this separately
> <https://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1>
> as it is not part of the core.
>
> Nick
>
> On Sun, Jul 6, 2014 at 2:22 AM, Gurvinder Singh
> <gurvinder.si...@uninett.no> wrote:
>
>> On 07/06/2014 05:19 AM, Nicholas Chammas wrote:
>> > On Fri, Jul 4, 2014 at 3:33 PM, Gurvinder Singh
>> > <gurvinder.si...@uninett.no> wrote:
>> >
>> >     csv = sc.newAPIHadoopFile(opts.input,
>> >         "com.hadoop.mapreduce.LzoTextInputFormat",
>> >         "org.apache.hadoop.io.LongWritable",
>> >         "org.apache.hadoop.io.Text").count()
>> >
>> > Does anyone know what the rough equivalent of this would be in the
>> > Scala API?
>>
>> I am not sure; I haven't tested it using Scala. The
>> com.hadoop.mapreduce.LzoTextInputFormat class is from this package:
>> https://github.com/twitter/hadoop-lzo
>>
>> I installed it from the Cloudera "hadoop-lzo" package, along with the
>> liblzo2-2 Debian package, on all of my workers. Make sure you have
>> hadoop-lzo.jar in your classpath for Spark.
>>
>> - Gurvinder
>>
>> > I am trying the following, but the first import yields an error on my
>> > spark-ec2 cluster:
>> >
>> >     import com.hadoop.mapreduce.LzoTextInputFormat
>> >     import org.apache.hadoop.io.LongWritable
>> >     import org.apache.hadoop.io.Text
>> >
>> >     sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
>> >         LzoTextInputFormat, LongWritable, Text)
>> >
>> >     scala> import com.hadoop.mapreduce.LzoTextInputFormat
>> >     <console>:12: error: object hadoop is not a member of package com
>> >            import com.hadoop.mapreduce.LzoTextInputFormat
>> >
>> > Nick
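
For anyone following along: a rough Scala equivalent of Gurvinder's PySpark call might look like the untested sketch below. It assumes the twitter/hadoop-lzo jar is already on the driver and executor classpaths, which is exactly what the "object hadoop is not a member of package com" error above says is missing. Unlike the PySpark wrapper, the Scala method takes Class objects rather than class-name strings, hence the classOf[...] arguments:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // In the Scala API the argument order is: path, InputFormat class,
    // key class, value class.
    val records = sc.newAPIHadoopFile(
      "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    records.count()

records here is an RDD of (LongWritable, Text) pairs, so something like records.map(_._2.toString) would give you plain line strings.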
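
On Gurvinder's classpath point, one sketch of how to expose the jar to the executors is via SparkConf; note the jar path below is only an assumed install location, so check where your hadoop-lzo package actually put it. The driver JVM needs the jar at launch time instead, e.g. via SPARK_CLASSPATH in spark-env.sh, since it is already running by the time user code sets properties:

    import org.apache.spark.{SparkConf, SparkContext}

    // NOTE: /usr/lib/hadoop/lib/hadoop-lzo.jar is an assumed path;
    // adjust it to wherever the jar is installed on your workers.
    val conf = new SparkConf()
      .setAppName("lzo-reader")
      // Prepended to each executor's classpath (Spark 1.0+).
      .set("spark.executor.extraClassPath", "/usr/lib/hadoop/lib/hadoop-lzo.jar")

    val sc = new SparkContext(conf)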