Hi Stephen,
     I tried the same LZO file with a simple Hadoop streaming script, and
this seems to work fine:

HADOOP_HOME=/usr/lib/hadoop
/usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
  -input /tmp/ldpc.sstv3.lzo \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -output wc_test \
  -mapper 'cat' \
  -reducer 'wc -l'

This means Hadoop is able to handle the LZO file correctly.

Can you suggest what I should do in Spark to make this work?

Regards,
Rajeev


Rajeev Srivastava
Silverline Design Inc
2118 Walsh ave, suite 204
Santa Clara, CA, 95050
cell : 408-409-0940


On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <[email protected]> wrote:

>
> > System.setProperty("spark.io.compression.codec",
> > "com.hadoop.compression.lzo.LzopCodec")
>
> This spark.io.compression.codec is a completely different setting from the
> codecs used for reading and writing HDFS files. (It is for compressing
> Spark's internal/non-HDFS intermediate output.)
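>
> For what it's worth, that property expects one of Spark's own codec classes,
> not a Hadoop codec like LzopCodec. A minimal sketch (the class name is the
> Spark 0.8 default, so treat it as an assumption):
>
>   // Compresses Spark-internal data (shuffle output, broadcast variables);
>   // it has no effect on how HDFS input files such as .lzo are decoded.
>   System.setProperty("spark.io.compression.codec",
>     "org.apache.spark.io.LZFCompressionCodec")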
>
> > Hope this helps and someone can help read a LZO file
>
> Spark just uses the regular Hadoop FileSystem API, so any issues with reading
> LZO files would be Hadoop issues. I would search the Hadoop issue tracker and
> look for information on using LZO files with Hadoop/Hive; whatever works for
> them should magically work for Spark as well.
>
> This looks like a good place to start:
>
> https://github.com/twitter/hadoop-lzo
>
> IANAE, but I would try passing one of these:
>
>
> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>
> to the SparkContext.hadoopFile method.
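>
> Untested sketch: note that the class linked above lives in
> com.hadoop.mapreduce, i.e. the new Hadoop API, so in Spark it would go
> through SparkContext.newAPIHadoopFile; the
> com.hadoop.mapred.DeprecatedLzoTextInputFormat from your streaming job is
> the old-API one that fits hadoopFile. Something like:
>
>   import com.hadoop.mapreduce.LzoTextInputFormat
>   import org.apache.hadoop.io.{LongWritable, Text}
>
>   // Same file as in your streaming test; keys are byte offsets into the
>   // file, values are the lines of text.
>   val records = sc.newAPIHadoopFile(
>     "/tmp/ldpc.sstv3.lzo",
>     classOf[LzoTextInputFormat],
>     classOf[LongWritable],
>     classOf[Text],
>     sc.hadoopConfiguration)
>
>   // Hadoop reuses Writable objects, so copy each Text to a String.
>   val lines = records.map(pair => pair._2.toString)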
>
> - Stephen
>
>
