Re: reading LZO compressed file in spark

Andrew Ash Mon, 16 Dec 2013 11:25:29 -0800

Hi Rajeev,

It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat
input format above, while Stephen referred to
com.hadoop.mapreduce.LzoTextInputFormat.


I think the way to use this in Spark would be to call
SparkContext.hadoopFile() (for the old-API com.hadoop.mapred class) or
SparkContext.newAPIHadoopFile() (for the new-API com.hadoop.mapreduce class),
passing the path and the InputFormat as parameters.  Can you give those a shot?
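
Here is a rough, untested sketch of both variants in Scala. It assumes the
hadoop-lzo jar and its native libraries are available to Spark on the driver
and the workers, and it reuses the /tmp/ldpc.sstv3.lzo path from your streaming
job. The "local" master, the "read-lzo" app name and the ReadLzo object name
are placeholders of mine, not anything from your setup.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext

// Both classes come from the hadoop-lzo project (assumed to be on the classpath).
import com.hadoop.mapred.DeprecatedLzoTextInputFormat
import com.hadoop.mapreduce.LzoTextInputFormat

object ReadLzo {
  def main(args: Array[String]): Unit = {
    // Placeholder master and app name; adjust for your cluster.
    val sc = new SparkContext("local", "read-lzo")
    val path = "/tmp/ldpc.sstv3.lzo"

    // New Hadoop API (org.apache.hadoop.mapreduce): use newAPIHadoopFile.
    val viaNewApi = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
      path,
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Old Hadoop API (org.apache.hadoop.mapred): use hadoopFile.
    val viaOldApi = sc.hadoopFile[LongWritable, Text](
      path,
      classOf[DeprecatedLzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // Keys are byte offsets and values are lines. Hadoop reuses Writable
    // objects, so copy each value to a String before doing anything else.
    val linesNew = viaNewApi.map { case (_, text) => text.toString }
    val linesOld = viaOldApi.map { case (_, text) => text.toString }

    println("lines via new API: " + linesNew.count())
    println("lines via old API: " + linesOld.count())

    sc.stop()
  }
}
```

If both counts match the 'wc -l' result from the streaming job, the input
format is being picked up correctly.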

Andrew


On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <[email protected]> wrote:

> Hi Stephen,
> I tried the same LZO file with a simple Hadoop streaming job, and it
> seems to work fine:
>
> HADOOP_HOME=/usr/lib/hadoop
> /usr/bin/hadoop jar \
> /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -libjars \
> /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
> -input /tmp/ldpc.sstv3.lzo \
> -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
> -output wc_test \
> -mapper 'cat' \
> -reducer 'wc -l'
>
> This means Hadoop is able to handle the LZO file correctly.
>
> Can you suggest what I should do in Spark for it to work?
>
> regards
> Rajeev
>
>
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh ave, suite 204
> Santa Clara, CA, 95050
> cell : 408-409-0940
>
>
> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <[email protected]> wrote:
>
>>
>> > System.setProperty("spark.io.compression.codec",
>> > "com.hadoop.compression.lzo.LzopCodec")
>>
>> This spark.io.compression.codec is a completely different setting than the
>> codecs that are used for reading/writing from HDFS. (It is for compressing
>> Spark's internal/non-HDFS intermediate output.)
>>
>> > Hope this helps and someone can help read a LZO file
>>
>> Spark just uses the regular Hadoop File System API, so any issues with
>> reading LZO files would be Hadoop issues. I would search in the Hadoop
>> issue tracker, and look for information on using LZO files with
>> Hadoop/Hive; whatever works for them should magically work for Spark
>> as well.
>>
>> This looks like a good place to start:
>>
>> https://github.com/twitter/hadoop-lzo
>>
>> IANAE, but I would try passing one of these:
>>
>>
>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>
>> To the SparkContext.hadoopFile method.
>>
>> - Stephen
>>
>>
>
