Thanks for your suggestion. I will try this and update by late evening.

regards
Rajeev
Rajeev Srivastava
Silverline Design Inc
2118 Walsh Ave, Suite 204
Santa Clara, CA 95050
cell: 408-409-0940

On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <[email protected]> wrote:

> Hi Rajeev,
>
> It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat
> input format above, while Stephen referred to
> com.hadoop.mapreduce.LzoTextInputFormat.
>
> I think the way to use this in Spark would be to use the
> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
> the path and the InputFormat as parameters. Can you give those a shot?
>
> Andrew
>
>
> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava
> <[email protected]> wrote:
>
>> Hi Stephen,
>> I tried the same LZO file with a simple Hadoop streaming script, and it
>> seems to work fine:
>>
>> HADOOP_HOME=/usr/lib/hadoop
>> /usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>   -input /tmp/ldpc.sstv3.lzo \
>>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>   -output wc_test \
>>   -mapper 'cat' \
>>   -reducer 'wc -l'
>>
>> This means Hadoop is able to handle the LZO file correctly.
>>
>> Can you suggest what I should do in Spark to make it work?
>>
>> regards
>> Rajeev
>>
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh Ave, Suite 204
>> Santa Clara, CA 95050
>> cell: 408-409-0940
>>
>>
>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman
>> <[email protected]> wrote:
>>
>>> > System.setProperty("spark.io.compression.codec",
>>> >   "com.hadoop.compression.lzo.LzopCodec")
>>>
>>> This spark.io.compression.codec is a completely different setting than
>>> the codecs that are used for reading/writing from HDFS. (It is for
>>> compressing Spark's internal/non-HDFS intermediate output.)
>>>
>>> > Hope this helps and someone can help read a LZO file
>>>
>>> Spark just uses the regular Hadoop File System API, so any issues with
>>> reading LZO files would be Hadoop issues. I would search the Hadoop issue
>>> tracker and look for information on using LZO files with Hadoop/Hive;
>>> whatever works for them should magically work for Spark as well.
>>>
>>> This looks like a good place to start:
>>>
>>> https://github.com/twitter/hadoop-lzo
>>>
>>> IANAE, but I would try passing one of these:
>>>
>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>
>>> to the SparkContext.hadoopFile method.
>>>
>>> - Stephen
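[Editor's note: for readers finding this thread later, the approach Andrew and Stephen suggest above might look roughly like the sketch below. This is untested and assumes a 2013-era Spark (0.8/0.9) API, that the hadoop-lzo jar and its native libraries are on the Spark classpath, and the example path /tmp/ldpc.sstv3.lzo from the thread.]

```scala
import org.apache.spark.SparkContext
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat

// Sketch only: hadoop-lzo (jar + native libs) must already be on the
// classpath, e.g. via SPARK_CLASSPATH or ADD_JARS.
val sc = new SparkContext("local", "lzo-test")

// LzoTextInputFormat lives in org.apache.hadoop.mapreduce (the "new" API),
// so it pairs with newAPIHadoopFile rather than hadoopFile.
val lines = sc.newAPIHadoopFile(
  "/tmp/ldpc.sstv3.lzo",   // LZO file from the streaming job above
  classOf[LzoTextInputFormat],
  classOf[LongWritable],   // key: byte offset of the line
  classOf[Text])           // value: the line contents

// Rough Spark equivalent of the `wc -l` streaming job in the thread.
println(lines.count())
```

DeprecatedLzoTextInputFormat (the com.hadoop.mapred one used in the streaming job) would instead go with sc.hadoopFile, since it implements the old mapred API.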
