Hi Berkeley,

By RF=3 I mean a replication factor of 3 on the files in HDFS, so each block is stored 3 times across the cluster. Three is a pretty standard choice for the replication factor because it gives the hardware team time to replace bad hardware when a node fails. With RF=3 the cluster can sustain the failure of any two nodes without data loss, but losing a third node may cause data loss.
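As a rough illustration of what that setting means in practice, a file's replication can be inspected or changed through the Hadoop FileSystem API. This is only a sketch -- the path is made up, and the cluster-wide default comes from dfs.replication in hdfs-site.xml:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: check and change the replication factor of one file.
// The path is hypothetical; new Configuration() picks up the cluster's
// hdfs-site.xml when it is on the classpath.
val fs = FileSystem.get(new Configuration())
val p  = new Path("/path/to/myfile.lzo")
println(fs.getFileStatus(p).getReplication)  // e.g. 3 when RF=3
fs.setReplication(p, 3.toShort)              // request 3 copies of each block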
When reading the LZO files with the newAPIHadoopFile() call I showed below, the data in the RDD is already decompressed -- it transparently looks the same to my Spark program as if I were operating on an uncompressed file.

Cheers,
Andrew

On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon <[email protected]> wrote:

> Andrew, This is great.
>
> Excuse my ignorance, but what do you mean by RF=3? Also, after reading the
> LZO files, are you able to access the contents directly, or do you have to
> decompress them after reading them?
>
> Sent from my iPhone
>
> On Dec 24, 2013, at 12:03 AM, Andrew Ash <[email protected]> wrote:
>
> Hi Rajeev,
>
> I'm not sure if you ever got it working, but I just got mine up and going.
> If you just use sc.textFile(...) the file will be read, but the LZO index
> won't be used, so a .count() on my 1B+ row file took 2483s. When I ran it
> like this, though:
>
>   sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>     classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>     classOf[org.apache.hadoop.io.LongWritable],
>     classOf[org.apache.hadoop.io.Text]).count
>
> the LZO index file was used and the .count() took just 101s. For
> reference, this file is 43GB when .gz compressed and 78.4GB when .lzo
> compressed. I have RF=3, and this is across 4 pretty beefy machines with
> Hadoop DataNodes and Spark both running on each machine.
>
> Cheers!
> Andrew
>
>
> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava <[email protected]> wrote:
>
>> Thanks for your suggestion. I will try this and update by late evening.
>>
>> regards
>> Rajeev
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell : 408-409-0940
>>
>>
>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <[email protected]> wrote:
>>
>>> Hi Rajeev,
>>>
>>> It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat
>>> input format above, while Stephen referred to
>>> com.hadoop.mapreduce.LzoTextInputFormat.
>>>
>>> I think the way to use this in Spark would be to use the
>>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
>>> the path and the InputFormat as parameters. Can you give those a shot?
>>>
>>> Andrew
>>>
>>>
>>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <[email protected]> wrote:
>>>
>>>> Hi Stephen,
>>>> I tried the same LZO file with a simple Hadoop script, and it seems to
>>>> work fine:
>>>>
>>>>   HADOOP_HOME=/usr/lib/hadoop
>>>>   /usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>>>     -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>>>     -input /tmp/ldpc.sstv3.lzo \
>>>>     -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>>     -output wc_test \
>>>>     -mapper 'cat' \
>>>>     -reducer 'wc -l'
>>>>
>>>> This means Hadoop is able to handle the LZO file correctly.
>>>>
>>>> Can you suggest what I should do in Spark to make it work?
>>>>
>>>> regards
>>>> Rajeev
>>>>
>>>>
>>>> Rajeev Srivastava
>>>> Silverline Design Inc
>>>> 2118 Walsh ave, suite 204
>>>> Santa Clara, CA, 95050
>>>> cell : 408-409-0940
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <[email protected]> wrote:
>>>>
>>>>> > System.setProperty("spark.io.compression.codec",
>>>>> >   "com.hadoop.compression.lzo.LzopCodec")
>>>>>
>>>>> This spark.io.compression.codec is a completely different setting than the
>>>>> codecs that are used for reading/writing from HDFS. (It is for compressing
>>>>> Spark's internal/non-HDFS intermediate output.)
>>>>>
>>>>> > Hope this helps and someone can help read a LZO file
>>>>>
>>>>> Spark just uses the regular Hadoop File System API, so any issues with
>>>>> reading LZO files would be Hadoop issues. I would search the Hadoop issue
>>>>> tracker and look for information on using LZO files with Hadoop/Hive;
>>>>> whatever works for them should magically work for Spark as well.
>>>>>
>>>>> This looks like a good place to start:
>>>>>
>>>>> https://github.com/twitter/hadoop-lzo
>>>>>
>>>>> IANAE, but I would try passing one of these:
>>>>>
>>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>>
>>>>> to the SparkContext.hadoopFile method.
>>>>>
>>>>> - Stephen
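For reference, the newAPIHadoopFile() call from the thread can be wrapped in a small helper that hides the Writable classes and returns plain Strings. This is a minimal sketch, not code from the thread -- the helper name lzoTextFile and the example path are made up, and it assumes the hadoop-lzo jar plus the native LZO libraries are available on every node. (The .lzo.index files it relies on are produced by hadoop-lzo's LzoIndexer/DistributedLzoIndexer tools, which the timings quoted above assume have already been run.)

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: read an .lzo text file (splitting on the .index
// file when one exists) and keep only the line contents, dropping the
// LongWritable byte offsets that the input format emits as keys.
def lzoTextFile(sc: SparkContext, path: String): RDD[String] =
  sc.newAPIHadoopFile(path,
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])
    .map(_._2.toString)

// Illustrative usage, assuming sc is an existing SparkContext:
// val lines = lzoTextFile(sc, "hdfs:///path/to/myfile.lzo")
// println(lines.count())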
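Stephen's pointer to SparkContext.hadoopFile covers the other case: the old "mapred" API that Rajeev's streaming job used via DeprecatedLzoTextInputFormat. A hedged sketch of that variant (again with an illustrative path, and sc being an existing SparkContext) might look like:

import com.hadoop.mapred.DeprecatedLzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Old-API ("mapred") equivalent of the call above, using the input format
// from Rajeev's Hadoop streaming run. The path is hypothetical.
val oldApiLines =
  sc.hadoopFile("hdfs:///path/to/myfile.lzo",
      classOf[DeprecatedLzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])
    .map(_._2.toString)  // copy out of the reused Text objects right away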
