Andrew,

This is great. Excuse my ignorance, but what do you mean by RF=3? Also, after
reading the LZO files, are you able to access the contents directly, or do you
have to decompress them after reading?
Sent from my iPhone

> On Dec 24, 2013, at 12:03 AM, Andrew Ash <[email protected]> wrote:
>
> Hi Rajeev,
>
> I'm not sure if you ever got it working, but I just got mine up and going.
> If you just use sc.textFile(...) the file will be read, but the LZO index
> won't be used, so a .count() on my 1B+ row file took 2483s. When I ran it
> like this, though:
>
> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text]).count
>
> the LZO index file was used and the .count() took just 101s. For reference,
> this file is 43GB when .gz compressed and 78.4GB when .lzo compressed. I
> have RF=3, and this is across 4 pretty beefy machines with Hadoop DataNodes
> and Spark both running on each machine.
>
> Cheers!
> Andrew
>
>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava
>> <[email protected]> wrote:
>> Thanks for your suggestion. I will try this and update by late evening.
>>
>> regards
>> Rajeev
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell: 408-409-0940
>>
>>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <[email protected]> wrote:
>>> Hi Rajeev,
>>>
>>> It looks like you're using the
>>> com.hadoop.mapred.DeprecatedLzoTextInputFormat input format above, while
>>> Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>>>
>>> I think the way to use this in Spark would be to use the
>>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
>>> the path and the InputFormat as parameters. Can you give those a shot?
>>>
>>> Andrew
>>>
>>>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava
>>>> <[email protected]> wrote:
>>>> Hi Stephen,
>>>> I tried the same LZO file with a simple Hadoop streaming job, and it
>>>> seems to work fine:
>>>>
>>>> HADOOP_HOME=/usr/lib/hadoop /usr/bin/hadoop jar \
>>>>   /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>>>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>>>   -input /tmp/ldpc.sstv3.lzo \
>>>>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>>   -output wc_test \
>>>>   -mapper 'cat' \
>>>>   -reducer 'wc -l'
>>>>
>>>> This means Hadoop is able to handle the LZO file correctly.
>>>>
>>>> Can you suggest what I should do in Spark to make it work?
>>>>
>>>> regards
>>>> Rajeev
>>>>
>>>> Rajeev Srivastava
>>>> Silverline Design Inc
>>>> 2118 Walsh ave, suite 204
>>>> Santa Clara, CA, 95050
>>>> cell: 408-409-0940
>>>>
>>>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman
>>>>> <[email protected]> wrote:
>>>>>
>>>>> > System.setProperty("spark.io.compression.codec",
>>>>> >   "com.hadoop.compression.lzo.LzopCodec")
>>>>>
>>>>> This spark.io.compression.codec is a completely different setting from
>>>>> the codecs used for reading/writing from HDFS. (It is for compressing
>>>>> Spark's internal, non-HDFS intermediate output.)
>>>>>
>>>>> > Hope this helps and someone can help read a LZO file
>>>>>
>>>>> Spark just uses the regular Hadoop FileSystem API, so any issues with
>>>>> reading LZO files would be Hadoop issues. I would search the Hadoop
>>>>> issue tracker and look for information on using LZO files with
>>>>> Hadoop/Hive; whatever works for them should magically work for Spark
>>>>> as well.
>>>>>
>>>>> This looks like a good place to start:
>>>>>
>>>>> https://github.com/twitter/hadoop-lzo
>>>>>
>>>>> IANAE, but I would try passing one of these:
>>>>>
>>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>>
>>>>> to the SparkContext.hadoopFile method.
>>>>>
>>>>> - Stephen
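Pulling the suggestions in this thread together, here is a minimal sketch of
both read paths for anyone who lands here later. It assumes the hadoop-lzo
jar is on the Spark classpath and that the .lzo file has already been indexed
(hadoop-lzo ships com.hadoop.compression.lzo.DistributedLzoIndexer for that);
the path, master URL, and object name are placeholders, not values from the
thread. Note that RF=3 in Andrew's message reads as an HDFS replication
factor of 3, and that both input formats hand back already-decompressed text
lines, so no separate decompression step is needed after reading.

    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat         // new-API input format
    import com.hadoop.mapred.DeprecatedLzoTextInputFormat // old-API equivalent
    import org.apache.spark.SparkContext

    object LzoReadSketch {
      def main(args: Array[String]): Unit = {
        // Local master and app name are placeholders; pass a real master URL
        // on a cluster.
        val sc = new SparkContext("local[*]", "lzo-read-sketch")
        val path = "hdfs:///path/to/myfile.lzo" // placeholder path

        // New Hadoop API (com.hadoop.mapreduce): honors the .lzo.index file,
        // so the file is split across many tasks instead of read as one split.
        val newApiLines = sc.newAPIHadoopFile(
            path,
            classOf[LzoTextInputFormat],
            classOf[LongWritable],  // key: byte offset into the file
            classOf[Text])          // value: one decompressed text line
          .map(_._2.toString)       // copy out of Hadoop's reusable Text object
        println("rows via new API: " + newApiLines.count())

        // Old Hadoop API (com.hadoop.mapred): same idea via hadoopFile,
        // matching the DeprecatedLzoTextInputFormat from the streaming job.
        val oldApiLines = sc.hadoopFile(
            path,
            classOf[DeprecatedLzoTextInputFormat],
            classOf[LongWritable],
            classOf[Text])
          .map(_._2.toString)
        println("rows via old API: " + oldApiLines.count())

        sc.stop()
      }
    }

A quick way to check whether the index is actually being used, per Andrew's
numbers above, is to compare partition counts: without the .lzo.index the
whole file arrives as a single split, while an indexed file fans out into
many partitions (and correspondingly many tasks).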
