Hi Rajeev,
Did you get past this exception?
Thanks,
Vipul

On Dec 26, 2013, at 12:48 PM, Rajeev Srivastava <[email protected]> wrote:

> Hi Andrew,
> Thanks for your example.
> I used your command and I get the following errors from the worker (missing
> codec on the worker, I guess).
> How do I get the codecs over to the worker machines?
> regards
> Rajeev
> *******************************************************************
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException:
> Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run
>         at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
>         at spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:68)
>         at spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>         at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
>         at spark.RDD.iterator(RDD.scala:196)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
> 13/12/26 12:34:42 INFO TaskSetManager: Starting task 0.0:15 as TID 28 on executor 4: hadoop02 (preferred)
> 13/12/26 12:34:42 INFO TaskSetManager: Serialized task 0.0:15 as 1358 bytes in 0 ms
> 13/12/26 12:34:42 INFO TaskSetManager: Lost TID 22 (task 0.0:20)
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run [duplicate 1]
>
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh ave, suite 204
> Santa Clara, CA, 95050
> cell : 408-409-0940
>
>
> On Tue, Dec 24, 2013 at 5:20 PM, Andrew Ash <[email protected]> wrote:
> Hi Berkeley,
>
> By RF=3 I mean a replication factor of 3 on the files in HDFS, so each block
> is stored 3 times across the cluster. It's a pretty standard choice for the
> replication factor, since it gives a hardware team time to replace bad
> hardware after a failure. With RF=3 the cluster can sustain the failure of
> any two nodes without data loss, but losing a third node may cause data loss.
>
> When reading the LZO files with the newAPIHadoopFile() call I showed below,
> the data in the RDD is already decompressed -- it transparently looks the
> same to my Spark program as if I were operating on an uncompressed file.
>
> Cheers,
> Andrew
>
>
> On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon <[email protected]> wrote:
> Andrew, this is great.
>
> Excuse my ignorance, but what do you mean by RF=3? Also, after reading the
> LZO files, are you able to access the contents directly, or do you have to
> decompress them after reading them?
>
> Sent from my iPhone
>
> On Dec 24, 2013, at 12:03 AM, Andrew Ash <[email protected]> wrote:
>
>> Hi Rajeev,
>>
>> I'm not sure if you ever got it working, but I just got mine up and going.
>> If you just use sc.textFile(...) the file will be read, but the LZO index
>> won't be used, so a .count() on my 1B+ row file took 2483s. When I ran it
>> like this, though:
>>
>> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>   classOf[org.apache.hadoop.io.LongWritable],
>>   classOf[org.apache.hadoop.io.Text]).count
>>
>> the LZO index file was used and the .count() took just 101s. For reference,
>> this file is 43GB when .gz compressed and 78.4GB when .lzo compressed. I
>> have RF=3, and this is across 4 pretty beefy machines with Hadoop DataNodes
>> and Spark both running on each machine.
>>
>> Cheers!
>> Andrew
>>
>>
>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava <[email protected]> wrote:
>> Thanks for your suggestion. I will try this and update by late evening.
>>
>> regards
>> Rajeev
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell : 408-409-0940
>>
>>
>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <[email protected]> wrote:
>> Hi Rajeev,
>>
>> It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat
>> input format above, while Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>>
>> I think the way to use this in Spark would be to use the
>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
>> the path and the InputFormat as parameters. Can you give those a shot?
>>
>> Andrew
>>
>>
>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <[email protected]> wrote:
>> Hi Stephen,
>> I tried the same lzo file with a simple hadoop streaming script, and this
>> seems to work fine:
>>
>> HADOOP_HOME=/usr/lib/hadoop
>> /usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>   -input /tmp/ldpc.sstv3.lzo \
>>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>   -output wc_test \
>>   -mapper 'cat' \
>>   -reducer 'wc -l'
>>
>> This means Hadoop is able to handle the lzo file correctly.
>>
>> Can you suggest what I should do in Spark for it to work?
>>
>> regards
>> Rajeev
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell : 408-409-0940
>>
>>
>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <[email protected]> wrote:
>>
>> > System.setProperty("spark.io.compression.codec",
>> >   "com.hadoop.compression.lzo.LzopCodec")
>>
>> This spark.io.compression.codec is a completely different setting from the
>> codecs that are used for reading/writing from HDFS. (It is for compressing
>> Spark's internal/non-HDFS intermediate output.)
>>
>> > Hope this helps and someone can help read a LZO file
>>
>> Spark just uses the regular Hadoop FileSystem API, so any issues with
>> reading LZO files would be Hadoop issues. I would search the Hadoop issue
>> tracker and look for information on using LZO files with Hadoop/Hive;
>> whatever works for them should magically work for Spark as well.
>>
>> This looks like a good place to start:
>>
>> https://github.com/twitter/hadoop-lzo
>>
>> IANAE, but I would try passing one of these:
>>
>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>
>> to the SparkContext.hadoopFile method.
>>
>> - Stephen
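For anyone landing on this thread later, here is a minimal Spark-shell sketch of Andrew's approach, extended to pull the decompressed lines out of the (key, value) pairs that newAPIHadoopFile returns. It assumes the hadoop-lzo jar and its native libraries are visible to the driver and to every worker (otherwise you hit the "Codec for file ... not found" error above), and the HDFS path is a placeholder:

    // Sketch only: read an indexed .lzo file in splits and expose the plain-text lines.
    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat

    val pairs = sc.newAPIHadoopFile(
      "hdfs:///path/to/myfile.lzo",   // placeholder path
      classOf[LzoTextInputFormat],    // splittable, index-aware input format from hadoop-lzo
      classOf[LongWritable],          // key: byte offset within the file
      classOf[Text])                  // value: one decompressed line

    val lines = pairs.map(_._2.toString)  // copy each Text into a String, since Hadoop reuses the object
    lines.count

As for Rajeev's open question about shipping the codec to the workers: com.hadoop.compression.lzo.LzopCodec lives in the hadoop-lzo jar and calls into a native library, so both have to be reachable from every executor JVM (for example via the worker-side classpath and library-path settings in spark-env.sh); the exact mechanism depends on the Spark and CDH versions in use, so treat this as a pointer rather than a recipe.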
