Hi Rajeev,
Did you get past this exception?
Thanks,
Vipul

On Dec 26, 2013, at 12:48 PM, Rajeev Srivastava <[email protected]> wrote:

> Hi Andrew,
> Thanks for your example.
> I used your command and I get the following errors from the worker (missing
> codec on the worker, I guess).
> How do I get the codecs over to the worker machines?
> regards
> Rajeev
> *******************************************************************
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException:
> Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run
>         at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:97)
>         at spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:68)
>         at spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:57)
>         at spark.RDD.computeOrReadCheckpoint(RDD.scala:207)
>         at spark.RDD.iterator(RDD.scala:196)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:724)
> 13/12/26 12:34:42 INFO TaskSetManager: Starting task 0.0:15 as TID 28 on executor 4: hadoop02 (preferred)
> 13/12/26 12:34:42 INFO TaskSetManager: Serialized task 0.0:15 as 1358 bytes in 0 ms
> 13/12/26 12:34:42 INFO TaskSetManager: Lost TID 22 (task 0.0:20)
> 13/12/26 12:34:42 INFO TaskSetManager: Loss was due to java.io.IOException: Codec for file hdfs://hadoop00/tmp/ldpc_dec_top_2450000_to_2750000.vcd.sstv3.lzo not found, cannot run [duplicate 1]
>
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh ave, suite 204
> Santa Clara, CA, 95050
> cell : 408-409-0940
>
>
> On Tue, Dec 24, 2013 at 5:20 PM, Andrew Ash <[email protected]> wrote:
> Hi Berkeley,
>
> By RF=3 I mean a replication factor of 3 on the files in HDFS, so each block
> is stored 3 times across the cluster. It's a pretty standard choice for the
> replication factor, since it gives a hardware team time to replace bad
> hardware after a failure. With RF=3 the cluster can sustain the failure of
> any two nodes without data loss, but losing a third node may cause data loss.
>
> When reading the LZO files with the newAPIHadoopFile() call I showed below,
> the data in the RDD is already decompressed -- it transparently looks the
> same to my Spark program as if I were operating on an uncompressed file.
>
> Cheers,
> Andrew
>
>
> On Tue, Dec 24, 2013 at 12:29 PM, Berkeley Malagon <[email protected]> wrote:
> Andrew, this is great.
>
> Excuse my ignorance, but what do you mean by RF=3? Also, after reading the
> LZO files, are you able to access the contents directly, or do you have to
> decompress them after reading them?
>
> Sent from my iPhone
>
> On Dec 24, 2013, at 12:03 AM, Andrew Ash <[email protected]> wrote:
>
>> Hi Rajeev,
>>
>> I'm not sure if you ever got it working, but I just got mine up and going.
>> If you just use sc.textFile(...) the file will be read, but the LZO index
>> won't be used, so a .count() on my 1B+ row file took 2483s. When I ran it
>> like this, though:
>>
>> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>>   classOf[org.apache.hadoop.io.LongWritable],
>>   classOf[org.apache.hadoop.io.Text]).count
>>
>> the LZO index file was used and the .count() took just 101s. For reference,
>> this file is 43GB when .gz compressed and 78.4GB when .lzo compressed. I
>> have RF=3, and this is across 4 pretty beefy machines with Hadoop DataNodes
>> and Spark both running on each machine.
>>
>> Cheers!
>> Andrew
>>
>>
>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava <[email protected]> wrote:
>> Thanks for your suggestion. I will try this and update by late evening.
>>
>> regards
>> Rajeev
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell : 408-409-0940
>>
>>
>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <[email protected]> wrote:
>> Hi Rajeev,
>>
>> It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat
>> input format above, while Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>>
>> I think the way to use this in Spark would be to use the
>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
>> the path and the InputFormat as parameters. Can you give those a shot?
>>
>> Andrew
>>
>>
>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <[email protected]> wrote:
>> Hi Stephen,
>> I tried the same lzo file with a simple hadoop streaming script, and this
>> seems to work fine:
>>
>> HADOOP_HOME=/usr/lib/hadoop
>> /usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>   -input /tmp/ldpc.sstv3.lzo \
>>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>   -output wc_test \
>>   -mapper 'cat' \
>>   -reducer 'wc -l'
>>
>> This means Hadoop is able to handle the lzo file correctly.
>>
>> Can you suggest what I should do in Spark for it to work?
>>
>> regards
>> Rajeev
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell : 408-409-0940
>>
>>
>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <[email protected]> wrote:
>>
>> > System.setProperty("spark.io.compression.codec",
>> >   "com.hadoop.compression.lzo.LzopCodec")
>>
>> This spark.io.compression.codec is a completely different setting from the
>> codecs that are used for reading/writing from HDFS. (It is for compressing
>> Spark's internal/non-HDFS intermediate output.)
>>
>> > Hope this helps and someone can help read a LZO file
>>
>> Spark just uses the regular Hadoop FileSystem API, so any issues with
>> reading LZO files would be Hadoop issues. I would search the Hadoop issue
>> tracker and look for information on using LZO files with Hadoop/Hive;
>> whatever works for them should magically work for Spark as well.
>>
>> This looks like a good place to start:
>>
>> https://github.com/twitter/hadoop-lzo
>>
>> IANAE, but I would try passing one of these:
>>
>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>
>> to the SparkContext.hadoopFile method.
>>
>> - Stephen
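For anyone landing on this thread later, here is a minimal Spark-shell sketch of Andrew's approach, extended to pull the decompressed lines out of the (key, value) pairs that newAPIHadoopFile returns. It assumes the hadoop-lzo jar and its native libraries are visible to the driver and to every worker (otherwise you hit the "Codec for file ... not found" error above), and the HDFS path is a placeholder:

    // Sketch only: read an indexed .lzo file in splits and expose the plain-text lines.
    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat

    val pairs = sc.newAPIHadoopFile(
      "hdfs:///path/to/myfile.lzo",   // placeholder path
      classOf[LzoTextInputFormat],    // splittable, index-aware input format from hadoop-lzo
      classOf[LongWritable],          // key: byte offset within the file
      classOf[Text])                  // value: one decompressed line

    val lines = pairs.map(_._2.toString)  // copy each Text into a String, since Hadoop reuses the object
    lines.count

As for Rajeev's open question about shipping the codec to the workers: com.hadoop.compression.lzo.LzopCodec lives in the hadoop-lzo jar and calls into a native library, so both have to be reachable from every executor JVM (for example via the worker-side classpath and library-path settings in spark-env.sh); the exact mechanism depends on the Spark and CDH versions in use, so treat this as a pointer rather than a recipe.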
