Hi Rajeev,

It looks like you're using the com.hadoop.mapred.DeprecatedLzoTextInputFormat input format above, while Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
I think the way to do this in Spark is to call SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() with the path and the InputFormat as parameters. hadoopFile() takes old-API (org.apache.hadoop.mapred) input formats and newAPIHadoopFile() takes new-API (org.apache.hadoop.mapreduce) ones, so the method has to match the format class you pick. Can you give those a shot?

Andrew

On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava <[email protected]> wrote:

> Hi Stephen,
> I tried the same lzo file with a simple hadoop script
> this seems to work fine
>
> HADOOP_HOME=/usr/lib/hadoop
> /usr/bin/hadoop jar /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>   -input /tmp/ldpc.sstv3.lzo \
>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>   -output wc_test \
>   -mapper 'cat' \
>   -reducer 'wc -l'
>
> This means hadoop is able to handle the lzo file correctly
>
> Can you suggest me what i should do in spark for it to work
>
> regards
> Rajeev
>
>
> Rajeev Srivastava
> Silverline Design Inc
> 2118 Walsh ave, suite 204
> Santa Clara, CA, 95050
> cell : 408-409-0940
>
>
> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman <[email protected]> wrote:
>
>>
>> > System.setProperty("spark.io.compression.codec",
>> >   "com.hadoop.compression.lzo.LzopCodec")
>>
>> This spark.io.compression.codec is a completely different setting than the
>> codecs that are used for reading/writing from HDFS. (It is for compressing
>> Spark's internal/non-HDFS intermediate output.)
>>
>> > Hope this helps and someone can help read a LZO file
>>
>> Spark just uses the regular Hadoop File System API, so any issues with
>> reading LZO files would be Hadoop issues. I would search in the Hadoop
>> issue tracker, and look for information on using LZO files with
>> Hadoop/Hive, and whatever works for them, should magically work for
>> Spark as well.
>>
>> This looks like a good place to start:
>>
>> https://github.com/twitter/hadoop-lzo
>>
>> IANAE, but I would try passing one of these:
>>
>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>
>> To the SparkContext.hadoopFile method.
>>
>> - Stephen
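
For reference, here is a rough, untested sketch of what the two calls discussed in this thread might look like. It assumes the hadoop-lzo jar and the native LZO libraries are available to both the driver and the executors, and it reuses the /tmp/ldpc.sstv3.lzo path from Rajeev's streaming command. The object name, the "local" master and the application name are placeholders of mine, not anything from the thread. Note that the new-API com.hadoop.mapreduce.LzoTextInputFormat goes with newAPIHadoopFile(), while the old-API com.hadoop.mapred.DeprecatedLzoTextInputFormat goes with hadoopFile().

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext

// Hypothetical driver object, for illustration only.
object LzoReadSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder master and app name; adjust for your cluster.
    val sc = new SparkContext("local", "lzo-read-sketch")

    // Path taken from the hadoop-streaming command quoted above.
    val path = "/tmp/ldpc.sstv3.lzo"

    // Old (mapred) API: the same input format the streaming job used.
    val viaOldApi = sc.hadoopFile(
      path,
      classOf[com.hadoop.mapred.DeprecatedLzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // New (mapreduce) API: the class Stephen linked to.
    val viaNewApi = sc.newAPIHadoopFile(
      path,
      classOf[com.hadoop.mapreduce.LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      sc.hadoopConfiguration)

    // Keys are byte offsets and values are lines; copy each Text to a
    // String right away because Hadoop reuses Writable instances.
    val lines = viaNewApi.map { case (_, text) => text.toString }
    println("lines via new API: " + lines.count())
    println("lines via old API: " + viaOldApi.map(_._2.toString).count())

    sc.stop()
  }
}
```

If both counts agree with the `wc -l` result from the streaming job, the input format is being picked up correctly and either call should do.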
