Andrew,

This is great. Excuse my ignorance, but what do you mean by RF=3? Also, after
reading the LZO files, are you able to access the contents directly, or do you
have to decompress them after reading?
Sent from my iPhone

> On Dec 24, 2013, at 12:03 AM, Andrew Ash <[email protected]> wrote:
>
> Hi Rajeev,
>
> I'm not sure if you ever got it working, but I just got mine up and going.
> If you just use sc.textFile(...) the file will be read, but the LZO index
> won't be used, so a .count() on my 1B+ row file took 2483s. When I ran it
> like this, though:
>
> sc.newAPIHadoopFile("hdfs:///path/to/myfile.lzo",
>   classOf[com.hadoop.mapreduce.LzoTextInputFormat],
>   classOf[org.apache.hadoop.io.LongWritable],
>   classOf[org.apache.hadoop.io.Text]).count
>
> the LZO index file was used and the .count() took just 101s. For reference,
> this file is 43GB when .gz compressed and 78.4GB when .lzo compressed. I
> have RF=3, and this is across 4 pretty beefy machines with Hadoop DataNodes
> and Spark both running on each machine.
>
> Cheers!
> Andrew
>
>> On Mon, Dec 16, 2013 at 2:34 PM, Rajeev Srivastava
>> <[email protected]> wrote:
>> Thanks for your suggestion. I will try this and update by late evening.
>>
>> regards
>> Rajeev
>>
>> Rajeev Srivastava
>> Silverline Design Inc
>> 2118 Walsh ave, suite 204
>> Santa Clara, CA, 95050
>> cell: 408-409-0940
>>
>>> On Mon, Dec 16, 2013 at 11:24 AM, Andrew Ash <[email protected]> wrote:
>>> Hi Rajeev,
>>>
>>> It looks like you're using the
>>> com.hadoop.mapred.DeprecatedLzoTextInputFormat input format above, while
>>> Stephen referred to com.hadoop.mapreduce.LzoTextInputFormat.
>>>
>>> I think the way to use this in Spark would be to use the
>>> SparkContext.hadoopFile() or SparkContext.newAPIHadoopFile() methods with
>>> the path and the InputFormat as parameters. Can you give those a shot?
>>>
>>> Andrew
>>>
>>>> On Wed, Dec 11, 2013 at 8:59 PM, Rajeev Srivastava
>>>> <[email protected]> wrote:
>>>> Hi Stephen,
>>>> I tried the same LZO file with a simple Hadoop streaming job, and it
>>>> seems to work fine:
>>>>
>>>> HADOOP_HOME=/usr/lib/hadoop /usr/bin/hadoop jar \
>>>>   /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop-mapreduce/hadoop-streaming.jar \
>>>>   -libjars /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
>>>>   -input /tmp/ldpc.sstv3.lzo \
>>>>   -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
>>>>   -output wc_test \
>>>>   -mapper 'cat' \
>>>>   -reducer 'wc -l'
>>>>
>>>> This means Hadoop is able to handle the LZO file correctly.
>>>>
>>>> Can you suggest what I should do in Spark to make it work?
>>>>
>>>> regards
>>>> Rajeev
>>>>
>>>> Rajeev Srivastava
>>>> Silverline Design Inc
>>>> 2118 Walsh ave, suite 204
>>>> Santa Clara, CA, 95050
>>>> cell: 408-409-0940
>>>>
>>>>> On Tue, Dec 10, 2013 at 1:20 PM, Stephen Haberman
>>>>> <[email protected]> wrote:
>>>>>
>>>>> > System.setProperty("spark.io.compression.codec",
>>>>> >   "com.hadoop.compression.lzo.LzopCodec")
>>>>>
>>>>> This spark.io.compression.codec is a completely different setting from
>>>>> the codecs used for reading/writing from HDFS. (It is for compressing
>>>>> Spark's internal, non-HDFS intermediate output.)
>>>>>
>>>>> > Hope this helps and someone can help read a LZO file
>>>>>
>>>>> Spark just uses the regular Hadoop FileSystem API, so any issues with
>>>>> reading LZO files would be Hadoop issues. I would search the Hadoop
>>>>> issue tracker and look for information on using LZO files with
>>>>> Hadoop/Hive; whatever works for them should magically work for Spark
>>>>> as well.
>>>>>
>>>>> This looks like a good place to start:
>>>>>
>>>>> https://github.com/twitter/hadoop-lzo
>>>>>
>>>>> IANAE, but I would try passing one of these:
>>>>>
>>>>> https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
>>>>>
>>>>> to the SparkContext.hadoopFile method.
>>>>>
>>>>> - Stephen
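Pulling the suggestions in this thread together, here is a minimal sketch of
both read paths for anyone who lands here later. It assumes the hadoop-lzo
jar is on the Spark classpath and that the .lzo file has already been indexed
(hadoop-lzo ships com.hadoop.compression.lzo.DistributedLzoIndexer for that);
the path, master URL, and object name are placeholders, not values from the
thread. Note that RF=3 in Andrew's message reads as an HDFS replication
factor of 3, and that both input formats hand back already-decompressed text
lines, so no separate decompression step is needed after reading.

    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat         // new-API input format
    import com.hadoop.mapred.DeprecatedLzoTextInputFormat // old-API equivalent
    import org.apache.spark.SparkContext

    object LzoReadSketch {
      def main(args: Array[String]): Unit = {
        // Local master and app name are placeholders; pass a real master URL
        // on a cluster.
        val sc = new SparkContext("local[*]", "lzo-read-sketch")
        val path = "hdfs:///path/to/myfile.lzo" // placeholder path

        // New Hadoop API (com.hadoop.mapreduce): honors the .lzo.index file,
        // so the file is split across many tasks instead of read as one split.
        val newApiLines = sc.newAPIHadoopFile(
            path,
            classOf[LzoTextInputFormat],
            classOf[LongWritable],  // key: byte offset into the file
            classOf[Text])          // value: one decompressed text line
          .map(_._2.toString)       // copy out of Hadoop's reusable Text object
        println("rows via new API: " + newApiLines.count())

        // Old Hadoop API (com.hadoop.mapred): same idea via hadoopFile,
        // matching the DeprecatedLzoTextInputFormat from the streaming job.
        val oldApiLines = sc.hadoopFile(
            path,
            classOf[DeprecatedLzoTextInputFormat],
            classOf[LongWritable],
            classOf[Text])
          .map(_._2.toString)
        println("rows via old API: " + oldApiLines.count())

        sc.stop()
      }
    }

A quick way to check whether the index is actually being used, per Andrew's
numbers above, is to compare partition counts: without the .lzo.index the
whole file arrives as a single split, while an indexed file fans out into
many partitions (and correspondingly many tasks).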
