> From: Hiller, Dean x66079 <[email protected]>
>
> Is there an example of LZO, and is that what I want customers to
> deliver as, correct? As when LZO comes in and is laid over Hadoop,
> it is split into chunks on different nodes so I can decompress in
> parallel?
When you copy a big LZO-compressed file into HDFS it is split into blocks and distributed in the usual manner, yes.

> We have a customer who sends a 10 million line file. I see some other
> stats from production on a higher-end system that is taking 5 hours
> to decompress, so naturally I would like to make this faster.

You will have to do a one-time indexing of the LZO file before it can be split by a suitable InputFormat. See:

  http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

So if you process the data multiple times, this will produce a substantial speedup. If not, maybe you want the customer to break up the big files into a number of smaller ones that can be processed single-shot in parallel.

Better, over in HADOOP-4012, Hadoop recognized that BZIP2 has identifiable block boundaries in the compression format, and also dealt with the complication that file structure does not often coincide with compression block boundaries. CDH3 has this in 0.20.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
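As a sketch of the one-time indexing step (assuming the hadoop-lzo jar described in the blog post above; the jar path and HDFS path are illustrative):

```shell
# One-time index of an LZO file already in HDFS, so that
# LzoTextInputFormat can split it across map tasks.
# Single-process indexer (streams the file from one client):
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.LzoIndexer /data/big_file.lzo

# Or run the indexing itself as a MapReduce job, which is
# useful when there are many large files to index:
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /data/big_file.lzo
```

This writes a `big_file.lzo.index` file next to the data; jobs that read it with `LzoTextInputFormat` then get one split per indexed block instead of a single mapper scanning the whole file.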

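If you go the other route and have the customer break the file up before sending, a minimal sketch with plain `split` (the 4-line chunk size and filenames are illustrative; in practice you would pick a chunk size in the hundreds of megabytes and compress each piece, e.g. with `lzop`, before shipping):

```shell
# Make a small sample file to demonstrate on: 10 lines of text.
printf 'line %d\n' $(seq 1 10) > big_file.txt

# Break it into fixed-size pieces that can each be compressed and
# processed independently: 4 lines per piece here, yielding
# big_file.part.aa, big_file.part.ab, big_file.part.ac.
split -l 4 big_file.txt big_file.part.

ls big_file.part.* | wc -l    # 3 pieces
wc -l < big_file.part.aa      # 4 lines in the first piece
```

Each piece is then a self-contained input: one mapper per piece, no index needed, at the cost of managing many files instead of one.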