> From: Hiller, Dean x66079 <[email protected]>
>
> Is there an example of LZO, and is that what I want customers to
> deliver as, correct? As when LZO comes in and is laid over Hadoop,
> it is split into chunks on different nodes so I can decompress in
> parallel?
When you copy a big LZO-compressed file into HDFS it is split into blocks and distributed in the usual manner, yes.

> We have a customer who sends a 10 million line file. I see some other
> stats from production on a higher-end system that is taking 5 hours
> to decompress, so naturally I would like to make this faster.

You will have to do a one-time indexing of the LZO file before it can be split by a suitable InputFormat. See:

  http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

So if you process the data multiple times, this will produce a substantial speedup. If not, maybe you want the customer to break up the big files into a number of smaller ones that can be processed single-shot in parallel.

Better, over in HADOOP-4012, Hadoop recognized that BZIP2 has identifiable block boundaries in the compression format, and also dealt with the complication that file structure does not often coincide with compression block boundaries. CDH3 has this in 0.20.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
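As a sketch of the one-time indexing step (assuming the hadoop-lzo jar described in the blog post above; the jar path and HDFS path are illustrative):

```shell
# One-time index of an LZO file already in HDFS, so that
# LzoTextInputFormat can split it across map tasks.
# Single-process indexer (streams the file from one client):
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.LzoIndexer /data/big_file.lzo

# Or run the indexing itself as a MapReduce job, which is
# useful when there are many large files to index:
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /data/big_file.lzo
```

This writes a `big_file.lzo.index` file next to the data; jobs that read it with `LzoTextInputFormat` then get one split per indexed block instead of a single mapper scanning the whole file.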

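If you go the other route and have the customer break the file up before sending, a minimal sketch with plain `split` (the 4-line chunk size and filenames are illustrative; in practice you would pick a chunk size in the hundreds of megabytes and compress each piece, e.g. with `lzop`, before shipping):

```shell
# Make a small sample file to demonstrate on: 10 lines of text.
printf 'line %d\n' $(seq 1 10) > big_file.txt

# Break it into fixed-size pieces that can each be compressed and
# processed independently: 4 lines per piece here, yielding
# big_file.part.aa, big_file.part.ab, big_file.part.ac.
split -l 4 big_file.txt big_file.part.

ls big_file.part.* | wc -l    # 3 pieces
wc -l < big_file.part.aa      # 4 lines in the first piece
```

Each piece is then a self-contained input: one mapper per piece, no index needed, at the cost of managing many files instead of one.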