I think you nailed it with "I guess I/O is not a bottleneck for me". Yes, when you can dedicate a CPU to it, in-stream decompression is faster than I/O. But if your downstream process is complicated, you probably won't see much benefit, because the decompression step will spend most of its time waiting for the downstream process.
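To put rough numbers on that, here is a back-of-the-envelope sketch of the reasoning. It is not a measurement and not how Pig or Hadoop schedule anything. It assumes a streaming pipeline whose throughput is set by its slowest stage. The function names and all the MB/s figures are made up for illustration; only the compression ratio of 3 comes from the "compressed to about 1/3" counter in the original mail.

```python
def pipeline_throughput(io_mb_s, downstream_mb_s,
                        compression_ratio=1.0, decompress_mb_s=None):
    """Uncompressed MB/s delivered by a read -> (decompress) -> process pipeline.

    Assumption: the stages overlap, so the slowest stage sets the pace.
    Reading compressed data delivers `compression_ratio` times more
    uncompressed bytes per byte pulled off disk.
    """
    stages = [io_mb_s * compression_ratio, downstream_mb_s]
    if decompress_mb_s is not None:
        stages.append(decompress_mb_s)
    return min(stages)


def speedup(io_mb_s, downstream_mb_s, compression_ratio, decompress_mb_s):
    """Throughput with compression divided by throughput without it."""
    with_codec = pipeline_throughput(io_mb_s, downstream_mb_s,
                                     compression_ratio, decompress_mb_s)
    without_codec = pipeline_throughput(io_mb_s, downstream_mb_s)
    return with_codec / without_codec


# Hypothetical rates in MB/s, chosen only to show the three regimes.
io_bound = speedup(io_mb_s=50, downstream_mb_s=500,
                   compression_ratio=3, decompress_mb_s=400)
cpu_bound = speedup(io_mb_s=50, downstream_mb_s=20,
                    compression_ratio=3, decompress_mb_s=400)
in_between = speedup(io_mb_s=50, downstream_mb_s=80,
                     compression_ratio=3, decompress_mb_s=400)

print(io_bound, cpu_bound, in_between)
```

With these made-up rates, a trivial downstream job gets the full benefit of the compression ratio, a heavy downstream job gets none, and a job that is somewhat faster than raw I/O lands in between.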
You'll see a little benefit if your Pig job (the downstream process) is faster than I/O but possibly slower than the decompression.

Kannan

On 18 November 2012 08:25, W W <[email protected]> wrote:

> hello
>
> In Alan Gates' Programming Pig, chapter "Making Pig Fly", it was
> mentioned:
> In testing we did while developing this feature we saw performance
> improvements of up to 4x when using LZO, and slight performance degradation
> when using gzip.
> (http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html)
>
> I've tried using lzo as the compression tool (it took me a couple of days
> to compile it), and also gzip.
> The result with gzip is the same as mentioned in the book, but the result
> with lzo is not an improvement of up to 4x, but almost no improvement, or
> a slight degradation as well.
>
> I enabled compression between Map and Reduce, and also between M/R
> jobs: "pig.tmpfilecompression=true pig.tmpfilecompression.codec=lzo".
>
> From the counters I can see the HDFS bytes are compressed to about 1/3
> compared to no compression.
> I can see the following in the log on the TaskTracker.
>
> 2012-11-18 16:14:11,638 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
> 2012-11-18 16:14:11,639 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library
> 2012-11-18 16:14:11,640 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>
> The data volume is about 6G in total, and I have 100 cpus + 150G memory
> spread over 10 nodes.
> My pig script is compiled into 4 M/R jobs. The operation in each job is:
> MAP_ONLY --> HASH_JOIN --> GROUP_BY --> HASH_JOIN.
>
> My guess at the reason is that IO is not a bottleneck for me, but was one
> in Alan Gates' case when he wrote the book.
>
> Does anyone have any clue why I didn't gain any improvement?
>
> Thanks
> Regards
> Xingbang Wang
>

--
Kannan Shah
Analytical-Modeling Staff Scientist
Financial Services - Modeling
SAS Institute
San Diego

Detection-and-Estimation Group
Data Fusion Laboratory
Philadelphia
