Re: Flume compression peculiar behaviour while processing compressed files by a map reduce job

Jagadish Bihani Mon, 29 Oct 2012 20:06:51 -0700

Does anyone have any input on why the behaviour described below might have happened?


On 10/26/2012 06:32 PM, Jagadish Bihani wrote:

The same thing happens even with gzip.
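
To illustrate what "the same thing" could look like for gzip, here is a minimal sketch in Python. It assumes (this is only the hypothesis from the original message, not a confirmed diagnosis) that the file consists of several gzip members appended one after another, one per written batch. The batch contents and the helper name `first_member_only` are made up for the illustration.

```python
import gzip
import zlib

# Hypothetical stand-in for a file written in batches: every batch is
# compressed on its own and the gzip members are simply appended.
batches = [
    "".join(f"record-{b}-{i}\n" for i in range(100)).encode()
    for b in range(3)
]
multi_member = b"".join(gzip.compress(batch) for batch in batches)


def first_member_only(data: bytes) -> bytes:
    """Decode one gzip member and ignore everything that follows it."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = d.decompress(data)
    # The remaining members are left untouched in d.unused_data.
    return out


partial = first_member_only(multi_member)
complete = gzip.decompress(multi_member)  # handles multi-member data

print(len(partial.splitlines()), "records from the first member only")
print(len(complete.splitlines()), "records from all members")
```

A reader that stops after the first member sees only the first batch, while a reader that continues into the following members sees every record. Whether the Hadoop 1.0.3 gzip codec behaves like the first reader is an assumption here.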

Regards,
Jagadish

On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
Hi

I have a very peculiar scenario.
1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can decompress it and
read it. It has 0.2 million records.
2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and, surprisingly, it only
reads the first 100 records.
3. I then decompress the same file on the local file system, use the Linux bzip2 command
to compress it again, and copy it to HDFS.
4. Now I run the map-reduce job, and this time it correctly processes all the records.
I think the Flume agent writes compressed data to the HDFS file in batches, and somehow
the bzip2 codec used by Hadoop reads only the first part of it.
This means bz2 files generated by Flume, if used directly, can't be processed by a map-reduce job.
Is there any solution to this?
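
The suspicion above can be sketched in Python. This is only an illustration of the hypothesis (a file made of several bzip2 streams appended one per batch, and a reader that stops after the first stream), not a confirmed description of what Flume 1.2.0 or the Hadoop 1.0.3 codec actually do. The batch contents and the helper names are hypothetical.

```python
import bz2

# Hypothetical stand-in for a file written in batches: each batch is
# compressed separately and the resulting bzip2 streams are appended.
batches = [
    "".join(f"record-{b}-{i}\n" for i in range(100)).encode()
    for b in range(3)
]
multi_stream = b"".join(bz2.compress(batch) for batch in batches)


def read_first_stream_only(data: bytes) -> bytes:
    """Decode a single bzip2 stream and ignore whatever follows it."""
    d = bz2.BZ2Decompressor()
    # After this call d.eof is True and d.unused_data holds the rest.
    return d.decompress(data)


def read_all_streams(data: bytes) -> bytes:
    """Decode every appended stream by restarting on the leftover bytes."""
    parts = []
    while data:
        d = bz2.BZ2Decompressor()
        parts.append(d.decompress(data))
        data = d.unused_data
    return b"".join(parts)


first_only = read_first_stream_only(multi_stream)
everything = read_all_streams(multi_stream)

# Steps 3 and 4: recompressing the full contents yields one stream,
# which even the single-stream reader decodes completely.
single_stream = bz2.compress(everything)
after_recompress = read_first_stream_only(single_stream)

print(len(first_only.splitlines()), "records from the first stream only")
print(len(everything.splitlines()), "records from all streams")
print(len(after_recompress.splitlines()), "records after recompressing")
```

Under these assumptions the single-stream reader returns only the first batch of the appended file but all records of the recompressed one, which matches the pattern in steps 2 and 4.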

Any input on other compression formats?

P.S.
Versions:
Flume 1.2.0 (Raw version; downloaded from http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
Hadoop 1.0.3

Regards,
Jagadish
