Re: Flume bz2 issue while processing by a map reduce job

Jagadish Bihani Fri, 02 Nov 2012 00:51:23 -0700

Hi

Any inputs on this?
It looks like a basic thing which, I guess, must have been handled in Flume.



On 10/30/2012 10:31 PM, Jagadish Bihani wrote:

Text.

A few updates on that:
-- It looks like some header issue.
-- When I copyToLocal the file and then copy it back to HDFS, the
map reduce job processes the file correctly.

Is it something related to https://issues.apache.org/jira/browse/HADOOP-6852?


Regards,
Jagadish


On 10/30/2012 09:15 PM, Brock Noland wrote:

What kind of files is your sink writing out? Text, Sequence, etc?

On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
<[email protected]>  wrote:

Same thing happens even for gzip.

Regards,
Jagadish


On 10/26/2012 04:30 PM, Jagadish Bihani wrote:

Hi

I have a very peculiar scenario.

1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
   decompress it and read it. It has 0.2 million records.
2. Now I give that file to a map-reduce job (Hadoop 1.0.3) and, surprisingly,
   it reads only the first 100 records.
3. I then decompress the same file on the local file system, compress it
   again with the Linux bzip2 command, and copy it to HDFS.
4. Now I run the map-reduce job and this time it correctly processes all
   the records.
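
The local round trip in step 3 can also be sketched in Python instead of the
bzip2 command line. This is only an illustration: recompress_single_stream is
a hypothetical helper name, the paths are placeholders, and the copy to HDFS
is not shown. It relies on Python's bz2.open reading a file through to the end
even when it holds several concatenated bzip2 streams, and on the output being
written as one stream.

```python
import bz2
import shutil


def recompress_single_stream(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress src_path and write it back as one bzip2 stream.

    Hypothetical helper: mirrors "bunzip2 then bzip2" on the local file system.
    """
    with bz2.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # Copy in chunks so a large file does not have to fit in memory.
        shutil.copyfileobj(src, dst, chunk_size)
```

The rewritten file would then be copied to HDFS as in step 3.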

I think the Flume agent writes compressed data to the HDFS file in batches,
and somehow the bzip2 codec used by Hadoop reads only the first part of it.
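
The batching hypothesis can be illustrated outside Hadoop with a small Python
sketch. It assumes (this is not confirmed in the thread) that each batch ends
up as its own complete bzip2 stream appended to the same file. The batch
contents below are made up. A reader that handles only one stream stops at the
first end-of-stream marker, while a multi-stream reader sees every record.

```python
import bz2

# Made-up stand-ins for two batches compressed separately and appended
# to the same file.
batch1 = b"record-1\nrecord-2\n"
batch2 = b"record-3\nrecord-4\n"
concatenated = bz2.compress(batch1) + bz2.compress(batch2)

# A single-stream reader: BZ2Decompressor stops at the end of the first stream.
single = bz2.BZ2Decompressor()
first_only = single.decompress(concatenated)
leftover = single.unused_data  # compressed bytes of the later stream(s)

# A multi-stream reader: bz2.decompress() (Python 3.3+) walks every stream.
everything = bz2.decompress(concatenated)

print(len(first_only.splitlines()), "records seen by the single-stream reader")
print(len(everything.splitlines()), "records seen by the multi-stream reader")
```

If the file on HDFS has this shape, it would explain why command-line
decompression sees all records while a single-stream reader sees only the
first batch.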

This way, bz2 files generated by Flume can't be processed directly by a
map-reduce job.
Is there any solution to it?

Any inputs about other compression formats?

P.S.
Versions:

Flume 1.2.0 (Raw version; downloaded from
http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
Hadoop 1.0.3

Regards,
Jagadish
