Re: Flume bz2 issue while processing by a map reduce job

Jagadish Bihani Sat, 03 Nov 2012 04:32:37 -0700

Hi Mike

Thanks for the valuable inputs. That was driving us crazy.
But I have tested that this issue doesn't happen with the lzo/lzop
compression formats (tested on hadoop 1.0.3).


Regards,
Jagadish



On 11/02/2012 03:16 PM, Mike Percy wrote:

Hi Jagadish,
My understanding, based on investigating this issue over the last couple of days, is that MapReduce jobs will only read the first section of a concatenated bzip2 file. I believe you are correct that https://issues.apache.org/jira/browse/HADOOP-6852 is the only way to solve this issue, and that would only be for the Hadoop 2.0 line, I believe. I think that the Hadoop 1.x line would need to backport other patches from the 0.22 line, including https://issues.apache.org/jira/browse/HADOOP-6835, which may also be needed (my understanding is that that patch is already included in the 2.x line).
I am aware of folks interested in trying to fix HADOOP-6852; however, I have no ETA to give.
From Flume's perspective, I know of no other way of ensuring durability using the hadoop-common APIs except for calling finalize in order to flush the compression buffer at each transaction/batch boundary, in order to call hflush()/hsync() with the fully written data. This results in concatenated compressed plain text files in the case of CompressedStream.
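
To illustrate the concatenated-stream behaviour described above, here is a minimal sketch in Python. It is not Hadoop or Flume code: it only uses Python's standard bz2 module to mimic a writer that finishes the compressor at each batch boundary and a reader that stops after the first stream. The function names and the sample records are made up for illustration.

```python
import bz2


def first_stream_only(data: bytes) -> bytes:
    """Decode like a single-stream reader: stop at the first end-of-stream marker."""
    decompressor = bz2.BZ2Decompressor()
    # Anything after the first stream ends up in decompressor.unused_data.
    return decompressor.decompress(data)


def all_streams(data: bytes) -> bytes:
    """Decode every concatenated stream (bz2.decompress handles this in Python 3.3+)."""
    return bz2.decompress(data)


# Hypothetical batches: each one is compressed and finished on its own,
# then appended to the same file, as happens at each batch boundary.
batches = [b"record-1\nrecord-2\n", b"record-3\nrecord-4\n"]
concatenated = b"".join(bz2.compress(batch) for batch in batches)

partial = first_stream_only(concatenated)
complete = all_streams(concatenated)
```

Here `partial` holds only the first batch while `complete` holds both, which matches the symptom of a job seeing only the first few records of an otherwise valid file.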
Current workarounds include not using compression, reprocessing the compressed file as you mention, using a SequenceFile as a container, or using an Avro file as a container. The latter two are splittable and properly handle several compression codecs, including Snappy, which is a great way to go if you can do it.
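
The "reprocess the compressed file" workaround can be sketched the same way. This is only an illustration with Python's standard bz2 module, assuming the file fits in memory; the function name is hypothetical, and in practice the same effect comes from decompressing and recompressing with the bzip2 command before copying the file back to HDFS.

```python
import bz2


def recompress_single_stream(data: bytes) -> bytes:
    """Rewrite concatenated bz2 streams as one single bz2 stream."""
    # bz2.decompress reads all concatenated streams (Python 3.3+);
    # bz2.compress then emits exactly one stream.
    return bz2.compress(bz2.decompress(data))


# Hypothetical multi-stream input, one stream per batch.
multi = bz2.compress(b"record-1\n") + bz2.compress(b"record-2\n")
single = recompress_single_stream(multi)

# A single-stream reader now sees every record and nothing is left over.
reader = bz2.BZ2Decompressor()
records = reader.decompress(single)
leftover = reader.unused_data
```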
Regards,
Mike
On Fri, Nov 2, 2012 at 12:50 AM, Jagadish Bihani <[email protected]> wrote:
    Hi

    Any inputs on this?
    It looks like a basic thing which, I guess, must have been handled
    in flume



    On 10/30/2012 10:31 PM, Jagadish Bihani wrote:
    Text.

    Few updates on that:
    -- It looks like some header issue.
    -- When I copyToLocal the file and then copy it back to HDFS,
    the map reduce job then processes the file correctly.
    Is it something related to
    https://issues.apache.org/jira/browse/HADOOP-6852?

    Regards,
    Jagadish


    On 10/30/2012 09:15 PM, Brock Noland wrote:
    What kind of files is your sink writing out? Text, Sequence, etc?

    On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani
    <[email protected]> wrote:
    Same thing happens even for gzip.

    Regards,
    Jagadish


    On 10/26/2012 04:30 PM, Jagadish Bihani wrote:
    Hi

    I have a very peculiar scenario.

    1. My HDFS sink creates a bz2 file. The file is perfectly fine: I can
       decompress it and read it. It has 0.2 million records.
    2. Now I give that file to a map-reduce job (hadoop 1.0.3) and,
       surprisingly, it only reads the first 100 records.
    3. I then decompress the same file on the local file system, use the
       bzip2 command of linux to compress it again, and copy it to HDFS.
    4. Now I run the map-reduce job and this time it correctly processes
       all the records.

    I think the flume agent writes compressed data to the HDFS file in
    batches, and somehow the bzip2 codec used by hadoop uses only the
    first part of it.

    This way bz2 files generated by Flume, if used directly, can't be
    processed by Map reduce job.
    Is there any solution to it?

    Any inputs about other compression formats?

    P.S.
    Versions:

    Flume 1.2.0 (Raw version; downloaded from
    http://www.apache.org/dyn/closer.cgi/flume/1.2.0/apache-flume-1.2.0-bin.tar.gz)
    Hadoop 1.0.3

    Regards,
    Jagadish
