Hi,

That's just the file channel. The HDFSEventSink will need a heck of a lot more than just those two jars. To override the version of Hadoop it picks up from the hadoop command, you probably want to set HADOOP_HOME in flume-env.sh to point at your custom install.
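A minimal sketch of what that could look like in conf/flume-env.sh - the install path below is hypothetical, substitute your own:

  # Point flume at the custom hadoop install; the flume-ng launcher should
  # then resolve the hadoop command and its jars from here instead of the
  # system-wide one. (Path is illustrative.)
  export HADOOP_HOME=/opt/hadoop-0.20.2-cdh3u5
  export PATH=$HADOOP_HOME/bin:$PATH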
Also, the client and server should be the same version.

Brock

On Mon, Jan 14, 2013 at 4:43 PM, Sagar Mehta <[email protected]> wrote:
> OK, so I dropped the new hadoop-core jar into /opt/flume/lib [I got some
> errors about the guava dependencies, so I put in that jar too].
>
> smehta@collector102:/opt/flume/lib$ ls -ltrh | grep -e "hadoop-core" -e "guava"
> -rw-r--r-- 1 hadoop hadoop 1.5M 2012-11-14 21:49 guava-10.0.1.jar
> -rw-r--r-- 1 hadoop hadoop 3.7M 2013-01-14 23:50 hadoop-core-0.20.2-cdh3u5.jar
>
> Now I don't even see the file being created in hdfs, and the flume log is
> happily talking about housekeeping for some file channel checkpoints,
> updating pointers, et al.
>
> Below is a tail of the flume log:
>
> hadoop@collector102:/data/flume_log$ tail -10 flume.log
> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel2/data/log-36 position: 129415524 logWriteOrderID: 1358209947324
> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-34
> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel1/data/log-36 position: 129415524 logWriteOrderID: 1358209947323
> 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-34
> 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947324
> 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947323
> 2013-01-15 00:42:10,820 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-35
> 2013-01-15 00:42:10,821 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-35
> 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947323
> 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947324
>
> Sagar
>
> On Mon, Jan 14, 2013 at 3:38 PM, Brock Noland <[email protected]> wrote:
>> Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old;
>> I would upgrade to CDH3u5 or CDH 4.1.2.
>>
>> On Mon, Jan 14, 2013 at 3:27 PM, Sagar Mehta <[email protected]> wrote:
>> > About the bz2 suggestion: we have a ton of downstream jobs that assume
>> > gzip-compressed files, so it is better to stick with gzip.
>> >
>> > Plan B for us is to have an Oozie step gzip-compress the logs before
>> > proceeding with the downstream Hadoop jobs - but that looks like a hack to me!!
>> >
>> > Sagar
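On the client/server version point above, a quick sanity check - assuming the hadoop command is on the PATH of both boxes; the host name below is hypothetical:

  hadoop version                        # on the collector running flume
  ssh namenode301 'hadoop version'      # on the cluster side; the two should match
  ls /opt/flume/lib | grep hadoop-core  # the client jar flume loads should match too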
>> > On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <[email protected]> wrote:
>> >> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
>> >>
>> >> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression OK, trailing garbage ignored
>> >> 100
>> >>
>> >> This should be about 50,000 events for the 5-min window!!
>> >>
>> >> Sagar
>> >>
>> >> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <[email protected]> wrote:
>> >>> Hi,
>> >>>
>> >>> Can you try: zcat file > output
>> >>>
>> >>> I think what is occurring is that, because of the flush, the output
>> >>> file is actually several concatenated gz files.
>> >>>
>> >>> Brock
>> >>>
>> >>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <[email protected]> wrote:
>> >>> > Yeah, I have tried the text write format in vain before, but
>> >>> > nevertheless gave it a try again!! Below is the latest file - still
>> >>> > the same thing.
>> >>> >
>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
>> >>> > Mon Jan 14 23:02:07 UTC 2013
>> >>> >
>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>> >>> > Found 1 items
>> >>> > -rw-r--r-- 3 hadoop supergroup 4798117 2013-01-14 22:55 /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>> >>> >
>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz .
>> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>> >>> >
>> >>> > gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz: decompression OK, trailing garbage ignored
>> >>> >
>> >>> > Interestingly enough, the gzip page says it is a harmless warning -
>> >>> > http://www.gzip.org/#faq8
>> >>> >
>> >>> > However, I'm losing events on decompression, so I cannot afford to
>> >>> > ignore this warning. The gzip page gives an example about magnetic
>> >>> > tape - there is an analogy to the hdfs block here, since the file is
>> >>> > initially stored in hdfs before I pull it out onto the local
>> >>> > filesystem.
>> >>> >
>> >>> > Sagar
>> >>> >
>> >>> > On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson <[email protected]> wrote:
>> >>> >> collector102.sinks.sink1.hdfs.writeFormat = TEXT
>> >>> >> collector102.sinks.sink2.hdfs.writeFormat = TEXT

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
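For reference, the multi-member behavior Brock describes is easy to reproduce locally: gzip files are concatenable by design, and zcat decompresses every member, which is why zcat file > output can recover events that a single-member read would drop. A minimal sketch (file names are illustrative):

  printf 'event1\n' | gzip > part1.gz
  printf 'event2\n' | gzip > part2.gz
  cat part1.gz part2.gz > combined.gz   # same shape as a sink flushing several gzip members into one file
  zcat combined.gz | wc -l              # prints 2 - both members are decompressed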
