Hmm - good point!! Even in the best case - say this works - moving the 2 production clusters that depend on it [400+ nodes] to a newer Hadoop version will need some thorough testing and won't be immediate.

I would have loved for the gzip compression part to have worked more or less out of the box, but for now the most likely path seems to be an Oozie step to pre-compress before downstream takes over. I'm still open to suggestions/insights from this group, which has been super-prompt so far :)

Sagar
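For what it's worth, a minimal sketch of what that pre-compression step could look like as an Oozie shell action - assuming the sink were switched to plain text output, and with the directory and .log suffix below purely illustrative:

    #!/bin/bash
    # Hypothetical Oozie shell-action script: gzip one 5-minute directory
    # of plain-text collector output so downstream jobs see ordinary,
    # single-member .gz files. In practice DIR would be parameterized by
    # the coordinator; the path here just mirrors the one in this thread.
    DIR=/ngpipes-raw-logs/2013-01-14/2200

    for f in $(hadoop fs -ls "$DIR" | awk '{print $NF}' | grep '\.log$'); do
      # Stream through gzip and write back alongside the original;
      # 'hadoop fs -put -' reads from stdin.
      hadoop fs -cat "$f" | gzip | hadoop fs -put - "${f%.log}.gz"
      hadoop fs -rm "$f"
    done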
On Mon, Jan 14, 2013 at 4:54 PM, Brock Noland <[email protected]> wrote:
> Hi,
>
> That's just the file channel. The HDFSEventSink will need a heck of a
> lot more than just those two jars. To override the version of Hadoop
> it will find from the hadoop command, you probably want to set
> HADOOP_HOME in flume-env.sh to your custom install.
>
> Also, the client and server should be the same version.
>
> Brock
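For reference, that override might look something like the following in flume-env.sh - the install path here is an assumed example, not taken from this thread:

    # flume-env.sh -- point the flume-ng launcher at a specific Hadoop
    # install so its hadoop-core ends up on the agent's classpath,
    # rather than whatever 'hadoop' is first on PATH.
    export HADOOP_HOME=/opt/hadoop-0.20.2-cdh3u5   # illustrative path
    export PATH="$HADOOP_HOME/bin:$PATH"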
> On Mon, Jan 14, 2013 at 4:43 PM, Sagar Mehta <[email protected]> wrote:
> > ok so I dropped in the new hadoop-core jar in /opt/flume/lib [I got some
> > errors about the guava dependencies so put in that jar too]
> >
> > smehta@collector102:/opt/flume/lib$ ls -ltrh | grep -e "hadoop-core" -e "guava"
> > -rw-r--r-- 1 hadoop hadoop 1.5M 2012-11-14 21:49 guava-10.0.1.jar
> > -rw-r--r-- 1 hadoop hadoop 3.7M 2013-01-14 23:50 hadoop-core-0.20.2-cdh3u5.jar
> >
> > Now I don't even see the file being created in hdfs, and the flume log is
> > happily talking about housekeeping for some file channel checkpoints,
> > updating pointers et al.
> >
> > Below is a tail of the flume log:
> >
> > hadoop@collector102:/data/flume_log$ tail -10 flume.log
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel2/data/log-36 position: 129415524 logWriteOrderID: 1358209947324
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-34
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel1/data/log-36 position: 129415524 logWriteOrderID: 1358209947323
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-34
> > 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947324
> > 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947323
> > 2013-01-15 00:42:10,820 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-35
> > 2013-01-15 00:42:10,821 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-35
> > 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947323
> > 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947324
> >
> > Sagar
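A quick way to confirm which hadoop-core the running agent actually loaded - as opposed to which jar sits in /opt/flume/lib - is to pull the classpath out of the live JVM's command line. A rough one-liner, assuming a single Flume process on the box:

    # Split the agent's java command line on ':' (the classpath
    # separator) and show any hadoop-core entries it is actually using.
    ps -ef | grep '[f]lume' | tr ':' '\n' | grep 'hadoop-core'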
> > On Mon, Jan 14, 2013 at 3:38 PM, Brock Noland <[email protected]> wrote:
> >> Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old;
> >> I would upgrade to CDH3u5 or CDH 4.1.2.
> >>
> >> On Mon, Jan 14, 2013 at 3:27 PM, Sagar Mehta <[email protected]> wrote:
> >> > About the bz2 suggestion, we have a ton of downstream jobs that assume
> >> > gzip compressed files - so it is better to stick to gzip.
> >> >
> >> > The plan B for us is to have an Oozie step to gzip compress the logs
> >> > before proceeding with the downstream Hadoop jobs - but that looks like
> >> > a hack to me!!
> >> >
> >> > Sagar
> >> >
> >> > On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <[email protected]> wrote:
> >> >> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
> >> >>
> >> >> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression OK, trailing garbage ignored
> >> >> 100
> >> >>
> >> >> This should be about 50,000 events for the 5-min window!!
> >> >>
> >> >> Sagar
> >> >>
> >> >> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <[email protected]> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> Can you try: zcat file > output
> >> >>>
> >> >>> I think what is occurring is, because of the flush, the output file is
> >> >>> actually several concatenated gz files.
> >> >>>
> >> >>> Brock
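Brock's concatenation theory is easy to reproduce locally, for what it's worth. A multi-member gzip file is perfectly legal and zcat reads every member; the "trailing garbage ignored" warning appears when the bytes after the last parsable member don't start with a gzip header, in which case gzip discards them. A small sketch with stock gzip:

    # A two-member gzip file -- the shape a flushed-and-reopened
    # compressed stream can produce. zcat reads both members.
    printf 'event-1\n' | gzip >  multi.gz
    printf 'event-2\n' | gzip >> multi.gz
    zcat multi.gz | wc -l        # prints 2

    # 'trailing garbage' is different: bytes after the last valid
    # member that are not a gzip header are dropped with a warning.
    printf 'not-a-gzip-member' >> multi.gz
    zcat multi.gz | wc -l        # still prints 2, plus the warning

That would be consistent with the counts above: the leading member decodes fine, and whatever follows it in these files apparently doesn't parse as a gzip member, so everything after that point is dropped.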
> >> >>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <[email protected]> wrote:
> >> >>> > Yeah, I had tried the text write format in vain before, but nevertheless
> >> >>> > gave it a try again!! Below is the latest file - still the same thing.
> >> >>> >
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
> >> >>> > Mon Jan 14 23:02:07 UTC 2013
> >> >>> >
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> >> >>> > Found 1 items
> >> >>> > -rw-r--r-- 3 hadoop supergroup 4798117 2013-01-14 22:55 /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> >> >>> >
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz .
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> >> >>> >
> >> >>> > gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz: decompression OK, trailing garbage ignored
> >> >>> >
> >> >>> > Interestingly enough, the gzip page says it is a harmless warning - http://www.gzip.org/#faq8
> >> >>> >
> >> >>> > However, I'm losing events on decompression, so I cannot afford to ignore
> >> >>> > this warning. The gzip page gives an example about magnetic tape - there
> >> >>> > may be an analogy to an hdfs block here, since the file is initially stored
> >> >>> > in hdfs before I pull it out onto the local filesystem.
> >> >>> >
> >> >>> > Sagar
> >> >>> >
> >> >>> > On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson <[email protected]> wrote:
> >> >>> >> collector102.sinks.sink1.hdfs.writeFormat = TEXT
> >> >>> >> collector102.sinks.sink2.hdfs.writeFormat = TEXT
> >> >>>
> >> >>> --
> >> >>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
> >>
> >> --
> >> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
>
> --
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/