Hmm - good point!! Even in the best case - say this works - moving the 2 production clusters that depend on it [400+ nodes] to a newer Hadoop version will need some thorough testing and won't be immediate.

I would have loved for the gzip compression part to have worked more or less out of the box, but for now the most likely path seems to be an Oozie step to pre-compress before downstream takes over. I'm still open to suggestions/insights from this group, which has been super-prompt so far :)

Sagar
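For what it's worth, a minimal sketch of what that pre-compression step could look like as an Oozie shell action - assuming the sink were switched to plain text output, and with the directory and .log suffix below purely illustrative:

    #!/bin/bash
    # Hypothetical Oozie shell-action script: gzip one 5-minute directory
    # of plain-text collector output so downstream jobs see ordinary,
    # single-member .gz files. In practice DIR would be parameterized by
    # the coordinator; the path here just mirrors the one in this thread.
    DIR=/ngpipes-raw-logs/2013-01-14/2200

    for f in $(hadoop fs -ls "$DIR" | awk '{print $NF}' | grep '\.log$'); do
      # Stream through gzip and write back alongside the original;
      # 'hadoop fs -put -' reads from stdin.
      hadoop fs -cat "$f" | gzip | hadoop fs -put - "${f%.log}.gz"
      hadoop fs -rm "$f"
    done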
On Mon, Jan 14, 2013 at 4:54 PM, Brock Noland <[email protected]> wrote:
> Hi,
>
> That's just the file channel. The HDFSEventSink will need a heck of a
> lot more than just those two jars. To override the version of Hadoop
> it will find from the hadoop command, you probably want to set
> HADOOP_HOME in flume-env.sh to your custom install.
>
> Also, the client and server should be the same version.
>
> Brock
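For reference, that override might look something like the following in flume-env.sh - the install path here is an assumed example, not taken from this thread:

    # flume-env.sh -- point the flume-ng launcher at a specific Hadoop
    # install so its hadoop-core ends up on the agent's classpath,
    # rather than whatever 'hadoop' is first on PATH.
    export HADOOP_HOME=/opt/hadoop-0.20.2-cdh3u5   # illustrative path
    export PATH="$HADOOP_HOME/bin:$PATH"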
> On Mon, Jan 14, 2013 at 4:43 PM, Sagar Mehta <[email protected]> wrote:
> > ok so I dropped in the new hadoop-core jar in /opt/flume/lib [I got some
> > errors about the guava dependencies so put in that jar too]
> >
> > smehta@collector102:/opt/flume/lib$ ls -ltrh | grep -e "hadoop-core" -e "guava"
> > -rw-r--r-- 1 hadoop hadoop 1.5M 2012-11-14 21:49 guava-10.0.1.jar
> > -rw-r--r-- 1 hadoop hadoop 3.7M 2013-01-14 23:50 hadoop-core-0.20.2-cdh3u5.jar
> >
> > Now I don't even see the file being created in hdfs, and the flume log is
> > happily talking about housekeeping for some file channel checkpoints,
> > updating pointers et al.
> >
> > Below is a tail of the flume log:
> >
> > hadoop@collector102:/data/flume_log$ tail -10 flume.log
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel2/data/log-36 position: 129415524 logWriteOrderID: 1358209947324
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-34
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel1/data/log-36 position: 129415524 logWriteOrderID: 1358209947323
> > 2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-34
> > 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947324
> > 2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947323
> > 2013-01-15 00:42:10,820 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-35
> > 2013-01-15 00:42:10,821 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-35
> > 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947323
> > 2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947324
> >
> > Sagar
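A quick way to confirm which hadoop-core the running agent actually loaded - as opposed to which jar sits in /opt/flume/lib - is to pull the classpath out of the live JVM's command line. A rough one-liner, assuming a single Flume process on the box:

    # Split the agent's java command line on ':' (the classpath
    # separator) and show any hadoop-core entries it is actually using.
    ps -ef | grep '[f]lume' | tr ':' '\n' | grep 'hadoop-core'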
> > On Mon, Jan 14, 2013 at 3:38 PM, Brock Noland <[email protected]> wrote:
> >> Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old;
> >> I would upgrade to CDH3u5 or CDH 4.1.2.
> >>
> >> On Mon, Jan 14, 2013 at 3:27 PM, Sagar Mehta <[email protected]> wrote:
> >> > About the bz2 suggestion, we have a ton of downstream jobs that assume
> >> > gzip compressed files - so it is better to stick to gzip.
> >> >
> >> > The plan B for us is to have an Oozie step to gzip compress the logs
> >> > before proceeding with the downstream Hadoop jobs - but that looks like
> >> > a hack to me!!
> >> >
> >> > Sagar
> >> >
> >> > On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <[email protected]> wrote:
> >> >> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
> >> >>
> >> >> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression OK, trailing garbage ignored
> >> >> 100
> >> >>
> >> >> This should be about 50,000 events for the 5-min window!!
> >> >>
> >> >> Sagar
> >> >>
> >> >> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <[email protected]> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> Can you try: zcat file > output
> >> >>>
> >> >>> I think what is occurring is, because of the flush, the output file is
> >> >>> actually several concatenated gz files.
> >> >>>
> >> >>> Brock
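Brock's concatenation theory is easy to reproduce locally, for what it's worth. A multi-member gzip file is perfectly legal and zcat reads every member; the "trailing garbage ignored" warning appears when the bytes after the last parsable member don't start with a gzip header, in which case gzip discards them. A small sketch with stock gzip:

    # A two-member gzip file -- the shape a flushed-and-reopened
    # compressed stream can produce. zcat reads both members.
    printf 'event-1\n' | gzip >  multi.gz
    printf 'event-2\n' | gzip >> multi.gz
    zcat multi.gz | wc -l        # prints 2

    # 'trailing garbage' is different: bytes after the last valid
    # member that are not a gzip header are dropped with a warning.
    printf 'not-a-gzip-member' >> multi.gz
    zcat multi.gz | wc -l        # still prints 2, plus the warning

That would be consistent with the counts above: the leading member decodes fine, and whatever follows it in these files apparently doesn't parse as a gzip member, so everything after that point is dropped.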
> >> >>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <[email protected]> wrote:
> >> >>> > Yeah, I had tried the text write format in vain before, but nevertheless
> >> >>> > gave it a try again!! Below is the latest file - still the same thing.
> >> >>> >
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
> >> >>> > Mon Jan 14 23:02:07 UTC 2013
> >> >>> >
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> >> >>> > Found 1 items
> >> >>> > -rw-r--r-- 3 hadoop supergroup 4798117 2013-01-14 22:55 /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> >> >>> >
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz .
> >> >>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
> >> >>> >
> >> >>> > gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz: decompression OK, trailing garbage ignored
> >> >>> >
> >> >>> > Interestingly enough, the gzip page says it is a harmless warning - http://www.gzip.org/#faq8
> >> >>> >
> >> >>> > However, I'm losing events on decompression, so I cannot afford to ignore
> >> >>> > this warning. The gzip page gives an example about magnetic tape - there
> >> >>> > may be an analogy to an hdfs block here, since the file is initially stored
> >> >>> > in hdfs before I pull it out onto the local filesystem.
> >> >>> >
> >> >>> > Sagar
> >> >>> >
> >> >>> > On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson <[email protected]> wrote:
> >> >>> >> collector102.sinks.sink1.hdfs.writeFormat = TEXT
> >> >>> >> collector102.sinks.sink2.hdfs.writeFormat = TEXT
> >> >>>
> >> >>> --
> >> >>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
> >>
> >> --
> >> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
>
> --
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/