Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old; I would upgrade to CDH3u5 or CDH 4.1.2.
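To check the concatenated-gzip theory from my earlier message (quoted below), a minimal Python sketch like the following might help. It walks a file one gzip member at a time and reports any bytes left over that do not parse as a gzip member. The function name split_members is made up for illustration, and the sketch assumes the file is small enough to read into memory.

```python
import gzip
import zlib


def split_members(data):
    """Split raw bytes into decompressed gzip members.

    Returns (members, leftover): a list with the decompressed payload of each
    gzip member found, and any trailing bytes that are not a valid member.
    """
    members = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        try:
            payload = decomp.decompress(data) + decomp.flush()
        except zlib.error:
            # The remaining bytes do not form a valid gzip member.
            break
        members.append(payload)
        # Bytes after the end of this member, if any.
        data = decomp.unused_data
    return members, data


if __name__ == "__main__":
    # Two gzip streams written back to back, as a flush-per-batch writer might do.
    blob = gzip.compress(b"event1\nevent2\n") + gzip.compress(b"event3\n")
    members, leftover = split_members(blob)
    print(len(members), "members,", len(leftover), "leftover bytes")
    print(sum(m.count(b"\n") for m in members), "lines in total")
```

On a file pulled from HDFS, something like split_members(open(path, "rb").read()) would show whether it holds several members. If it does, a reader that stops after the first member would see only part of the events; if leftover is non-empty, the tail is not valid gzip at all, which would match the "trailing garbage" warning.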

On Mon, Jan 14, 2013 at 3:27 PM, Sagar Mehta <[email protected]> wrote:
> About the bz2 suggestion, we have a ton of downstream jobs that assume gzip
> compressed files - so it is better to stick to gzip.
>
> The plan B for us is to have a Oozie step to gzip compress the logs before
> proceeding with downstream Hadoop jobs - but that looks like a hack to me!!
>
> Sagar
>
>
> On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <[email protected]> wrote:
>>
>> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat
>> collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
>>
>> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression
>> OK, trailing garbage ignored
>> 100
>>
>> This should be about 50,000 events for the 5 min window!!
>>
>> Sagar
>>
>> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Can you try: zcat file > output
>>>
>>> I think what is occurring is because of the flush the output file is
>>> actually several concatenated gz files.
>>>
>>> Brock
>>>
>>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <[email protected]>
>>> wrote:
>>> > Yeah I have tried the text write format in vain before, but
>>> > nevertheless
>>> > gave it a try again!! Below is the latest file - still the same thing.
>>> >
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
>>> > Mon Jan 14 23:02:07 UTC 2013
>>> >
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls
>>> >
>>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> > Found 1 items
>>> > -rw-r--r-- 3 hadoop supergroup 4798117 2013-01-14 22:55
>>> >
>>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget
>>> >
>>> > /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> > .
>>> > hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip
>>> > collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> >
>>> > gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz:
>>> > decompression
>>> > OK, trailing garbage ignored
>>> >
>>> > Interestingly enough, the gzip page says it is a harmless warning -
>>> > http://www.gzip.org/#faq8
>>> >
>>> > However, I'm losing events on decompression so I cannot afford to
>>> > ignore
>>> > this warning. The gzip page gives an example about magnetic tape -
>>> > there is
>>> > an analogy of hdfs block here since the file is initially stored in
>>> > hdfs
>>> > before I pull it out on the local filesystem.
>>> >
>>> > Sagar
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson
>>> > <[email protected]>
>>> > wrote:
>>> >>
>>> >> collector102.sinks.sink1.hdfs.writeFormat = TEXT
>>> >> collector102.sinks.sink2.hdfs.writeFormat = TEXT
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce -
>>> http://incubator.apache.org/mrunit/
>>
>>
>

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
