About the bz2 suggestion: we have a ton of downstream jobs that assume gzip-compressed files, so it is better to stick to gzip.
The plan B for us is to have an Oozie step to gzip-compress the logs before proceeding with the downstream Hadoop jobs - but that looks like a hack to me!!

Sagar

On Mon, Jan 14, 2013 at 3:24 PM, Sagar Mehta <[email protected]> wrote:
> hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l
>
> gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression OK, trailing garbage ignored
> 100
>
> This should be about 50,000 events for the 5 min window!!
>
> Sagar
>
> On Mon, Jan 14, 2013 at 3:16 PM, Brock Noland <[email protected]> wrote:
>> Hi,
>>
>> Can you try: zcat file > output
>>
>> I think what is occurring is, because of the flush, the output file is actually several concatenated gz files.
>>
>> Brock
>>
>> On Mon, Jan 14, 2013 at 3:12 PM, Sagar Mehta <[email protected]> wrote:
>>> Yeah, I have tried the text write format in vain before, but nevertheless gave it a try again!! Below is the latest file - still the same thing.
>>>
>>> hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
>>> Mon Jan 14 23:02:07 UTC 2013
>>>
>>> hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>> Found 1 items
>>> -rw-r--r--   3 hadoop supergroup   4798117 2013-01-14 22:55 /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>>
>>> hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz .
>>> hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
>>>
>>> gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz: decompression OK, trailing garbage ignored
>>>
>>> Interestingly enough, the gzip FAQ says it is a harmless warning - http://www.gzip.org/#faq8
>>>
>>> However, I'm losing events on decompression, so I cannot afford to ignore this warning. The gzip FAQ gives an example about magnetic tape - there is an analogy to an HDFS block here, since the file is initially stored in HDFS before I pull it out onto the local filesystem.
>>>
>>> Sagar
>>>
>>> On Mon, Jan 14, 2013 at 2:52 PM, Connor Woodson <[email protected]> wrote:
>>>> collector102.sinks.sink1.hdfs.writeFormat = TEXT
>>>> collector102.sinks.sink2.hdfs.writeFormat = TEXT
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
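Brock's diagnosis above can be reproduced in miniature: each flush by the sink ends one gzip member, so the file on HDFS is really several gzip streams back to back. A decompressor that stops after the first member reports the rest as "trailing garbage" and silently drops those events, while a multi-member reader (which is what `zcat file > output` effectively does) recovers everything. A minimal Python sketch, assuming nothing beyond the standard library (the `event-N` records are made up for illustration):

```python
import gzip
import io
import zlib

# Build a file of two concatenated gzip members, the way a sink that
# flushes mid-file would: each flush closes one complete member.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"event-1\nevent-2\n")  # first member (before the flush)
with gzip.GzipFile(fileobj=buf, mode="ab") as gz:
    gz.write(b"event-3\nevent-4\n")  # second member (after the flush)
data = buf.getvalue()

# A single-stream decompressor stops at the end of the first member;
# the bytes of the second member are the "trailing garbage" gzip warns about.
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
first_member = d.decompress(data)
print(first_member)            # only event-1 and event-2 come out
print(len(d.unused_data) > 0)  # True: undecoded bytes remain after member 1

# A multi-member reader walks every gzip stream in the file and
# recovers all of the events.
everything = gzip.decompress(data)
print(everything)
```

The same logic explains the `wc -l` count of 100 in the thread: the decompression stopped at the boundary of the first flushed member instead of reading the whole file.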
