Hi Scott,
I work with lots of gzipped files too, and I used to get the same error
sometimes.
I started checking the gzip files before processing them; in fact I check
them immediately after I put them on HDFS.
What I do is cat each gzip file and pipe it into gzip -t.

For example, in the output below, any file that is immediately followed by a gzip error is corrupted.

$> for i in `hadoop fs -ls /user/cdh-hadoop/mscdata/edgecast/201010/ | awk
'{print $8}'`; do echo $i; hadoop fs -cat $i | gzip -t; done
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0000.log.gz
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0001.log.gz
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0002.log.gz
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0003.log.gz
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0004.log.gz
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0005.log.gz
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0006.log.gz
gzip: stdin: invalid compressed data--crc error
gzip: stdin: invalid compressed data--length error
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0007.log.gz
/user/cdh-hadoop/mscdata/edgecast/201010/wpc_NORM_201010-0001-0008.log.gz
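
If you have a lot of files, it can be handier to print only the ones that fail the check. Here is a small sketch of that idea as a shell function; it runs gzip -t on local files (for HDFS you would swap the plain `gzip -t "$f"` for `hadoop fs -cat "$f" | gzip -t`, as in the loop above). The function name check_gz is just something I made up for the example.

```shell
# Print only the files that fail the gzip integrity test.
# For files on HDFS, replace the gzip -t call with:
#   hadoop fs -cat "$f" | gzip -t
check_gz() {
  for f in "$@"; do
    if ! gzip -t "$f" 2>/dev/null; then
      echo "CORRUPT: $f"
    fi
  done
}
```

Then something like `check_gz /data/logs/*.log.gz` gives you just the bad files, which is easier to act on than scanning the full listing by eye.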


Hope that works for you!


On Wed, Feb 16, 2011 at 12:52 PM, Kester, Scott <[email protected]> wrote:

> This may be better asked on one of the other Hadoop lists, but as the job
> in question is done with Pig I thought I would start here.  I have a nightly
> job that runs against around 1000 gzip log files.  Around once a week one of
> the map tasks will fail, reporting some form of gzip error/corruption in the
> input file. The job still completes successfully, as we have set
> mapred.max.map.failures.percent = 1 to allow a few input files to fail
> without aborting the entire job.
>
>
>  Sometimes I can find the name of the corrupt input file in the logs
> available for the map task from the Map/Reduce Administration page on port
> 50030 of the name node.  However, most of the time the name is not in these
> logs.  I can find the map task id, of the form
> attempt_201102141346_0097_m_000000_0, but would like to know how, if
> possible, to find the name of the corrupted input file.  Is there a
> Pig/Hadoop file/log somewhere that associates the attempt id with the input
> file?
>
> Thanks,
> Scott
>
>


-- 
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840
