One thing you can try is to pull each file out of S3 and run it through "gzip -d" to see whether it decompresses cleanly. My guess is there's a corrupted .gz file somewhere in your path glob.
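If it helps, here is a rough JVM-side version of that check (a sketch, untested against your setup; "downloads" is a hypothetical local directory holding the copies pulled down from S3). It streams each file through java.util.zip.GZIPInputStream and reports which ones fail to decompress:

    import java.io.{File, FileInputStream}
    import java.util.zip.GZIPInputStream

    // Fully drain each .gz through GZIPInputStream; a corrupt file (or a
    // mis-named non-gzip file) throws an IOException such as
    // "incorrect header check".
    def checkGzip(f: File): Option[String] =
      try {
        val in = new GZIPInputStream(new FileInputStream(f))
        try {
          val buf = new Array[Byte](8192)
          while (in.read(buf) != -1) {}   // drain the whole stream
          None                            // decompressed cleanly
        } finally in.close()
      } catch {
        case e: java.io.IOException => Some(e.getMessage) // bad header, truncated data, ...
      }

    val dir = new File("downloads")  // hypothetical: wherever you copied the S3 files
    Option(dir.listFiles).getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".gz"))
      .foreach { f =>
        checkGzip(f) match {
          case None      => println(s"OK   ${f.getName}")
          case Some(msg) => println(s"BAD  ${f.getName}: $msg")
        }
      }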
Andrew

On Wed, May 21, 2014 at 12:40 PM, Michael Cutler <mich...@tumra.com> wrote:

> Hi Nick,
>
> Which version of Hadoop are you using with Spark? I spotted an issue with
> the built-in GzipDecompressor while doing something similar with Hadoop
> 1.0.4: all my gzip files were valid and tested, yet certain files blew up
> in Hadoop/Spark.
>
> The following JIRA ticket goes into more detail, and it affects all Hadoop
> releases prior to 1.2.x:
> https://issues.apache.org/jira/browse/HADOOP-8900
>
> MC
>
> On 21 May 2014 14:26, Madhu <ma...@madhu.com> wrote:
>
>> Can you identify a specific file that fails?
>> There might be a real bug here, but I have found gzip to be reliable.
>> Every time I have run into a "bad header" error with gzip, I had a
>> non-gzip file with the wrong extension for whatever reason.
>>
>> -----
>> Madhu
>> https://www.linkedin.com/in/msiddalingaiah
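To Madhu's point: the quickest way to put a name to the failing file is to count each path on its own rather than the whole glob at once. A rough spark-shell sketch (the bucket and file names below are placeholders, not your real paths):

    // "sc" is the SparkContext the spark-shell provides.
    val paths = Seq(
      "s3n://your-bucket/logs/part-0001.gz",   // placeholders -- substitute
      "s3n://your-bucket/logs/part-0002.gz"    // the paths from your glob
    )

    paths.foreach { p =>
      try {
        // count() is an action, so it forces a full read/decompress of the file
        println(s"OK   $p -> ${sc.textFile(p).count()} lines")
      } catch {
        case e: Exception => println(s"FAIL $p: ${e.getMessage}")
      }
    }

Whichever path prints FAIL is the one to pull down and inspect with "gzip -d".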