YES, your hunches were correct. I’ve identified at least one file among the hundreds I’m processing that is indeed not a valid gzip file.
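For anyone curious, here is roughly how I checked: every valid gzip file starts with the two-byte magic number 0x1f 0x8b, so you can read just the first two bytes of each S3 key and flag anything that doesn't match, whatever its extension. A minimal sketch using boto; the bucket name and prefix are placeholders for my actual paths:

    # Sketch: flag keys with a .gz extension whose first two bytes are not
    # the gzip magic number (0x1f 0x8b). Assumes boto is installed and AWS
    # credentials are configured; 'bucket' and 'stuff/' are placeholders.
    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('bucket')

    for key in bucket.list(prefix='stuff/'):
        if not key.name.endswith('.gz'):
            continue
        magic = key.read(2)  # read only the leading two bytes
        key.close()
        if magic != '\x1f\x8b':
            print('not a valid gzip file: ' + key.name)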
Does anyone know of an easy way to exclude a specific file or files when calling sc.textFile() on a pattern? e.g. Something like:

    sc.textFile('s3n://bucket/stuff/*.gz', exclude='s3n://bucket/stuff/bad.gz')

Barring something built in, a possible workaround is sketched below the quoted thread.

On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Thanks for the suggestions, people. I will try to home in on which
> specific gzipped files, if any, are actually corrupt.
>
> Michael,
>
> I'm using Hadoop 1.0.4, which I believe is the default version that gets
> deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281
> <https://issues.apache.org/jira/browse/HADOOP-5281>, affects Hadoop 0.18.0,
> is fixed in 0.20.0, and is also related to gzip compression. I know there
> is some funkiness in how Hadoop is versioned, so I'm not sure whether this
> issue is relevant to 1.0.4.
>
> Were you able to resolve your issue by changing your version of Hadoop?
> How did you do that?
>
> Nick
>
> On Wed, May 21, 2014 at 11:38 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> One thing you can try is to pull each file out of S3 and decompress it
>> with "gzip -d" to see if it works. I'm guessing there's a corrupted .gz
>> file somewhere in your path glob.
>>
>> Andrew
>>
>> On Wed, May 21, 2014 at 12:40 PM, Michael Cutler <mich...@tumra.com> wrote:
>>
>>> Hi Nick,
>>>
>>> Which version of Hadoop are you using with Spark? I spotted an issue
>>> with the built-in GzipDecompressor while doing something similar with
>>> Hadoop 1.0.4. All my gzip files were valid and tested, yet certain files
>>> blew up from Hadoop/Spark.
>>>
>>> The following JIRA ticket goes into more detail:
>>> https://issues.apache.org/jira/browse/HADOOP-8900. It affects all Hadoop
>>> releases prior to 1.2.x.
>>>
>>> MC
>>>
>>> On 21 May 2014 14:26, Madhu <ma...@madhu.com> wrote:
>>>
>>>> Can you identify a specific file that fails?
>>>> There might be a real bug here, but I have found gzip to be reliable.
>>>> Every time I have run into a "bad header" error with gzip, I had a
>>>> non-gzip file with the wrong extension for whatever reason.
>>>>
>>>> -----
>>>> Madhu
>>>> https://www.linkedin.com/in/msiddalingaiah
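As for the workaround I mentioned above: I don't believe sc.textFile() supports an exclude option, but you can enumerate the matching keys yourself, drop the known-bad ones, and hand the survivors to sc.textFile() as a comma-separated string of paths, which it accepts since the underlying Hadoop FileInputFormat splits input paths on commas. A sketch under the same assumptions as before (placeholder bucket, prefix, and bad-key set; `sc` from the PySpark shell):

    # Sketch: build an explicit include list instead of relying on a glob.
    # 'bucket', 'stuff/', and BAD_KEYS are hypothetical placeholders.
    import boto

    BAD_KEYS = set(['stuff/bad.gz'])  # keys known to be corrupt

    conn = boto.connect_s3()
    bucket = conn.get_bucket('bucket')

    good_paths = ['s3n://bucket/' + key.name
                  for key in bucket.list(prefix='stuff/')
                  if key.name.endswith('.gz') and key.name not in BAD_KEYS]

    # sc.textFile() accepts a comma-separated list of paths, so the joined
    # string behaves like the original glob minus the excluded files.
    records = sc.textFile(','.join(good_paths))
    print(records.count())

It's more ceremony than an exclude parameter would be, but it has the nice side effect that the same key listing can drive the magic-number check above.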