That's a neat idea. I'll try that out.
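
For the archives, here is a rough, untested Scala sketch of what I'll try, fleshed out from Patrick's pseudo-code below. It assumes a spark-shell where sc already exists, that the Hadoop configuration already carries the S3 credentials, and a hypothetical bad.gz as the file to skip. Note that listStatus() does not expand globs, so this uses globStatus() instead. (A magic-byte check for spotting the bad file in the first place is at the bottom of this mail.)

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("s3n://bucket"), conf)

// Expand the glob ourselves, then drop the known-bad file.
val files = fs.globStatus(new Path("s3n://bucket/stuff/*.gz"))
  .filter(_.getPath.getName != "bad.gz") // hypothetical bad file

// textFile() accepts a comma-separated list of paths.
val fileStr = files.map(_.getPath.toString).mkString(",")
val lines = sc.textFile(fileStr)
println(lines.count())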
On Sat, May 31, 2014 at 2:45 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> I think there are a few ways to do this... the simplest one might be to
> manually build a set of comma-separated paths that excludes the bad file,
> and pass that to textFile().
>
> When you call textFile(), under the hood it is going to pass your filename
> string to hadoopFile(), which calls setInputPaths() on the Hadoop
> FileInputFormat:
>
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,
> org.apache.hadoop.fs.Path...)
>
> I think this can accept a comma-separated list of paths.
>
> So you could do something like this (this is pseudo-code):
>
> files = fs.listStatus("s3n://bucket/stuff/*.gz")
> files = files.filter(not the bad file)
> fileStr = files.map(f => f.getPath.toString).mkString(",")
> sc.textFile(fileStr)...
>
> - Patrick
>
> On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> YES, your hunches were correct. I've identified at least one file among
>> the hundreds I'm processing that is indeed not a valid gzip file.
>>
>> Does anyone know of an easy way to exclude a specific file or files when
>> calling sc.textFile() on a pattern? E.g. something like:
>>
>> sc.textFile('s3n://bucket/stuff/*.gz, exclude:s3n://bucket/stuff/bad.gz')
>>
>> On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>>
>>> Thanks for the suggestions, people. I will try to home in on which
>>> specific gzipped files, if any, are actually corrupt.
>>>
>>> Michael,
>>>
>>> I'm using Hadoop 1.0.4, which I believe is the default version that
>>> gets deployed by spark-ec2. The JIRA issue I linked to earlier,
>>> HADOOP-5281 <https://issues.apache.org/jira/browse/HADOOP-5281>,
>>> affects Hadoop 0.18.0, is fixed in 0.20.0, and is also related to gzip
>>> compression. I know there is some funkiness in how Hadoop is versioned,
>>> so I'm not sure whether this issue is relevant to 1.0.4.
>>>
>>> Were you able to resolve your issue by changing your version of Hadoop?
>>> How did you do that?
>>>
>>> Nick
>>>
>>> On Wed, May 21, 2014 at 11:38 PM, Andrew Ash <and...@andrewash.com>
>>> wrote:
>>>
>>>> One thing you can try is to pull each file out of S3 and decompress it
>>>> with "gzip -d" to see if it works. I'm guessing there's a corrupted
>>>> .gz file somewhere in your path glob.
>>>>
>>>> Andrew
>>>>
>>>> On Wed, May 21, 2014 at 12:40 PM, Michael Cutler <mich...@tumra.com>
>>>> wrote:
>>>>
>>>>> Hi Nick,
>>>>>
>>>>> Which version of Hadoop are you using with Spark? I spotted an issue
>>>>> with the built-in GzipDecompressor while doing something similar with
>>>>> Hadoop 1.0.4. All my gzip files were valid and tested, yet certain
>>>>> files blew up from Hadoop/Spark.
>>>>>
>>>>> The following JIRA ticket goes into more detail, and the issue
>>>>> affects all Hadoop releases prior to 1.2.x:
>>>>> https://issues.apache.org/jira/browse/HADOOP-8900
>>>>>
>>>>> MC
>>>>>
>>>>> On 21 May 2014 14:26, Madhu <ma...@madhu.com> wrote:
>>>>>
>>>>>> Can you identify a specific file that fails?
>>>>>> There might be a real bug here, but I have found gzip to be
>>>>>> reliable. Every time I have run into a "bad header" error with
>>>>>> gzip, I had a non-gzip file with the wrong extension for whatever
>>>>>> reason.
>>>>>>
>>>>>> Madhu
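
P.S. For anyone else chasing this down, here is an equally rough, untested sketch for Madhu's diagnosis: a file with a .gz extension that is not actually gzip data. A valid gzip stream starts with the magic bytes 0x1f 0x8b, so anything under the glob that starts differently is a suspect. Same hypothetical bucket and credential assumptions as the sketch at the top of this mail.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(new URI("s3n://bucket"), conf)

// Flag any *.gz object whose first two bytes are not the gzip magic number.
val suspects = fs.globStatus(new Path("s3n://bucket/stuff/*.gz")).filter { status =>
  val in = fs.open(status.getPath)
  try {
    val header = new Array[Byte](2)
    val n = in.read(header)
    // Too short, or missing the 0x1f 0x8b gzip header.
    n < 2 || header(0) != 0x1f.toByte || header(1) != 0x8b.toByte
  } finally {
    in.close()
  }
}

suspects.foreach(s => println(s"Not a gzip file: ${s.getPath}"))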