YES, your hunches were correct. I’ve identified at least one file among the
hundreds I’m processing that is indeed not a valid gzip file.
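
For anyone wanting to repeat the check locally, here is a minimal sketch of how a file can be tested for gzip validity once it has been pulled down from S3. It is an illustration only, not the exact procedure used in this thread; the function name is_valid_gzip and the chunk size are assumptions made for the example.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if the local file at `path` decompresses cleanly as gzip.

    Hypothetical helper: checks the magic bytes first, then reads the
    whole stream so that truncated or corrupt data is also detected.
    """
    try:
        with open(path, "rb") as raw:
            if raw.read(2) != GZIP_MAGIC:
                return False
        with gzip.open(path, "rb") as fh:
            # Read to the end in chunks; errors surface while decompressing.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

A file that is plain text with a .gz extension fails the magic-byte check straight away, which matches the "wrong extension" case Madhu describes further down the thread.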

Does anyone know of an easy way to exclude a specific file or files when
calling sc.textFile() on a pattern? For example, something like:

sc.textFile('s3n://bucket/stuff/*.gz, exclude:s3n://bucket/stuff/bad.gz')
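
One possible workaround, sketched under assumptions: sc.textFile() accepts a comma-separated list of paths, so the exclusion can be done before the call by filtering an explicit file listing. The helper build_input_paths and the hard-coded listing below are hypothetical; in practice the listing would come from whatever S3 client is available.

```python
from fnmatch import fnmatchcase


def build_input_paths(all_paths, pattern, exclude):
    """Return a comma-separated string of the paths that match `pattern`
    and are not listed in `exclude` (hypothetical helper)."""
    excluded = set(exclude)
    kept = [p for p in all_paths if fnmatchcase(p, pattern) and p not in excluded]
    return ",".join(kept)


# Hypothetical listing; a real one would be fetched from S3.
listing = [
    "s3n://bucket/stuff/a.gz",
    "s3n://bucket/stuff/bad.gz",
    "s3n://bucket/stuff/b.gz",
]
paths = build_input_paths(
    listing,
    "s3n://bucket/stuff/*.gz",
    ["s3n://bucket/stuff/bad.gz"],
)
# With an existing SparkContext `sc`, the result could then be passed on:
# rdd = sc.textFile(paths)
```

The filtering happens entirely on the driver, so it only helps when the set of bad files is already known.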

On Wed, May 21, 2014 at 11:50 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Thanks for the suggestions, people. I will try to home in on which
> specific gzipped files, if any, are actually corrupt.
>
> Michael,
>
> I’m using Hadoop 1.0.4, which I believe is the default version that gets
> deployed by spark-ec2. The JIRA issue I linked to earlier, HADOOP-5281
> <https://issues.apache.org/jira/browse/HADOOP-5281>, is also related to
> gzip compression; it affects Hadoop 0.18.0 and is fixed in 0.20.0. I know
> there is some funkiness in how Hadoop is versioned, so I’m not sure whether
> this issue is relevant to 1.0.4.
>
> Were you able to resolve your issue by changing your version of Hadoop?
> How did you do that?
>
> Nick
>
> On Wed, May 21, 2014 at 11:38 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> One thing you can try is to pull each file out of S3 and decompress with
>> "gzip -d" to see if it works.  I'm guessing there's a corrupted .gz file
>> somewhere in your path glob.
>>
>> Andrew
>>
>>
>> On Wed, May 21, 2014 at 12:40 PM, Michael Cutler <mich...@tumra.com>
>> wrote:
>>
>>> Hi Nick,
>>>
>>> Which version of Hadoop are you using with Spark?  I spotted an issue
>>> with the built-in GzipDecompressor while doing something similar with
>>> Hadoop 1.0.4: all my Gzip files were valid and tested, yet certain files
>>> blew up from Hadoop/Spark.
>>>
>>> The following JIRA ticket goes into more detail; the issue affects all
>>> Hadoop releases prior to 1.2.X:
>>> https://issues.apache.org/jira/browse/HADOOP-8900
>>>
>>> MC
>>>
>>> On 21 May 2014 14:26, Madhu <ma...@madhu.com> wrote:
>>>
>>>> Can you identify a specific file that fails?
>>>> There might be a real bug here, but I have found gzip to be reliable.
>>>> Every time I have run into a "bad header" error with gzip, I had a
>>>> non-gzip file with the wrong extension for whatever reason.
>>>>
>>>>
>>>>
>>>>
>>>> -----
>>>> Madhu
>>>> https://www.linkedin.com/in/msiddalingaiah
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6169.html
>>>>
>>>
>>>
>>
>
