[jira] [Commented] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles

2015-10-19 Thread Mojmir Vinkler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963110#comment-14963110
 ] 

Mojmir Vinkler commented on SPARK-5250:
---

Yes, it's caused by reading a corrupt file (we have only experienced this with 
gzipped files). I think the file got corrupted when it was saved to S3, but we 
used boto for that, not Spark. What's weird is that I can still read the file 
with pandas without any problems.
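
As a minimal sketch (not part of the original report), one way to check whether 
the object itself is truncated, independent of Spark or pandas, is to download 
it and decompress the whole stream with Python's gzip module. The local 
filename below is just the key from the traceback and is an assumption:

{code}
import gzip

# Assumption: the S3 object has been downloaded locally under the key name
# taken from the traceback. Decompressing the full stream makes a truncated
# gzip member surface as an error instead of a silently short read.
path = "2525322021051.csv.gz"

try:
    with gzip.open(path, "rb") as f:
        total = 0
        while True:
            chunk = f.read(1024 * 1024)
            if not chunk:
                break
            total += len(chunk)
    print("decompressed %d bytes, gzip stream looks intact" % total)
except (EOFError, IOError) as err:
    # EOFError: stream ends before the gzip end-of-stream marker (truncation)
    # IOError/OSError: CRC or size check failed (corruption)
    print("gzip stream is corrupt: %r" % err)
{code}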

> EOFException in when reading gzipped files from S3 with wholeTextFiles
> --
>
> Key: SPARK-5250
> URL: https://issues.apache.org/jira/browse/SPARK-5250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Mojmir Vinkler
>Priority: Critical
>
> I get an `EOFException` when reading *some* gzipped files using 
> `sc.wholeTextFiles`. It happens to just a few files; I thought the file 
> was corrupted, but I was able to read it without problems using 
> `sc.textFile` (and pandas). 
> Traceback for command 
> `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()`
> {code}
> ---
> Py4JJavaError Traceback (most recent call last)
> <ipython-input-104-943aab11de03> in <module>()
> ----> 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()
> /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self)
> 674 """
> 675 with SCCallSiteSync(self.context) as css:
> --> 676 bytesInJava = self._jrdd.collect().iterator()
> 677 return list(self._collect_iterator_through_file(bytesInJava))
> 678 
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer, self.gateway_client,
> --> 538 self.target_id, self.name)
> 539 
> 540 for temp_arg in temp_args:
> /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 298 raise Py4JJavaError(
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> 302 raise Py4JError(
> Py4JJavaError: An error occurred while calling o1576.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: 
> Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
>   at java.io.InputStream.read(InputStream.java:101)
>   at com.google.common.io.ByteStreams.copy(ByteStreams.java:207)
>   at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252)
>   at 
> org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
>  

[jira] [Created] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles

2015-01-14 Thread Mojmir Vinkler (JIRA)
Mojmir Vinkler created SPARK-5250:
-

 Summary: EOFException in when reading gzipped files from S3 with 
wholeTextFiles
 Key: SPARK-5250
 URL: https://issues.apache.org/jira/browse/SPARK-5250
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Mojmir Vinkler


I get an `EOFException` when reading *some* gzipped files using 
`sc.wholeTextFiles`. It happens to just a few files; I thought the file was 
corrupted, but I was able to read it without problems using `sc.textFile` 
(and pandas). 

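For reference, a minimal sketch of the two read paths being compared (this is 
an illustration, assuming an existing SparkContext `sc`; the s3n:// path is 
the one from the traceback below):

{code}
# Both reads decompress the same gzipped object through Hadoop's codec path.
path = 's3n://s3bucket/2525322021051.csv.gz'

# Line-by-line read: succeeds on this file.
line_count = sc.textFile(path).count()

# Whole-file read: WholeTextFileRecordReader buffers the entire decompressed
# file per record (ByteStreams.toByteArray in the stack trace) and fails here
# with java.io.EOFException: Unexpected end of input stream.
pairs = sc.wholeTextFiles(path).collect()
{code}
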
Traceback for command 
`sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()`

{code}
---
Py4JJavaError Traceback (most recent call last)
<ipython-input-104-943aab11de03> in <module>()
----> 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()

/home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self)
674 
675 with SCCallSiteSync(self.context) as css:
--> 676 bytesInJava = self._jrdd.collect().iterator()
677 return list(self._collect_iterator_through_file(bytesInJava))
678 

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py 
in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o1576.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 
(TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: Unexpected end 
of input stream
at 
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
at java.io.InputStream.read(InputStream.java:101)
at com.google.common.io.ByteStreams.copy(ByteStreams.java:207)
at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252)
at 
org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73)
at 
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at 
org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at 
org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at 
org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 

[jira] [Commented] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles

2015-01-14 Thread Mojmir Vinkler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277051#comment-14277051
 ] 

Mojmir Vinkler commented on SPARK-5250:
---

Just tested with Scala and got the same error.

 EOFException in when reading gzipped files from S3 with wholeTextFiles
 --

 Key: SPARK-5250
 URL: https://issues.apache.org/jira/browse/SPARK-5250
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Mojmir Vinkler

 I get an `EOFException` when reading *some* gzipped files using 
 `sc.wholeTextFiles`. It happens to just a few files; I thought the file 
 was corrupted, but I was able to read it without problems using 
 `sc.textFile` (and pandas). 
 Traceback for command 
 `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()`
 {code}
 ---
 Py4JJavaError Traceback (most recent call last)
 <ipython-input-104-943aab11de03> in <module>()
 ----> 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()
 /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self)
 674 
 675 with SCCallSiteSync(self.context) as css:
 --> 676 bytesInJava = self._jrdd.collect().iterator()
 677 return list(self._collect_iterator_through_file(bytesInJava))
 678 
 /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
  in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 --> 538 self.target_id, self.name)
 539 
 540 for temp_arg in temp_args:
 /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o1576.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: 
 Unexpected end of input stream
   at 
 org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137)
   at 
 org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77)
   at java.io.InputStream.read(InputStream.java:101)
   at com.google.common.io.ByteStreams.copy(ByteStreams.java:207)
   at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252)
   at 
 org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73)
   at 
 org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at 
 org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at 
 org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at 
 org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at 
 org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
   at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at