[jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965007#comment-14965007
 ] 

Steve Loughran commented on SPARK-11177:


0-byte files are a trouble spot in object stores, as they are often used (and 
abused) by Hadoop filesystem clients to mimic directories.

One thing to consider is skipping 0-byte files altogether, on the basis that 
they contain no relevant data.
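
As a user-side workaround along these lines, here is a minimal PySpark-flavoured sketch. It assumes the input files can be listed with `os.scandir` (i.e. a local or mounted directory); for S3 the listing would have to come from the object store's own API, which is not shown. The helper names `non_empty_files` and `whole_text_files_skipping_empty` are hypothetical. It relies on `sc.wholeTextFiles` accepting a comma-separated list of paths.

```python
import os


def non_empty_files(directory):
    """Return the paths of regular files in `directory` larger than 0 bytes.

    Assumption: `directory` is listable through the local filesystem API.
    """
    paths = []
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        # Skip sub-directories and zero-byte files (often directory markers).
        if entry.is_file() and entry.stat().st_size > 0:
            paths.append(entry.path)
    return paths


def whole_text_files_skipping_empty(sc, directory):
    """Call sc.wholeTextFiles only on the non-empty files of `directory`."""
    paths = non_empty_files(directory)
    if not paths:
        # Nothing to read: return an empty RDD instead of failing.
        return sc.parallelize([])
    # wholeTextFiles accepts several input paths separated by commas.
    return sc.wholeTextFiles(",".join(paths))
```

With an existing `SparkContext` named `sc`, `whole_text_files_skipping_empty(sc, "/data/in")` would return the usual (path, content) pairs while never handing the empty file to the input format.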

> sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero 
> bytes
> ---
>
> Key: SPARK-11177
> URL: https://issues.apache.org/jira/browse/SPARK-11177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Input/Output
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> From a user report:
> {quote}
> When I upload a series of text files to an S3 directory and one of the files 
> is empty (0 bytes). The sc.wholeTextFiles method stack traces.
> java.lang.ArrayIndexOutOfBoundsException: 0
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:506)
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285)
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245)
> at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
> {quote}
> It looks like this has been a longstanding issue:
> * http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html
> * https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark
> * https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes

2015-10-19 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962913#comment-14962913
 ] 

Josh Rosen commented on SPARK-11177:


It looks like this is caused by MAPREDUCE-4470, which is not patched in Apache 
Hadoop 1.x releases. If Spark users cannot upgrade to Hadoop 2.x and absolutely 
need a fix, one somewhat hacky solution is to use a modified copy of 
CombineFileInputFormat that lives in the Spark source tree and includes the 
three-line fix for MAPREDUCE-4470. While this works (I have tests!), it is not 
an approach suitable for inclusion in a Spark release: it would be borderline 
impossible to maintain source and binary compatibility with all of our 
supported Hadoop versions this way.
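
To illustrate the kind of guard such a fix amounts to, here is a toy model in Python. It is not Hadoop's code: it assumes the `ArrayIndexOutOfBoundsException: 0` in the trace comes from reading the first element of an empty list of block locations for a zero-byte file, and the function name and data layout are hypothetical.

```python
def first_block_hosts(block_locations):
    """Return the host list of a file's first block.

    `block_locations` is a list with one host list per block. A zero-byte
    file is modelled as having no blocks at all (an empty list).
    """
    if not block_locations:
        # Guard (assumed shape of the fix): substitute a single placeholder
        # location with no hosts so the index below cannot go out of range.
        block_locations = [[]]
    return block_locations[0]
```

Without the guard, `block_locations[0]` on an empty list raises `IndexError`, the Python analogue of the Java exception in the report.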









[jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes

2015-10-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962821#comment-14962821
 ] 

Josh Rosen commented on SPARK-11177:


Reproduced it. This only occurs on Hadoop 1.x.



