[jira] [Commented] (SPARK-25753) binaryFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944817#comment-16944817 ]

Dongjoon Hyun commented on SPARK-25753:
---------------------------------------

This is backported to branch-2.4 via https://github.com/apache/spark/pull/26026 .

> binaryFiles broken for small files
> ----------------------------------
>
>                 Key: SPARK-25753
>                 URL: https://issues.apache.org/jira/browse/SPARK-25753
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.4, 3.0.0
>            Reporter: liuxian
>            Assignee: liuxian
>            Priority: Minor
>             Fix For: 2.4.5, 3.0.0
>
> StreamFileInputFormat and WholeTextFileInputFormat (https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for small files, the maxSplitSize computed by StreamFileInputFormat is far smaller than the default or commonly used split size of 64/128 MB, and Spark throws an exception while trying to read them.
>
> Exception info:
> java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
>         at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
>         at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>         at scala.Option.getOrElse(Option.scala:121)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)

--
This message was sent by Atlassian Jira (v8.3.4#803005)
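The arithmetic behind the reported IOException can be sketched as follows. This is an illustrative model, not Spark's actual code: the function and variable names are assumptions, and the numbers are taken from the exception in the report. StreamFileInputFormat derives a maximum split size roughly as total input bytes divided by the requested minimum partitions, so a small input yields a tiny maxSplitSize, and CombineFileInputFormat.getSplits rejects any configuration where the per-node minimum split size exceeds that maximum.

```python
def computed_max_split_size(total_bytes: int, min_partitions: int) -> int:
    """Rough model of how StreamFileInputFormat sizes splits:
    total input bytes divided by the requested minimum partitions
    (ceiling division). Small inputs therefore produce small values."""
    return -(-total_bytes // min_partitions)


def get_splits_check(max_split_size: int, min_split_size_per_node: int) -> None:
    """Mimics the Hadoop precondition that raised the exception above."""
    if min_split_size_per_node > max_split_size:
        raise IOError(
            f"Minimum split size pernode {min_split_size_per_node} cannot be "
            f"larger than maximum split size {max_split_size}"
        )


# Values from the report: ~4 MB of input versus a configured per-node
# minimum split size of 5123456 bytes.
max_split = computed_max_split_size(total_bytes=4_194_304, min_partitions=1)
try:
    get_splits_check(max_split, min_split_size_per_node=5_123_456)
except IOError as e:
    print(e)  # reproduces the reported message
```

The linked pull requests resolve this by adjusting the computed split-size bounds so the precondition holds even for small inputs; see the PRs for the actual change.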
[jira] [Commented] (SPARK-25753) binaryFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652801#comment-16652801 ]

Apache Spark commented on SPARK-25753:
--------------------------------------

User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/22725

> binaryFiles broken for small files
> ----------------------------------
>
>                 Key: SPARK-25753
>                 URL: https://issues.apache.org/jira/browse/SPARK-25753
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.0.0
>            Reporter: liuxian
>            Priority: Minor