[ https://issues.apache.org/jira/browse/SPARK-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust resolved SPARK-18211.
--------------------------------------
    Resolution: Not A Problem

As of Spark 2.0 we do our own splitting/bin-packing of files into tasks, rather than relying on Hadoop InputFormats. The configuration you are probably looking for is {{spark.sql.files.maxPartitionBytes}}, which you can set on the SQLContext.

> Spark SQL ignores split.size
> ----------------------------
>
>                 Key: SPARK-18211
>                 URL: https://issues.apache.org/jira/browse/SPARK-18211
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: lostinoverflow
>
> I expect the RDD and the DataFrame to have the same number of partitions (this worked in 1.6), but it looks like Spark SQL ignores the Hadoop configuration.
> {code}
> import org.apache.spark.sql.SparkSession
>
> object App {
>   def main(args: Array[String]) {
>     val spark = SparkSession
>       .builder()
>       .master("local[*]")
>       .appName("split size")
>       .getOrCreate()
>
>     // Hadoop split-size settings, honored by sc.textFile in 1.6
>     spark.sparkContext.hadoopConfiguration.setInt("mapred.min.split.size", args(0).toInt)
>     spark.sparkContext.hadoopConfiguration.setInt("mapred.max.split.size", args(0).toInt)
>
>     // Expected to print the same partition count; they differ in 2.0
>     println(spark.sparkContext.textFile(args(1)).partitions.size)
>     println(spark.read.textFile(args(1)).rdd.partitions.size)
>
>     spark.stop()
>   }
> }
> {code}
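For reference, a minimal sketch of applying the suggested setting when building the session; the object name {{SplitSizeDemo}} and the 4 MB cap are illustrative, not taken from the issue:

{code}
import org.apache.spark.sql.SparkSession

object SplitSizeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("split size demo")
      // Cap each file-based scan partition at roughly 4 MB (value in bytes);
      // for SQL reads this takes the place of the mapred.*.split.size properties.
      .config("spark.sql.files.maxPartitionBytes", 4L * 1024 * 1024)
      .getOrCreate()

    // DataFrame/Dataset file reads are split by the option above,
    // while sc.textFile still follows the Hadoop configuration.
    println(spark.read.textFile(args(0)).rdd.partitions.size)

    spark.stop()
  }
}
{code}

The option is a runtime SQL configuration, so it can also be changed on a live session with {{spark.conf.set("spark.sql.files.maxPartitionBytes", ...)}}.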