Is your application using the Spark SQL / DataFrame API? If so, please try setting
spark.sql.files.maxPartitionBytes to a larger value (the default is 128MB).

Thanks,
Manu Zhang

On Feb 25, 2019, 2:58 AM +0800, Akshay Mendole <akshaymend...@gmail.com>, wrote:
> Hi,
> We have dfs.blocksize configured to be 512MB, and we have some large files
> in HDFS that we want to process with a Spark application. We want to split the
> files into more splits to optimise for memory, but the above-mentioned
> parameters are not working.
> The max and min size params below are configured to be 50MB, yet a file
> as big as 500MB is read as one split, while it is expected to split
> into at least 10 input splits.
>
> SparkConf conf = new SparkConf().setAppName(jobName);
> SparkContext sparkContext = new SparkContext(conf);
> sparkContext.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.maxsize",
>     "50000000");
> sparkContext.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.minsize",
>     "50000000");
> JavaSparkContext sc = new JavaSparkContext(sparkContext);
> sc.hadoopConfiguration().set("io.compression.codecs",
>     "com.hadoop.compression.lzo.LzopCodec");
>
> Could you please suggest what could be wrong with my configuration?
>
> Thanks,
> Akshay
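For reference, a minimal sketch (not from the thread) of how spark.sql.files.maxPartitionBytes can be set from the Java API. This property caps how many bytes Spark SQL packs into one input partition for file-based DataFrame reads, which is why it applies here instead of the mapreduce.input.fileinputformat.* split settings. The input path and the 64MB value are placeholders, not values from the thread:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class MaxPartitionBytesExample {
    public static void main(String[] args) {
        // Configure the per-partition byte cap used by Spark SQL file sources.
        // 64MB here is only an illustrative value; tune it for your files.
        SparkSession spark = SparkSession.builder()
                .appName("maxPartitionBytes-example")
                .config("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
                .getOrCreate();

        // DataFrame/Dataset reads honour the setting above; the path below
        // is a placeholder for your HDFS location.
        Dataset<String> lines = spark.read().textFile("hdfs:///path/to/large/files");

        // Inspect how many input partitions the read produced.
        System.out.println("Partitions: " + lines.javaRDD().getNumPartitions());

        spark.stop();
    }
}

The value can also be supplied as --conf spark.sql.files.maxPartitionBytes=... on spark-submit; note that smaller values yield more (smaller) partitions and larger values yield fewer.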