Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks, Ayan! It finally worked! Thanks a lot, everyone, for the inputs. Once I prefixed the params with "spark.hadoop", I see the number of tasks getting reduced. I'm setting the following params: --conf spark.hadoop.dfs.block.size and --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize.
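A minimal sketch of the same setup done programmatically, assuming a 1 GB target split size (the thread does not state the actual value used). Keys prefixed with "spark.hadoop." are copied by Spark into the Hadoop Configuration used for reads, which is what the --conf flags above rely on.

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to passing on spark-submit, e.g.:
    //   --conf spark.hadoop.dfs.block.size=1073741824
    //   --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1073741824
    // The 1 GB value is illustrative only.
    val targetSplit = (1024L * 1024 * 1024).toString

    val conf = new SparkConf()
      .setAppName("hdfs-large-split-read")
      .set("spark.hadoop.dfs.block.size", targetSplit)
      .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", targetSplit)

    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs_file_path")
    println(s"partitions = ${lines.getNumPartitions}")  // fewer partitions means fewer read tasks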

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Maybe you need to set the parameters for the mapreduce API rather than the mapred API. I don't have in mind right now how they differ, but the Hadoop web page should tell you ;-)
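For reference, the same minimum-split setting goes by different property names in the two APIs. A small sketch, with a purely illustrative 1 GB value:

    // Old (mapred) vs. new (mapreduce) names for the same setting; in Hadoop 2+
    // the old name is a deprecated alias of the new one. 1 GB is illustrative only.
    val splitSize = (1024L * 1024 * 1024).toString
    sc.hadoopConfiguration.set("mapred.min.split.size", splitSize)                          // old mapred API
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize", splitSize)  // new mapreduce API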

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
Have you seen this: https://stackoverflow.com/questions/42796561/set-hadoop-configuration-values-on-spark-submit-command-line ? Please try it and let us know.

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks for the inputs!! I passed in spark.mapred.max.split.size and spark.mapred.min.split.size set to the size I wanted to read, but it didn't take any effect. I also tried passing in spark.dfs.block.size, with all the params set to the same value.

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
I have not tested this, but you should be able to pass any MapReduce-like conf to the underlying Hadoop config. Essentially you should be able to control split behaviour just as you would in a MapReduce program (Spark uses the same input formats).
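A sketch of that idea, assuming the new-API TextInputFormat and an illustrative 1 GB minimum split size: the same Configuration keys a MapReduce job would use are handed straight to the input format Spark drives.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Copy the existing Hadoop config and raise the minimum split size (1 GB here,
    // illustrative only), then read through the same input format MapReduce uses.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.input.fileinputformat.split.minsize", (1024L * 1024 * 1024).toString)

    val lines = sc.newAPIHadoopFile(
      "hdfs_file_path",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      hadoopConf
    ).map(_._2.toString)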

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Write your own input format/data source, or split the file yourself beforehand (not recommended).
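A rough sketch of the custom input format route, assuming the new Hadoop API; LargeSplitTextInputFormat and the 1 GB figure are made up for illustration.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Hypothetical input format that forces a fixed split size instead of the
    // usual max(minSize, min(maxSize, blockSize)) computed by FileInputFormat.
    class LargeSplitTextInputFormat extends TextInputFormat {
      override protected def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
        1024L * 1024 * 1024 // 1 GB per split, illustrative only
    }

    val lines = sc.newAPIHadoopFile(
      "hdfs_file_path",
      classOf[LargeSplitTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map(_._2.toString)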

Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Hi, I'm trying to read a 60 GB HDFS file using Spark's textFile("hdfs_file_path", minPartitions). How can I control the number of tasks by increasing the split size? With the default split size of 250 MB, several tasks are created, but I would like to have a specific number of tasks created while reading from HDFS.
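For context, minPartitions alone cannot achieve this: it is only a suggested lower bound on the partition count, so it can raise the number of read tasks but never lower it. A small sketch, with illustrative numbers:

    // minPartitions is a suggested minimum; Spark will not create fewer partitions
    // than the input splits allow, so it cannot reduce the number of read tasks.
    val lines = sc.textFile("hdfs_file_path", minPartitions = 2)
    // With ~250 MB splits, a 60 GB file still yields roughly 60 GB / 250 MB ≈ 240 partitions.
    println(lines.getNumPartitions)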