Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks Ayan!

Finally it worked!! Thanks a lot everyone for the inputs!

Once I prefixed the params with "spark.hadoop", I can see the number of tasks
being reduced.

I'm setting the following params:

--conf spark.hadoop.dfs.block.size

--conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize

--conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize
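
For completeness, the full spark-submit invocation looks roughly like this -- the 1 GB values, class name, and jar name below are only placeholders, not my exact command; the sizes should match the split size you want per task:

  spark-submit \
    --conf spark.hadoop.dfs.block.size=1073741824 \
    --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=1073741824 \
    --conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=1073741824 \
    --class com.example.MyJob my-job.jar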

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Maybe you need to set the parameters for the mapreduce API and not the mapred
API. I don't recall offhand how they differ, but the Hadoop web page should
tell you ;-)
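
If I remember correctly, the mapping between the two is roughly:

  mapred API:     mapred.min.split.size / mapred.max.split.size
  mapreduce API:  mapreduce.input.fileinputformat.split.minsize /
                  mapreduce.input.fileinputformat.split.maxsize

but please verify against the deprecated-properties table in the Hadoop
documentation for your version.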


Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
Have you seen this:
https://stackoverflow.com/questions/42796561/set-hadoop-configuration-values-on-spark-submit-command-line
? Please try and let us know.
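
In short, the trick there is that Spark forwards any property prefixed with
"spark.hadoop." into the Hadoop Configuration, so on the command line it would
look something like this (the value is just an example):

  --conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=536870912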




-- 
Best Regards,
Ayan Guha


Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks for the inputs!!

I passed in spark.mapred.max.split.size and spark.mapred.min.split.size, set to
the split size I wanted to read with, but it didn't have any effect.
I also tried passing in spark.dfs.block.size, with all the params set to
the same value.

JavaSparkContext.fromSparkContext(spark.sparkContext()).textFile(hdfsPath, 13);

Is there any other param that needs to be set as well?

Thanks

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
I have not tested this, but you should be able to pass any MapReduce-style conf
down to the underlying Hadoop config. Essentially, you should be able to
control the split behaviour just as you would in a MapReduce program (since
Spark uses the same input format).
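
As an untested sketch (Java, with a 512 MB example value and a placeholder
path), something like this is what I have in mind -- set the split-size hints
on the context's Hadoop configuration before calling textFile:

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SplitSizeDemo {
    public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("split-size-demo");
      JavaSparkContext sc = new JavaSparkContext(conf);

      // Ask the underlying Hadoop input format for ~512 MB splits (example value).
      long splitSize = 512L * 1024 * 1024;
      sc.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.minsize", Long.toString(splitSize));
      sc.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.maxsize", Long.toString(splitSize));

      // Placeholder path; each partition corresponds to one input split / task.
      JavaRDD<String> lines = sc.textFile("hdfs:///path/to/big/file");
      System.out.println("partitions: " + lines.getNumPartitions());

      sc.stop();
    }
  }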



-- 
Best Regards,
Ayan Guha


Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Write your own input format/datasource or split the file yourself beforehand 
(not recommended).




Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Hi,

I'm trying to read a 60 GB HDFS file using Spark's textFile("hdfs_file_path",
minPartitions).

How can I control the number of tasks by increasing the split size? With the
default split size of 250 MB, a large number of tasks are created. But I would
like a specific number of tasks to be created while reading from HDFS itself,
instead of using repartition() etc.

Any suggestions are helpful!

Thanks