Re: Inferring Data driven Spark parameters

2018-07-04 Thread Mich Talebzadeh
Hi Aakash,

For clarification, are you running this in YARN client mode or standalone?

How much total YARN memory is available?
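
If you are not sure of the totals, one quick way to check is the
ResourceManager REST API (a minimal sketch; the host and port below are
placeholders, point them at your own ResourceManager):

# Sketch: read total/available YARN memory from the ResourceManager REST API.
# The URL is a placeholder -- substitute your ResourceManager host and port.
import json
from urllib.request import urlopen

RM_METRICS_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/metrics"

with urlopen(RM_METRICS_URL) as resp:
    metrics = json.load(resp)["clusterMetrics"]

print("Total YARN memory (MB):    ", metrics["totalMB"])
print("Available YARN memory (MB):", metrics["availableMB"])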

From my experience on a bigger cluster, I found the following incremental
settings useful (CDH 5.9, YARN client mode), so you can scale yours
accordingly (a rough sizing sketch follows the examples):

[1] - 576GB

--num-executors 24

--executor-memory 21G

--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000



[2] - 672GB

--num-executors 28

--executor-memory 21G

--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000



[3] - 786GB

--num-executors 32

--executor-memory 21G

--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000



[4] - 864GB

--num-executors 32

--executor-memory 21G

--executor-cores 4
--conf spark.yarn.executor.memoryOverhead=3000
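
As a rough rule of thumb behind the numbers above: keep the executor
footprint fixed (21G heap + ~3G memoryOverhead, 4 cores) and scale
--num-executors with the total YARN memory. A minimal sketch of that
arithmetic (illustrative only; example [4] above is capped at 32 executors
rather than following the division):

# Sketch: derive --num-executors from total YARN memory for a fixed
# executor footprint (21 GB heap + ~3 GB overhead), as in the examples above.
def suggest_num_executors(total_yarn_memory_gb,
                          executor_memory_gb=21,
                          memory_overhead_gb=3):
    footprint_gb = executor_memory_gb + memory_overhead_gb  # per-executor ask to YARN
    return total_yarn_memory_gb // footprint_gb

for total_gb in (576, 672, 786):
    print(total_gb, "GB ->", suggest_num_executors(total_gb), "executors")
# 576 GB -> 24, 672 GB -> 28, 786 GB -> 32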



HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 3 Jul 2018 at 08:34, Aakash Basu  wrote:

> Hi,
>
> Cluster - 5 node (1 Driver and 4 workers)
> Driver Config: 16 cores, 32 GB RAM
> Worker Config: 8 cores, 16 GB RAM
>
> I'm using the below parameters from which I know the first chunk is
> cluster dependent and the second chunk is data/code dependent.
>
> --num-executors 4
> --executor-cores 5
> --executor-memory 10G
> --driver-cores 5
> --driver-memory 25G
>
>
> --conf spark.sql.shuffle.partitions=100
> --conf spark.driver.maxResultSize=2G
> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
> --conf spark.scheduler.listenerbus.eventqueue.capacity=2
>
> I've come up to these values based on my research on the properties and the
> issues I faced, and hence the handles.
>
> My ask here is -
>
> *1) How can I infer, using some formula or a code, to calculate the below
> chunk dependent on the data/code?*
> *2) What are the other usable properties/configurations which I can use to
> shorten my job runtime?*
>
> Thanks,
> Aakash.
>


Re: Inferring Data driven Spark parameters

2018-07-04 Thread Prem Sure
Can you share which API your jobs use: just core RDDs, or SQL, or
DStreams, etc.?
Refer to the recommendations at
https://spark.apache.org/docs/2.3.0/configuration.html for detailed
configuration options.
Thanks,
Prem

On Wed, Jul 4, 2018 at 12:34 PM, Aakash Basu 
wrote:

> I do not want to change executor/driver cores/memory on the fly in a
> single Spark job; all I want is to make them cluster specific. So I want
> to have a formula with which, depending on the driver and executor
> details, I can find the values for them before submitting those details
> in the spark-submit.
>
> I more or less know how to achieve the above, as I've done it previously.
>
> What I need is to tweak the other Spark confs depending on the data. Is
> that possible? I mean (just an example), if I have 100+ features, I want
> to double my default spark.driver.maxResultSize to 2G, and similarly for
> other configs. Can that be achieved by any means for an optimal run on
> that kind of dataset? If yes, how?
>
> On Tue, Jul 3, 2018 at 6:28 PM, Vadim Semenov  wrote:
>
>> You can't change the executor/driver cores/memory on the fly once
>> you've already started a Spark Context.
>> On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu 
>> wrote:
>> >
>> > We aren't using Oozie or similar, moreover, the end to end job shall be
>> exactly the same, but the data will be extremely different (number of
>> continuous and categorical columns, vertical size, horizontal size, etc),
>> hence, if there would have been a calculation of the parameters to arrive
>> at a conclusion that we can simply get the data and derive the respective
>> configuration/parameters, it would be great.
>> >
>> > On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke 
>> wrote:
>> >>
>> >> Don’t do this in your job. Create for different types of jobs
>> different jobs and orchestrate them using oozie or similar.
>> >>
>> >> On 3. Jul 2018, at 09:34, Aakash Basu 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Cluster - 5 node (1 Driver and 4 workers)
>> >> Driver Config: 16 cores, 32 GB RAM
>> >> Worker Config: 8 cores, 16 GB RAM
>> >>
>> >> I'm using the below parameters from which I know the first chunk is
>> cluster dependent and the second chunk is data/code dependent.
>> >>
>> >> --num-executors 4
>> >> --executor-cores 5
>> >> --executor-memory 10G
>> >> --driver-cores 5
>> >> --driver-memory 25G
>> >>
>> >>
>> >> --conf spark.sql.shuffle.partitions=100
>> >> --conf spark.driver.maxResultSize=2G
>> >> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
>> >> --conf spark.scheduler.listenerbus.eventqueue.capacity=2
>> >>
>> >> I've come up to these values based on my research on the properties and
>> the issues I faced, and hence the handles.
>> >>
>> >> My ask here is -
>> >>
>> >> 1) How can I infer, using some formula or a code, to calculate the
>> below chunk dependent on the data/code?
>> >> 2) What are the other usable properties/configurations which I can use
>> to shorten my job runtime?
>> >>
>> >> Thanks,
>> >> Aakash.
>> >
>> >
>>
>>
>> --
>> Sent from my iPhone
>>
>
>


Re: Inferring Data driven Spark parameters

2018-07-04 Thread Aakash Basu
I do not want to change executor/driver cores/memory on the fly in a single
Spark job; all I want is to make them cluster specific. So I want to have a
formula with which, depending on the driver and executor details, I can find
the values for them before submitting those details in the spark-submit.

I more or less know how to achieve the above, as I've done it previously.

What I need is to tweak the other Spark confs depending on the data. Is that
possible? I mean (just an example), if I have 100+ features, I want to
double my default spark.driver.maxResultSize to 2G, and similarly for other
configs. Can that be achieved by any means for an optimal run on that kind
of dataset? If yes, how?
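
For example (a hypothetical sketch of what I mean; the thresholds, numbers
and helper name are made up for illustration, not something we run today):

# Hypothetical sketch: inspect the data first, then derive --conf values
# before building the spark-submit command. Thresholds are illustrative.
def derive_confs(num_features, input_size_gb):
    return {
        # more shuffle partitions for bigger inputs (very roughly ~128 MB each)
        "spark.sql.shuffle.partitions": str(max(100, int(input_size_gb * 1024) // 128)),
        # allow larger collects to the driver when the feature space is wide
        "spark.driver.maxResultSize": "2G" if num_features > 100 else "1G",
    }

confs = derive_confs(num_features=120, input_size_gb=50)
cmd = ["spark-submit"]
for key, value in confs.items():
    cmd += ["--conf", "{}={}".format(key, value)]
print(" ".join(cmd) + " my_job.py")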

On Tue, Jul 3, 2018 at 6:28 PM, Vadim Semenov  wrote:

> You can't change the executor/driver cores/memory on the fly once
> you've already started a Spark Context.
> On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu 
> wrote:
> >
> > We aren't using Oozie or similar, moreover, the end to end job shall be
> exactly the same, but the data will be extremely different (number of
> continuous and categorical columns, vertical size, horizontal size, etc),
> hence, if there would have been a calculation of the parameters to arrive
> at a conclusion that we can simply get the data and derive the respective
> configuration/parameters, it would be great.
> >
> > On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke 
> wrote:
> >>
> >> Don’t do this in your job. Create for different types of jobs different
> jobs and orchestrate them using oozie or similar.
> >>
> >> On 3. Jul 2018, at 09:34, Aakash Basu 
> wrote:
> >>
> >> Hi,
> >>
> >> Cluster - 5 node (1 Driver and 4 workers)
> >> Driver Config: 16 cores, 32 GB RAM
> >> Worker Config: 8 cores, 16 GB RAM
> >>
> >> I'm using the below parameters from which I know the first chunk is
> cluster dependent and the second chunk is data/code dependent.
> >>
> >> --num-executors 4
> >> --executor-cores 5
> >> --executor-memory 10G
> >> --driver-cores 5
> >> --driver-memory 25G
> >>
> >>
> >> --conf spark.sql.shuffle.partitions=100
> >> --conf spark.driver.maxResultSize=2G
> >> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
> >> --conf spark.scheduler.listenerbus.eventqueue.capacity=2
> >>
> >> I've come up to these values based on my research on the properties and
> the issues I faced, and hence the handles.
> >>
> >> My ask here is -
> >>
> >> 1) How can I infer, using some formula or a code, to calculate the
> below chunk dependent on the data/code?
> >> 2) What are the other usable properties/configurations which I can use
> to shorten my job runtime?
> >>
> >> Thanks,
> >> Aakash.
> >
> >
>
>
> --
> Sent from my iPhone
>


Re: Inferring Data driven Spark parameters

2018-07-03 Thread Vadim Semenov
You can't change the executor/driver cores/memory on the fly once
you've already started a Spark Context.
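
(For what it's worth, runtime SQL confs such as spark.sql.shuffle.partitions
are a different story and can still be adjusted on a live session; a minimal
sketch:)

# Minimal sketch: core/memory settings are fixed once the context exists,
# but runtime SQL confs can still be changed on a live SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-conf-demo").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "400")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 400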
On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu  wrote:
>
> We aren't using Oozie or similar, moreover, the end to end job shall be 
> exactly the same, but the data will be extremely different (number of 
> continuous and categorical columns, vertical size, horizontal size, etc), 
> hence, if there would have been a calculation of the parameters to arrive at 
> a conclusion that we can simply get the data and derive the respective 
> configuration/parameters, it would be great.
>
> On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke  wrote:
>>
>> Don’t do this in your job. Create for different types of jobs different jobs 
>> and orchestrate them using oozie or similar.
>>
>> On 3. Jul 2018, at 09:34, Aakash Basu  wrote:
>>
>> Hi,
>>
>> Cluster - 5 node (1 Driver and 4 workers)
>> Driver Config: 16 cores, 32 GB RAM
>> Worker Config: 8 cores, 16 GB RAM
>>
>> I'm using the below parameters from which I know the first chunk is cluster 
>> dependent and the second chunk is data/code dependent.
>>
>> --num-executors 4
>> --executor-cores 5
>> --executor-memory 10G
>> --driver-cores 5
>> --driver-memory 25G
>>
>>
>> --conf spark.sql.shuffle.partitions=100
>> --conf spark.driver.maxResultSize=2G
>> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
>> --conf spark.scheduler.listenerbus.eventqueue.capacity=2
>>
>> I've come up to these values based on my research on the properties and the
>> issues I faced, and hence the handles.
>>
>> My ask here is -
>>
>> 1) How can I infer, using some formula or a code, to calculate the below 
>> chunk dependent on the data/code?
>> 2) What are the other usable properties/configurations which I can use to 
>> shorten my job runtime?
>>
>> Thanks,
>> Aakash.
>
>


-- 
Sent from my iPhone




Re: Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
We aren't using Oozie or similar. Moreover, the end-to-end job will be
exactly the same, but the data will be extremely different (number of
continuous and categorical columns, vertical size, horizontal size, etc.).
Hence, if there were a way to calculate the parameters, so that we could
simply take the data and derive the respective configuration/parameters, it
would be great.

On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke  wrote:

> Don’t do this in your job. Create for different types of jobs different
> jobs and orchestrate them using oozie or similar.
>
> On 3. Jul 2018, at 09:34, Aakash Basu  wrote:
>
> Hi,
>
> Cluster - 5 node (1 Driver and 4 workers)
> Driver Config: 16 cores, 32 GB RAM
> Worker Config: 8 cores, 16 GB RAM
>
> I'm using the below parameters from which I know the first chunk is
> cluster dependent and the second chunk is data/code dependent.
>
> --num-executors 4
> --executor-cores 5
> --executor-memory 10G
> --driver-cores 5
> --driver-memory 25G
>
>
> --conf spark.sql.shuffle.partitions=100
> --conf spark.driver.maxResultSize=2G
> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
> --conf spark.scheduler.listenerbus.eventqueue.capacity=2
>
> I've come up to these values based on my research on the properties and the
> issues I faced, and hence the handles.
>
> My ask here is -
>
> *1) How can I infer, using some formula or a code, to calculate the below
> chunk dependent on the data/code?*
> *2) What are the other usable properties/configurations which I can use to
> shorten my job runtime?*
>
> Thanks,
> Aakash.
>
>


Re: Inferring Data driven Spark parameters

2018-07-03 Thread Jörn Franke
Don’t do this in your job. Create different jobs for the different types of
work and orchestrate them using Oozie or similar.

> On 3. Jul 2018, at 09:34, Aakash Basu  wrote:
> 
> Hi,
> 
> Cluster - 5 node (1 Driver and 4 workers)
> Driver Config: 16 cores, 32 GB RAM
> Worker Config: 8 cores, 16 GB RAM
> 
> I'm using the below parameters from which I know the first chunk is cluster 
> dependent and the second chunk is data/code dependent.
> 
> --num-executors 4 
> --executor-cores 5
> --executor-memory 10G 
> --driver-cores 5 
> --driver-memory 25G 
> 
> 
> --conf spark.sql.shuffle.partitions=100 
> --conf spark.driver.maxResultSize=2G 
> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" 
> --conf spark.scheduler.listenerbus.eventqueue.capacity=2
> 
> I've come up to these values based on my research on the properties and the
> issues I faced, and hence the handles.
> 
> My ask here is -
> 
> 1) How can I infer, using some formula or a code, to calculate the below 
> chunk dependent on the data/code?
> 2) What are the other usable properties/configurations which I can use to 
> shorten my job runtime?
> 
> Thanks,
> Aakash.


Inferring Data driven Spark parameters

2018-07-03 Thread Aakash Basu
Hi,

Cluster - 5 node (1 Driver and 4 workers)
Driver Config: 16 cores, 32 GB RAM
Worker Config: 8 cores, 16 GB RAM

I'm using the below parameters, of which I know the first chunk is cluster
dependent and the second chunk is data/code dependent.

--num-executors 4
--executor-cores 5
--executor-memory 10G
--driver-cores 5
--driver-memory 25G


--conf spark.sql.shuffle.partitions=100
--conf spark.driver.maxResultSize=2G
--conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
--conf spark.scheduler.listenerbus.eventqueue.capacity=2

I've come up to these values based on my research on the properties and the
issues I faced, and hence the handles.

My ask here is -

1) How can I infer, using some formula or code, the values for the second
chunk based on the data/code?
2) What are the other usable properties/configurations which I can use to
shorten my job runtime?

Thanks,
Aakash.