Hi team,

I have been digging into this over the weekend and believe I am on the
right path to speeding up our training with the following settings on a
single machine (16 cores, 64 GB RAM). We are now moving to a cluster and
are running into other issues, which I will post about in a separate
thread. My initial question on this thread has been answered by this
article:
https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory

I am now using an m5.4xlarge with the following spark-submit parameters:

pio train -- --executor-cores 5 --executor-memory 19g --num-executors 3
--driver-memory 48g
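
For anyone following along, those numbers fall out of the rule of thumb
in the linked Stack Overflow answer. This is just a sketch of that
arithmetic (the function name and the ~7% overhead figure are my own
assumptions, not anything from pio or Spark):

```python
# Rule-of-thumb executor sizing from the linked Stack Overflow answer,
# applied to a 16-core / 64 GB node such as an m5.4xlarge.

def size_executors(node_cores, node_mem_gb, cores_per_executor=5,
                   overhead_frac=0.07):
    """Leave 1 core and 1 GB for the OS/daemons, then split the rest."""
    usable_cores = node_cores - 1                        # 16 - 1 = 15
    usable_mem_gb = node_mem_gb - 1                      # 64 - 1 = 63
    num_executors = usable_cores // cores_per_executor   # 15 // 5 = 3
    mem_per_executor = usable_mem_gb / num_executors     # 63 / 3 = 21
    # Subtract ~7% for off-heap/YARN overhead and round down.
    heap_gb = int(mem_per_executor * (1 - overhead_frac))  # -> 19
    return num_executors, heap_gb

print(size_executors(16, 64))  # (3, 19)
```

That is where the --num-executors 3 and --executor-memory 19g above come
from; the driver memory is a separate knob on top of that.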


Thanks

*Shane Johnson | LIFT IQ*
*Founder | CEO*

*www.liftiq.com <http://www.liftiq.com/>* or *[email protected]
<[email protected]>*
mobile: (801) 360-3350
LinkedIn <https://www.linkedin.com/in/shanewjohnson/>  |  Twitter
<https://twitter.com/SWaldenJ> |  Facebook
<https://www.facebook.com/shane.johnson.71653>



On Fri, Jul 27, 2018 at 5:11 PM, Shane Johnson <[email protected]> wrote:

> Hi team,
>
> We just moved over to deploying our ML prototype to AWS. I currently have
> everything configured on a single machine, an r5d.xlarge. Our training is
> taking about an hour when running most things with default settings. Can
> someone advise what I can do to leverage the cores differently to split up
> the jobs/tasks and speed up the processing? If I migrate to the
> r5d.4xlarge, is it reasonable to think that the processing will be roughly
> 4x faster because I am moving from 4 cores to 16 cores? Are there
> parameters I need to set, or will Spark make the best use of the cores and
> memory automatically? I am using a simple random forest model in the lead
> scoring template.
>
> Perhaps I need to adjust the spark-config or the spark-submit parameters.
> Can someone help me understand how driver memory, executor memory, and the
> number of cores play together, and how I should think about them and other
> params to optimize the training process, given that I am still running on
> a single machine and not a cluster?
>
> I am trying to understand the optimal setup for training based on the
> r5d.xlarge
> pio train -- --driver-memory 32G --executor-memory 32G --num-cores 4
>
> I would like to move to an r5d.4xlarge to get the training to 15 minutes
> or faster once I get a better handle on tuning Spark. Thank you for the
> help.
>
> Best,
>
> Shane
>
