Hi team,

We just moved our ML prototype deployment over to AWS. I currently have
everything configured on a single r5d.xlarge machine. Training takes
about an hour with most things running at default settings. Can someone
advise on how I can better leverage the cores to split up the jobs/tasks
and speed up processing? If I migrate to an r5d.4xlarge, is it reasonable
to expect processing to be roughly 4x faster because I am moving from 4
cores to 16 cores? Are there parameters I need to set, or will Spark make
the best use of the cores and memory automatically? I am using a simple
random forest model in the lead scoring template.

Perhaps I need to adjust the Spark config or the spark-submit parameters.
Can someone help me understand how driver memory, executor memory, and the
number of cores play together, and how I should think about them and other
parameters to optimize training, given that I am still running on a single
machine and not a cluster?
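
For context, here is roughly how I understand the flags apply on a single
machine. My assumption is that Spark runs in local mode there, where the
executor lives inside the driver JVM, so --driver-memory is the setting
that really matters, --master local[N] controls how many threads/cores are
used, and --executor-memory is mostly ignored. Please correct me if I have
that wrong. Something like:

pio train -- --master local[4] --driver-memory 24G

(24G is just a guess on my part to leave headroom for the OS and the
PredictionIO services on the 32 GiB box.)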

I am trying to understand the optimal setup for training on the
r5d.xlarge. This is what I am currently running:

pio train -- --driver-memory 32G --executor-memory 32G --num-cores 4
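
And this is roughly what I was thinking of trying on the r5d.4xlarge
(assuming 16 vCPUs and 128 GiB of RAM on that instance, that local mode is
still the right choice on a single machine, and leaving some memory
headroom for the OS; the numbers are just my guesses):

pio train -- --master local[16] --driver-memory 100G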

I would like to move to an r5d.4xlarge to get training down to 15 minutes
or faster once I have a better handle on tuning Spark. Thank you for the
help.

Best,

Shane

Shane Johnson | LIFT IQ
Founder | CEO
www.liftiq.com | [email protected]
mobile: (801) 360-3350
LinkedIn: https://www.linkedin.com/in/shanewjohnson/ | Twitter:
https://twitter.com/SWaldenJ | Facebook:
https://www.facebook.com/shane.johnson.71653
