Hi team,

We just moved our ML prototype over to AWS. Everything is currently configured on a single r5d.xlarge machine, and training takes about an hour with mostly default settings. Can someone advise how I can make better use of the cores to split up the jobs/tasks and speed up the processing?

If I migrate to an r5d.4xlarge, is it reasonable to expect roughly a 4x speedup because I am going from 4 cores to 16? Are there parameters I need to set, or will Spark make the best use of the cores and memory automatically? I am using a simple random forest model in the lead scoring template.
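For example, is something along these lines the right direction for the r5d.4xlarge? The --master setting and the numbers below are just my guesses, not something I have tested:

    pio train -- --master local[16] --driver-memory 100G

Or am I thinking about this the wrong way?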
Perhaps I need to adjust the Spark config or the spark-submit parameters (see the P.S. below for the kind of change I have in mind). Can someone help me understand how driver memory, executor memory, and the number of cores play together, and how I should think about them and other parameters to optimize training, given that I am still running on a single machine and not a cluster?

This is what I am currently running on the r5d.xlarge:

    pio train -- --driver-memory 32G --executor-memory 32G --num-cores 4

I would like to move to an r5d.4xlarge and get training down to 15 minutes or faster once I have a better handle on tuning Spark.

Thank you for the help.

Best,
Shane

Shane Johnson | LIFT IQ
Founder | CEO
www.liftiq.com | [email protected]
mobile: (801) 360-3350
LinkedIn <https://www.linkedin.com/in/shanewjohnson/> | Twitter <https://twitter.com/SWaldenJ> | Facebook <https://www.facebook.com/shane.johnson.71653>
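P.S. When I say "adjust the Spark config", I mean setting properties in conf/spark-defaults.conf along these lines. The values are just my guesses mirroring what I pass on the command line today, not something I have verified:

    spark.driver.memory        32g
    spark.executor.memory      32g
    spark.default.parallelism  8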
