Hi team, I have been digging into this over the weekend and believe I am on the right path to speeding up our training with the following settings on a single machine (16 cores, 64 GB RAM). We are now moving to a cluster and are hitting other issues; I will post a separate thread about those. I think my initial question on this thread has been answered by this article: https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory
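For reference, here is a minimal sketch of the sizing arithmetic from that article, applied to this machine (16 cores, 64 GB). The 1-core/1-GB reservation for the OS and the ~7% memory-overhead factor are heuristics taken from that answer, not values measured here:

```python
def size_executors(total_cores, total_mem_gb, cores_per_executor=5):
    """Executor-sizing heuristic from the cited StackOverflow answer.

    Assumptions (from the answer, not measured): reserve 1 core and 1 GB
    for the OS/daemons, cap executors at 5 cores each for good HDFS
    throughput, and leave ~7% of executor memory for off-heap overhead.
    """
    # Leave 1 core and 1 GB for the OS / Hadoop daemons.
    usable_cores = total_cores - 1
    usable_mem_gb = total_mem_gb - 1

    num_executors = usable_cores // cores_per_executor   # 15 // 5 = 3
    mem_per_executor = usable_mem_gb / num_executors     # 63 / 3 = 21 GB

    # Request ~93% of the per-executor share, leaving ~7% for
    # spark.yarn.executor.memoryOverhead.
    executor_memory_gb = int(mem_per_executor * 0.93)    # ~19 GB

    return num_executors, executor_memory_gb

print(size_executors(16, 64))  # (3, 19)
```

That arithmetic is where the --executor-cores 5 / --num-executors 3 / --executor-memory 19g values below come from.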
I am now using an m5.4xlarge with the following spark-submit parameters:

pio train -- --executor-cores 5 --executor-memory 19g --num-executors 3 --driver-memory 48g

Thanks

*Shane Johnson | LIFT IQ*
*Founder | CEO*
*www.liftiq.com <http://www.liftiq.com/>* or *[email protected] <[email protected]>*
mobile: (801) 360-3350
LinkedIn <https://www.linkedin.com/in/shanewjohnson/> | Twitter <https://twitter.com/SWaldenJ> | Facebook <https://www.facebook.com/shane.johnson.71653>

On Fri, Jul 27, 2018 at 5:11 PM, Shane Johnson <[email protected]> wrote:

> Hi team,
>
> We just moved our ML prototype over to AWS. I currently have everything
> configured on a single machine (r5d.xlarge). Our training takes about an
> hour when running most things with default settings. Can someone advise
> how I can use the cores differently to split up the jobs/tasks and speed
> up processing? If I migrate to the r5d.4xlarge, is it reasonable to expect
> processing to be roughly 4x faster because I am moving from 4 cores to 16
> cores? Are there parameters I need to set, or will Spark make the best use
> of the cores and memory automatically? I am using a simple random forest
> model in the lead scoring template.
>
> Perhaps I need to adjust the spark-config or the spark-submit parameters.
> Can someone help me understand how driver memory, executor memory, and the
> number of cores play together, how I should think about them, and what
> other params would optimize the training process, given that I am still
> running on a single machine and not a cluster?
>
> I am trying to understand the optimal setup for training based on the
> r5d.xlarge:
>
> pio train -- --driver-memory 32G --executor-memory 32G --num-cores 4
>
> I would like to move to an r5d.4xlarge to get training down to 15 minutes
> or faster once I have a better handle on tuning Spark. Thank you for the
> help.
>
> Best,
>
> Shane
