The biggest factor is definitely the maxDepth of the trees: with values of 5 or lower, classification time drops into the millisecond range. The number of trees affects performance as well, but far less. I profiled the app and a significant share of the time is spent in serialization, so I'm wondering whether Spark is failing to cache the model on the workers during classification.
useNodeIdCache is ON, but the docs aren't clear on whether Spark uses it only during training. I should also say that we didn't have this problem with the old mllib API, so it might be something in the new ml API that I'm missing.
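For context, here is roughly how we configure the classifier in the new ml API. This is a trimmed sketch rather than our exact code: the column names and the checkpoint interval are placeholders, and setCacheNodeIds is, as far as I can tell, the ml counterpart of the old useNodeIdCache flag.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.DataFrame

// Trimmed sketch; trainingDF/testDF stand in for our real DataFrames,
// which carry the usual "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setFeatureSubsetStrategy("all")
  .setImpurity("gini")
  .setMaxBins(32)
  .setMaxDepth(11)
  .setNumTrees(100)
  .setCacheNodeIds(true)      // ml counterpart of useNodeIdCache; seems to help training only
  .setCheckpointInterval(10)  // placeholder; only relevant while cacheNodeIds is on

val model = rf.fit(trainingDF)

// Scoring. My suspicion: transform() captures the whole forest in the
// prediction UDF's closure, so a deep 500-tree model may be re-serialized
// and shipped to the executors on every job rather than cached there.
val predictions: DataFrame = model.transform(testDF)

If that suspicion is right, it would also explain why maxDepth dominates: the serialized size of a tree can grow exponentially with its depth.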
I will dig deeper into the problem after the holidays.

2015-12-25 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>:

> so it looks like you're increasing num trees by 5x and you're seeing an 8x
> increase in runtime, correct?
>
> did you analyze the Spark cluster resources to monitor the memory usage,
> spillage, disk I/O, etc?
>
> you may need more Workers.
>
> On Tue, Dec 22, 2015 at 8:57 AM, Alexander Ratnikov
> <ratnikov.alexan...@gmail.com> wrote:
>>
>> Hi All,
>>
>> It would be good to get some tips on tuning Apache Spark for Random
>> Forest classification.
>> Currently, we have a model that looks like:
>>
>> featureSubsetStrategy all
>> impurity gini
>> maxBins 32
>> maxDepth 11
>> numberOfClasses 2
>> numberOfTrees 100
>>
>> We are running Spark 1.5.1 as a standalone cluster:
>> 1 Master and 2 Worker nodes, each with 32GB of RAM and 4 cores.
>> Classification takes 440 ms.
>>
>> When we increase the number of trees to 500, it already takes 8 seconds.
>> We tried reducing the depth, but then the error rate is higher. We have
>> around 246 attributes.
>>
>> Probably we are doing something wrong. Any ideas how we could improve
>> the performance?
>
> --
> Chris Fregly
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com