The biggest factor is definitely the maxDepth of the trees: with values of 5 or lower, classification time drops into the millisecond range. The number of trees affects performance as well, but far less. I profiled the app and a significant share of the time is spent in serialization, so I'm wondering whether Spark is failing to cache the model on the workers during classification.
useNodeIdCache is ON, but the docs aren't clear on whether Spark uses it only during training. I should also say that we didn't have this problem with the old mllib API, so it might be something in the new ml API that I'm missing.
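For context, here is roughly how we configure the classifier in the new ml API. This is a trimmed sketch rather than our exact code: the column names and the checkpoint interval are placeholders, and setCacheNodeIds is, as far as I can tell, the ml counterpart of the old useNodeIdCache flag.

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.DataFrame

// Trimmed sketch; trainingDF/testDF stand in for our real DataFrames,
// which carry the usual "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setFeatureSubsetStrategy("all")
  .setImpurity("gini")
  .setMaxBins(32)
  .setMaxDepth(11)
  .setNumTrees(100)
  .setCacheNodeIds(true)      // ml counterpart of useNodeIdCache; seems to help training only
  .setCheckpointInterval(10)  // placeholder; only relevant while cacheNodeIds is on

val model = rf.fit(trainingDF)

// Scoring. My suspicion: transform() captures the whole forest in the
// prediction UDF's closure, so a deep 500-tree model may be re-serialized
// and shipped to the executors on every job rather than cached there.
val predictions: DataFrame = model.transform(testDF)

If that suspicion is right, it would also explain why maxDepth dominates: the serialized size of a tree can grow exponentially with its depth.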
I will dig deeper into the problem after the holidays.

2015-12-25 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>:

> so it looks like you're increasing num trees by 5x and you're seeing an 8x
> increase in runtime, correct?
>
> did you analyze the Spark cluster resources to monitor the memory usage,
> spillage, disk I/O, etc?
>
> you may need more Workers.
>
> On Tue, Dec 22, 2015 at 8:57 AM, Alexander Ratnikov
> <ratnikov.alexan...@gmail.com> wrote:
>>
>> Hi All,
>>
>> It would be good to get some tips on tuning Apache Spark for Random
>> Forest classification.
>> Currently, we have a model that looks like:
>>
>> featureSubsetStrategy all
>> impurity gini
>> maxBins 32
>> maxDepth 11
>> numberOfClasses 2
>> numberOfTrees 100
>>
>> We are running Spark 1.5.1 as a standalone cluster:
>> 1 Master and 2 Worker nodes, each with 32GB of RAM and 4 cores.
>> Classification takes 440 ms.
>>
>> When we increase the number of trees to 500, it already takes 8 seconds.
>> We tried reducing the depth, but then the error rate is higher. We have
>> around 246 attributes.
>>
>> Probably we are doing something wrong. Any ideas how we could improve
>> the performance?
>
> --
> Chris Fregly
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com