Re: Tips for Spark's Random Forest slow performance

2016-01-28 Thread Alexander Ratnikov
Coming back to this, I believe I found some reasons. Basically, the main logic sits inside ProbabilisticClassificationModel. It has a transform method which takes a DataFrame (the vector to classify) and appends to it columns computed by UDFs which actually do the prediction. The thing is that this DataFrame
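
To make that concrete, here is a minimal sketch (assuming the spark.ml API on Spark 1.5.x and an already-trained RandomForestClassificationModel named `model`; the feature values and column names are illustrative) of how scoring even a single vector goes through the DataFrame/UDF path that transform sets up:

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SQLContext

    // `sc` is an existing SparkContext; `model` is a trained RandomForestClassificationModel.
    val sqlContext = new SQLContext(sc)

    // Wrap the single vector to classify in a one-row DataFrame.
    val single = sqlContext.createDataFrame(Seq(
      (0.0, Vectors.dense(0.2, 1.5, 3.0))
    )).toDF("label", "features")

    // transform() appends the prediction columns (rawPrediction, probability, prediction)
    // computed by UDFs over the features column.
    val scored = model.transform(single)
    scored.select("rawPrediction", "probability", "prediction").show()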

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
So it looks like you're increasing the number of trees by 5x and seeing an 8x increase in runtime, correct? Did you analyze the Spark cluster resources to monitor memory usage, spillage, disk I/O, etc.? You may need more Workers.

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Alexander Ratnikov
Definitely the biggest difference is the maxDepth of the trees. With values smaller than or equal to 5, the time drops to milliseconds. The number of trees affects performance, but not that much. I tried to profile the app and I see a fair amount of time spent in serialization. I'm wondering if Spark isn't
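
A rough back-of-the-envelope sketch (not from the original message, just arithmetic on the stated parameters) of why maxDepth dominates: a binary tree of depth d has at most 2^(d+1) - 1 nodes, so deeper trees mean much larger models to serialize and walk at prediction time.

    // Rough illustration: upper bound on node count per tree as a function of maxDepth.
    def maxNodes(depth: Int): Long = (1L << (depth + 1)) - 1

    Seq(5, 11).foreach { d =>
      // 100 is the number of trees used in this thread
      println(s"maxDepth=$d -> up to ${maxNodes(d)} nodes/tree, ${maxNodes(d) * 100} nodes in the forest")
    }
    // maxDepth=5  -> up to 63 nodes/tree, 6300 nodes in the forest
    // maxDepth=11 -> up to 4095 nodes/tree, 409500 nodes in the forest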

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
Ah, so with that much serialization happening, you might actually need *fewer* workers! :) In the next couple of releases of Spark ML, we should see better scoring/prediction functionality using a single node for exactly this reason. To get there, we need model.save/load support (PMML?),
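
For what it's worth, the older RDD-based mllib API already exposes model persistence in 1.5; a minimal sketch (the path and the `mllibModel` variable are hypothetical), shown only as a reference point for what the spark.ml pipeline API still lacks here:

    import org.apache.spark.mllib.tree.model.RandomForestModel

    // `sc` is an existing SparkContext; `mllibModel` is a trained mllib RandomForestModel.
    mllibModel.save(sc, "hdfs:///tmp/rf-model-example")
    val reloaded = RandomForestModel.load(sc, "hdfs:///tmp/rf-model-example")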

Tips for Spark's Random Forest slow performance

2015-12-22 Thread Alexander Ratnikov
Hi All, It would be good to get some tips on tuning Apache Spark for Random Forest classification. Currently, we have a model that looks like:

  featureSubsetStrategy: all
  impurity: gini
  maxBins: 32
  maxDepth: 11
  numberOfClasses: 2
  numberOfTrees: 100

We are running Spark 1.5.1 as a standalone cluster.
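
For reference, a minimal sketch of that configuration, assuming the spark.ml API (consistent with the ProbabilisticClassificationModel discussed later in the thread); the column names, the `training` DataFrame, and the StringIndexer step are assumptions, and numberOfClasses (2) is inferred from the indexed label's metadata rather than set directly:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.StringIndexer

    // Index the label so the classifier can read the number of classes (2) from metadata.
    val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setFeatureSubsetStrategy("all")
      .setImpurity("gini")
      .setMaxBins(32)
      .setMaxDepth(11)
      .setNumTrees(100)

    // `training` is an existing DataFrame with "label" and "features" columns.
    val model = new Pipeline().setStages(Array(indexer, rf)).fit(training)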