SparkML RandomForest

2016-08-10 Thread Pengcheng
Hi there, I was comparing RandomForest in Spark ML (org.apache.spark.ml.classification) and Spark MLlib (org.apache.spark.mllib.tree) using the same datasets and the same parameter settings, and Spark MLlib always gives me better results on the test data sets. I was wondering: 1. Did anyone notice similar

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins? That reduces communication (at the cost of coarser discretization of continuous features). On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley wrote: > In my experience, 20K is a lot but often doable; 2K is easy; 200 is > small. Communication
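To illustrate the trade-off mentioned above (fewer bins means coarser discretization of a continuous feature), here is a minimal sketch in plain Python. It is a simplified stand-in, not Spark's actual binning code: the helper names `candidate_thresholds` and `bin_index` and the evenly spaced quantile rule are assumptions made for illustration only.

```python
from bisect import bisect_right


def candidate_thresholds(values, max_bins):
    """Pick at most max_bins - 1 split thresholds at evenly spaced quantiles.

    Hypothetical helper: a simplified model of discretizing one
    continuous feature, not the routine Spark itself uses.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, max_bins):
        t = ordered[(i * n) // max_bins]
        # Skip duplicate thresholds so every bin is distinct.
        if not cuts or t > cuts[-1]:
            cuts.append(t)
    return cuts


def bin_index(x, thresholds):
    """Return the index of the bin that value x falls into."""
    return bisect_right(thresholds, x)


# One continuous feature with 100 distinct values in [0, 1).
values = [i / 100 for i in range(100)]
fine = candidate_thresholds(values, 32)    # many candidate splits
coarse = candidate_thresholds(values, 4)   # few candidate splits

print(len(fine), len(coarse))
print(bin_index(0.3, coarse))
```

With fewer bins the tree has fewer candidate split points per feature to evaluate, which is where the saving comes from, but it can no longer separate values that land in the same bin.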

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small. Communication scales linearly in the number of features. On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov wrote: > Joseph, > > Correction, there 20k features. Is it still a lot? > What number
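As a rough back-of-the-envelope illustration of the linear scaling described above, the sketch below counts the sufficient-statistic entries that would be aggregated for a single tree node. The formula (features × bins × statistics per bin) is a simplified upper-bound model assumed for illustration; it ignores per-node feature subsampling and categorical features with fewer categories than bins. The function name `histogram_entries` is hypothetical, `max_bins=32` matches the Spark default, and `stats_per_bin=2` assumes binary classification.

```python
def histogram_entries(num_features: int, max_bins: int, stats_per_bin: int) -> int:
    """Rough upper bound on statistic entries aggregated for one tree node.

    Assumed model: every feature contributes max_bins bins, and each bin
    holds stats_per_bin numbers (e.g. one count per class).
    """
    return num_features * max_bins * stats_per_bin


# Feature counts taken from the thread; the other two values are assumptions.
for n in (200, 2_000, 20_000, 70_000):
    print(n, histogram_entries(n, max_bins=32, stats_per_bin=2))
```

Under this model, going from 2K to 20K features multiplies the per-node volume by ten, and halving `max_bins` halves it, which is consistent with both suggestions in the thread.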

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-31 Thread Eugene Morozov
Joseph, Correction: there are 20k features. Is that still a lot? What number of features can be considered normal? -- Be well! Jean Morozov On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote: > First thought: 70K features is *a lot* for the MLlib implementation (and >

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-30 Thread Eugene Morozov
One more thing. With the increased stack size it has already completed two more times, but now I see this in the log: [dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage 24860 contains a task of very large size (157 KB). The maximum recommended task size is 100 KB. Size of the task

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Joseph, I'm using 1.6.0. -- Be well! Jean Morozov On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote: > First thought: 70K features is *a lot* for the MLlib implementation (and > any PLANET-like implementation) > > Using fewer partitions is a good idea. > > Which

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and any PLANET-like implementation) Using fewer partitions is a good idea. Which Spark version was this on? On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote: > The questions I have in mind: >

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
The questions I have in mind: Is it something one might expect? From the stack trace itself it's not clear where it comes from. Is it an already known bug? I haven't found anything like that, though. Is it possible to configure something to work around / avoid this? I'm not sure it's the

SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi, I have a web service that provides a REST API to train a random forest. I train the random forest on a 5-node Spark cluster with enough memory - everything is cached (~22 GB). On small datasets of up to 100k samples everything is fine, but with the biggest one (400k samples and ~70k features)

SparkML. RandomForest scalability question.

2016-03-08 Thread Eugene Morozov
Hi, I have a 4-node cluster: one master (which also hosts the HDFS namenode) and 3 workers (which also host 3 colocated HDFS datanodes). Each worker has only 2 cores and spark.executor.memory is 2.3g. The input file is two HDFS blocks, with the block size configured as 64 MB. I train random forest regression with numTrees=50 and

Re: SparkML. RandomForest predict performance for small dataset.

2015-12-11 Thread Yanbo Liang
I think you are looking for the ability to predict on a single instance. That feature is under development; please refer to SPARK-10413. 2015-12-10 4:37 GMT+08:00 Eugene Morozov : > Hello, > > I'm using RandomForest pipeline (ml package). Everything is working fine >

SparkML. RandomForest predict performance for small dataset.

2015-12-09 Thread Eugene Morozov
Hello, I'm using a RandomForest pipeline (ml package). Everything is working fine (learning models, prediction, etc.), but I'd like to tune it for the case when I predict on a small dataset. My issue is that when I apply (PipelineModel)model.transform(dataset) The model consists of the following