Hi There,
I was comparing RandomForest in spark.ml (org.apache.spark.ml.classification)
and spark.mllib (org.apache.spark.mllib.tree) using the same datasets and
the same parameter settings; spark.mllib always gives me better results on the
test datasets.
I was wondering:
1. Did anyone notice similar
Can you try reducing maxBins? That reduces communication (at the cost of
coarser discretization of continuous features).
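To illustrate the trade-off: spark.ml chooses split candidates for a continuous feature from (approximate) quantiles, so a smaller maxBins means fewer, coarser candidate thresholds. The sketch below is a simplified illustration of that idea, not MLlib's actual binning code; the class and method names are hypothetical.

```java
// Simplified sketch of quantile-based split-candidate selection.
// NOT MLlib's real implementation; it only illustrates why a smaller
// maxBins gives a coarser discretization of a continuous feature.
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class BinningSketch {
    /** Interior thresholds at quantiles: maxBins bins need maxBins - 1 cuts. */
    public static double[] candidateThresholds(double[] values, int maxBins) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        Set<Double> cuts = new LinkedHashSet<>();
        for (int i = 1; i < maxBins; i++) {
            int idx = Math.min(i * sorted.length / maxBins, sorted.length - 1);
            cuts.add(sorted[idx]);
        }
        double[] out = new double[cuts.size()];
        int j = 0;
        for (double c : cuts) out[j++] = c;
        return out;
    }

    public static void main(String[] args) {
        double[] feature = new double[100];
        for (int i = 0; i < 100; i++) feature[i] = i + 1;
        // Finer discretization: 31 candidate cuts; coarser: only 7.
        System.out.println(candidateThresholds(feature, 32).length); // 31
        System.out.println(candidateThresholds(feature, 8).length);  // 7
    }
}
```

Fewer candidate thresholds per feature means fewer per-bin statistics to aggregate across the cluster, which is where the communication saving comes from.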
On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley wrote:
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small.
Communication scales linearly in the number of features.
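The linear scaling can be made concrete with a back-of-envelope model: in PLANET-style tree learning, each task contributes sufficient statistics of roughly nodes × features × bins × statsPerBin numbers, which are then aggregated. The constants below are assumptions for illustration (MLlib's exact statistics layout differs); the class and method names are hypothetical.

```java
// Back-of-envelope model of per-iteration communication in PLANET-style
// tree learning. The exact constants in MLlib differ; the point is the
// linear scaling in the number of features (and in maxBins).
public class CommCost {
    public static long aggregateBytes(long nodes, long features, long bins,
                                      long statsPerBin) {
        return nodes * features * bins * statsPerBin * 8; // 8 bytes per double
    }

    public static void main(String[] args) {
        long big   = aggregateBytes(15, 70_000, 32, 3); // roughly 0.8 GB of stats
        long small = aggregateBytes(15,  7_000, 32, 3);
        // 10x the features -> 10x the communication, all else equal.
        System.out.println(big / small); // 10
    }
}
```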
On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov wrote:
Joseph,
Correction, there are 20k features. Is that still a lot?
What number of features can be considered normal?
--
Be well!
Jean Morozov
On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote:
One more thing.
With the increased stack size it has completed twice more, but now I see this
in the log:
[dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage
24860 contains a task of very large size (157 KB). The maximum recommended
task size is 100 KB.
Size of the task
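The warning above comes from a fixed threshold check in Spark's scheduler: the 1.6-era TaskSetManager warns when a serialized task exceeds 100 KB. The sketch below paraphrases that check from the log line; the class and method names are hypothetical, and the message is reconstructed, not copied from Spark's source.

```java
// Paraphrased sketch of the check behind the TaskSetManager warning.
// Threshold taken from the log line above; names are hypothetical.
public class TaskSizeCheck {
    static final long TASK_SIZE_TO_WARN_KB = 100;

    /** Returns a warning message if the serialized task is too large, else null. */
    public static String warnIfLarge(long serializedBytes) {
        long kb = serializedBytes / 1024;
        if (kb > TASK_SIZE_TO_WARN_KB) {
            return "contains a task of very large size (" + kb + " KB). "
                 + "The maximum recommended task size is " + TASK_SIZE_TO_WARN_KB + " KB.";
        }
        return null;
    }

    public static void main(String[] args) {
        // Reproduces the situation in the log above: 157 KB > 100 KB.
        System.out.println(warnIfLarge(157 * 1024));
    }
}
```

Large task binaries in tree learning typically come from big per-task metadata (e.g. split/bin structures over many features) being shipped with every task.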
Joseph,
I'm using 1.6.0.
--
Be well!
Jean Morozov
On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley wrote:
First thought: 70K features is *a lot* for the MLlib implementation (and
any PLANET-like implementation).
Using fewer partitions is a good idea.
Which Spark version was this on?
On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote:
The questions I have in mind:
Is it something one might expect? From the stack trace itself it's not
clear where it comes from.
Is it an already-known bug? I haven't found anything like that, though.
Is it possible to configure something to work around / avoid this?
I'm not sure it's the
Hi,
I have a web service that provides a REST API to train the random forest
algorithm.
I train the random forest on a 5-node Spark cluster with enough memory;
everything is cached (~22 GB).
On small datasets of up to 100k samples everything is fine, but with the
biggest one (400k samples and ~70k features)
Hi,
I have a 4-node cluster: one master (which also hosts the HDFS namenode) and
3 workers (which also host 3 colocated HDFS datanodes). Each worker has only
2 cores, and spark.executor.memory is 2.3g.
The input file is two HDFS blocks; one block is configured as 64 MB.
I train random forest regression with numTrees=50 and
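The two-block input above determines the default partition count: HadoopRDD creates roughly one partition per HDFS block. The sketch below shows that arithmetic under simplifying assumptions (real InputFormat splitting also honors minPartitions and a small split slop); the class and method names are hypothetical.

```java
// Rough sketch of default input partitioning: one partition per HDFS block.
// Simplified; names are hypothetical and real split logic has more cases.
public class PartitionCount {
    public static long defaultPartitions(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Two 64 MB blocks -> 2 partitions, as in the cluster described above.
        System.out.println(defaultPartitions(128 * mb, 64 * mb)); // 2
    }
}
```

With only 2 input partitions and 3 workers of 2 cores each, most cores sit idle during input-bound stages; repartitioning can change that, though (as noted elsewhere in this thread) fewer partitions reduce per-iteration overhead in tree learning.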
I think you are looking for the ability to predict on a single instance. It's
a feature under development; please refer to SPARK-10413.
2015-12-10 4:37 GMT+08:00 Eugene Morozov wrote:
Hello,
I'm using the RandomForest pipeline (ml package). Everything is working fine
(learning models, prediction, etc.), but I'd like to tune it for the case
when I predict with a small dataset.
My issue is that when I apply
(PipelineModel)model.transform(dataset)
The model consists of the following