Re: SparkML algos limitations question.
The indexing I mentioned is more restrictive than that: each index corresponds to a unique position in a binary tree. (I.e., the first index of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC.) You're correct that this restriction could be removed; with some careful thought, we could probably avoid using indices altogether. I just created https://issues.apache.org/jira/browse/SPARK-14043 to track this.

On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov wrote:
> Hi, Joseph,
> [...]
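A minimal sketch of the indexing scheme described above, assuming the 1-based heap-style layout Joseph outlines (illustrative Python, not Spark's actual Scala implementation): the root is index 1, the children of node i are 2*i and 2*i + 1, so the first index of row (depth) d is 2^d.

```python
def first_index_of_row(depth):
    """Index of the left-most node at a given depth (root = depth 0)."""
    return 1 << depth  # 2**depth

def children(index):
    """Indices of the two children of the node at `index`."""
    return 2 * index, 2 * index + 1

print(first_index_of_row(0))  # 1
print(first_index_of_row(1))  # 2
print(first_index_of_row(2))  # 4
print(children(1))            # (2, 3)
```

Note that under this scheme a node's index encodes its full position in the tree, which is why the index space is consumed by depth rather than by the actual number of nodes.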
Re: SparkML algos limitations question.
Hi, Joseph,

I thought I understood why there is a limit of 30 levels for decision trees, but now I'm not so sure. I assumed it was because the decision tree is stored in an array, whose length is an int and therefore cannot exceed 2^31 - 1.

But here are my new discoveries. I've trained two different random forest models of 50 trees with different maxDepth (20 and 30) and node size = 5. Here are a couple of the resulting trees:

Model with maxDepth = 20:
depth=20, numNodes=471
depth=19, numNodes=497

Model with maxDepth = 30:
depth=30, numNodes=11347
depth=30, numNodes=10963

It looks like the trees are not well balanced, and I understand why that happens, but I'm surprised that the actual number of nodes is far less than 2^31 - 1, and now I'm not sure why the limitation actually exists. A tree consisting of 2^31 nodes would require 8G of memory just to store those indexes, so I'd say that depth isn't the biggest issue in such a case.

Is it possible to work around or simply ignore the maxDepth limitation (without modifying the codebase) and train the tree until I hit the maximum number of nodes? I'd assume that in most cases I simply won't hit it, even though the depth of the tree would be much more than 30.

--
Be well!
Jean Morozov

On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley wrote:
> Hi Eugene,
> [...]
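The arithmetic behind the two numbers in the message above can be checked directly (a back-of-the-envelope sketch, not Spark code): with the 1-based heap indexing, the nodes at depth d occupy indices 2^d through 2^(d+1) - 1, so depth 30 is exactly the deepest row whose indices still fit in a signed 32-bit Integer, and storing 2^31 four-byte indices would indeed take 8 GiB.

```python
INT_MAX = 2**31 - 1  # max value of a signed 32-bit Integer node ID

# Nodes at depth d occupy indices [2**d, 2**(d+1) - 1].
# Depth 30's right-most index still fits in an Int...
assert 2**31 - 1 <= INT_MAX
# ...but the left-most node of depth 31 already overflows:
assert 2**31 > INT_MAX

# Eugene's memory estimate: 2**31 node indices at 4 bytes each.
bytes_needed = 2**31 * 4
print(bytes_needed / 2**30)  # 8.0 (GiB)
```

This also shows why the actual node counts (around 11,000 at depth 30) can be so far below 2^31 - 1: an unbalanced tree with a minimum node size occupies only a tiny fraction of the index space that the depth limit must reserve.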
RE: SparkML algos limitations question.
Hi Yanbo,

As long as two models fit into the memory of a single machine, there should be no problems, so even 16GB machines can handle large models. (The master should have more memory because it runs LBFGS.) In my experiments, I've trained models with 12M and 32M parameters without issues.

Best regards,
Alexander

From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Sunday, December 27, 2015 2:23 AM
To: Joseph Bradley
Cc: Eugene Morozov; user; d...@spark.apache.org
Subject: Re: SparkML algos limitations question.

> Hi Eugene,
> [...]
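A rough estimate of the memory footprint of the model sizes Alexander mentions, assuming each parameter is stored as a 64-bit double (an assumption for illustration, not a statement about MultilayerPerceptronClassifier internals):

```python
def model_size_mb(num_params, bytes_per_param=8):
    """Approximate in-memory size of a dense parameter vector, in MiB."""
    return num_params * bytes_per_param / 2**20

print(round(model_size_mb(12_000_000)))  # ~92 MiB per copy of the model
print(round(model_size_mb(32_000_000)))  # ~244 MiB per copy of the model
```

At these sizes, two copies of even the larger model occupy well under 1 GB, consistent with the claim that 16GB machines handle them comfortably.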
Re: SparkML algos limitations question.
Hi Alexander,

That's cool! Thanks for the clarification.

Yanbo

2016-01-05 5:06 GMT+08:00 Ulanov, Alexander <alexander.ula...@hpe.com>:
> Hi Yanbo,
> [...]
Re: SparkML algos limitations question.
Hi Eugene,

AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems if the model is very large (such as >10M parameters), although I think the current limit already covers many use cases.

Yanbo

2015-12-16 6:00 GMT+08:00 Joseph Bradley:
> Hi Eugene,
> [...]
Re: SparkML algos limitations question.
Hi Eugene,

The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree. This simplified the implementation. I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap.

There is not an analogous limit for the GLMs you listed, but I'm not very familiar with the perceptron implementation.

Joseph

On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov wrote:
> Hello!
>
> I'm currently working on a POC and trying to use Random Forest
> (classification and regression). I also have to check SVM and multiclass
> perceptron (other algos are less important at the moment). So far I've
> discovered that Random Forest has a maxDepth limitation for trees, and
> just out of curiosity I wonder why such a limitation was introduced?
>
> The actual question is that I'm going to use Spark ML in production next
> year and would like to know whether there are other limitations like
> maxDepth in RF for other algorithms: Logistic Regression, Perceptron,
> SVM, etc.
>
> Thanks in advance for your time.
> --
> Be well!
> Jean Morozov