Re: SparkML algos limitations question.
The indexing I mentioned is more restrictive than that: each index corresponds to a unique position in a binary tree. (I.e., the first index of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC.) You're correct that this restriction could be removed; with some careful thought, we could probably avoid using indices altogether. I just created https://issues.apache.org/jira/browse/SPARK-14043 to track this.

On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov wrote:
> Hi, Joseph,
> [...]
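A minimal sketch of the indexing scheme described above, assuming the 1-based heap-style layout Joseph outlines (illustrative Python, not Spark's actual Scala implementation): the root is index 1, the children of node i are 2*i and 2*i + 1, so the first index of row (depth) d is 2^d.

```python
def first_index_of_row(depth):
    """Index of the left-most node at a given depth (root = depth 0)."""
    return 1 << depth  # 2**depth

def children(index):
    """Indices of the two children of the node at `index`."""
    return 2 * index, 2 * index + 1

print(first_index_of_row(0))  # 1
print(first_index_of_row(1))  # 2
print(first_index_of_row(2))  # 4
print(children(1))            # (2, 3)
```

Note that under this scheme a node's index encodes its full position in the tree, which is why the index space is consumed by depth rather than by the actual number of nodes.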
Re: SparkML algos limitations question.
Hi, Joseph,

I thought I understood why there is a limit of 30 levels for decision trees, but now I'm not so sure. I assumed it was because the decision tree is stored in an array, whose length is an int and therefore cannot exceed 2^31 - 1.

But here are my new discoveries. I've trained two different random forest models of 50 trees with different maxDepth (20 and 30) and node size = 5. Here are a couple of the resulting trees:

Model with maxDepth = 20:
depth=20, numNodes=471
depth=19, numNodes=497

Model with maxDepth = 30:
depth=30, numNodes=11347
depth=30, numNodes=10963

It looks like the trees are not well balanced, and I understand why that happens, but I'm surprised that the actual number of nodes is far less than 2^31 - 1, and now I'm not sure why the limitation actually exists. A tree consisting of 2^31 nodes would require 8G of memory just to store those indexes, so I'd say that depth isn't the biggest issue in such a case.

Is it possible to work around or simply ignore the maxDepth limitation (without modifying the codebase) and train the tree until I hit the maximum number of nodes? I'd assume that in most cases I simply won't hit it, even though the depth of the tree would be much more than 30.

--
Be well!
Jean Morozov

On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley wrote:
> Hi Eugene,
> [...]
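The arithmetic behind the two numbers in the message above can be checked directly (a back-of-the-envelope sketch, not Spark code): with the 1-based heap indexing, the nodes at depth d occupy indices 2^d through 2^(d+1) - 1, so depth 30 is exactly the deepest row whose indices still fit in a signed 32-bit Integer, and storing 2^31 four-byte indices would indeed take 8 GiB.

```python
INT_MAX = 2**31 - 1  # max value of a signed 32-bit Integer node ID

# Nodes at depth d occupy indices [2**d, 2**(d+1) - 1].
# Depth 30's right-most index still fits in an Int...
assert 2**31 - 1 <= INT_MAX
# ...but the left-most node of depth 31 already overflows:
assert 2**31 > INT_MAX

# Eugene's memory estimate: 2**31 node indices at 4 bytes each.
bytes_needed = 2**31 * 4
print(bytes_needed / 2**30)  # 8.0 (GiB)
```

This also shows why the actual node counts (around 11,000 at depth 30) can be so far below 2^31 - 1: an unbalanced tree with a minimum node size occupies only a tiny fraction of the index space that the depth limit must reserve.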
RE: SparkML algos limitations question.
Hi Yanbo,

As long as two models fit into the memory of a single machine, there should be no problems, so even 16GB machines can handle large models. (The master should have more memory because it runs LBFGS.) In my experiments, I've trained models with 12M and 32M parameters without issues.

Best regards,
Alexander

From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Sunday, December 27, 2015 2:23 AM
To: Joseph Bradley
Cc: Eugene Morozov; user; d...@spark.apache.org
Subject: Re: SparkML algos limitations question.

> Hi Eugene,
> [...]
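A rough estimate of the memory footprint of the model sizes Alexander mentions, assuming each parameter is stored as a 64-bit double (an assumption for illustration, not a statement about MultilayerPerceptronClassifier internals):

```python
def model_size_mb(num_params, bytes_per_param=8):
    """Approximate in-memory size of a dense parameter vector, in MiB."""
    return num_params * bytes_per_param / 2**20

print(round(model_size_mb(12_000_000)))  # ~92 MiB per copy of the model
print(round(model_size_mb(32_000_000)))  # ~244 MiB per copy of the model
```

At these sizes, two copies of even the larger model occupy well under 1 GB, consistent with the claim that 16GB machines handle them comfortably.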
Re: SparkML algos limitations question.
Hi Alexander,

That's cool! Thanks for the clarification.

Yanbo

2016-01-05 5:06 GMT+08:00 Ulanov, Alexander <alexander.ula...@hpe.com>:
> Hi Yanbo,
> [...]
Re: SparkML algos limitations question.
Hi Eugene,

AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems if the model is very large (such as >10M parameters), although I think the current limit already covers many use cases.

Yanbo

2015-12-16 6:00 GMT+08:00 Joseph Bradley:
> Hi Eugene,
> [...]
Re: SparkML algos limitations question.
Hi Eugene,

The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree. This simplified the implementation. I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap.

There is not an analogous limit for the GLMs you listed, but I'm not very familiar with the perceptron implementation.

Joseph

On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov wrote:
> Hello!
>
> I'm currently working on a POC and trying to use Random Forest
> (classification and regression). I also have to check SVM and multiclass
> perceptron (other algos are less important at the moment). So far I've
> discovered that Random Forest has a maxDepth limitation for trees, and
> just out of curiosity I wonder why such a limitation was introduced?
>
> The actual question is that I'm going to use Spark ML in production next
> year and would like to know whether there are other limitations like
> maxDepth in RF for other algorithms: Logistic Regression, Perceptron,
> SVM, etc.
>
> Thanks in advance for your time.
> --
> Be well!
> Jean Morozov