Has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments?

Regards,
Laeeq
On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung <coded...@cs.stanford.edu> 
wrote:
 
Yes, it should be data-specific, and perhaps we're biased toward the data sets 
that we are playing with. To put things in perspective, we're highly interested 
in (and, I believe, so are our customers):

1. large data sets (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are 
common, and even thousands in some cases - you can imagine that this is a big 
reason we require more 'complex' models
3. high-dimensional data with thousands of descriptive and sort-of-independent 
features

From a theoretical perspective, I would argue that it's usually in our best 
interest to prune as little as possible. I believe that pruning inherently 
increases the bias of an individual tree, which RF can't do anything about - 
RF only decreases variance, which is what it is for.


The default pruning criterion for R's reference implementation is a minimum 
node size of 1 (meaning a fully-grown tree) for classification, and 5 for 
regression. I'd imagine they did at least some empirical testing to justify 
these values at the time - although that was a time of small datasets :).
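
To make that concrete, here's a minimal sketch of such a minimum-node-size rule 
(the names are my own and hypothetical, not any particular library's API; exact 
semantics vary by implementation):

// A node may be split further only if it is larger than minNodeSize and,
// optionally, still above the depth limit.
case class StoppingRule(minNodeSize: Int, maxDepth: Option[Int] = None)

val classificationDefault = StoppingRule(minNodeSize = 1)  // effectively fully-grown
val regressionDefault     = StoppingRule(minNodeSize = 5)

def shouldSplit(numSamples: Long, depth: Int, rule: StoppingRule): Boolean = {
  val bigEnough   = numSamples > rule.minNodeSize
  val withinDepth = rule.maxDepth.forall(depth < _)
  bigEnough && withinDepth
}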

FYI, we are also considering the MLlib decision tree for our Gradient Boosting 
implementation; however, the memory requirement is still a bit too steep (we 
were getting heap exceptions at a depth limit of 5, with 2 GB per worker and 
approximately 1000 features). Now, 2 GB per worker is about what we expect our 
typical customers would tolerate, and I don't think that's unreasonable for 
shallow trees.




On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

What kind of data are you training on? These effects are *highly* data 
dependent, and while saying "the depth of 10 is simply not adequate to build 
high-accuracy models" may be accurate for the particular problem you're 
modeling, it is not true in general. From a statistical perspective, I consider 
each node in each tree an additional degree of freedom for the model, and all 
else equal I'd expect a model with fewer degrees of freedom to generalize 
better. Regardless, if there are lots of use cases for really deep trees, we'd 
like to hear about them so that we can decide how important they are to 
support! 
>
>
>
>In the context of CART, pruning refers very specifically to a cross-validation 
>step performed *after* a tree has been constructed to some depth. This was a 
>variance reduction technique in the original tree work that is unnecessary and 
>computationally expensive in the context of forests. In the original Random 
>Forests paper, there are still stopping criteria - usually either minimum leaf 
>size or minimum split improvement (or both), so "training to maximum depth" 
>doesn't mean "train until you've completely divided your dataset and there's 
>one point per leaf." My point is that if you set minimum leaf size to 
>something like 0.2% of the dataset, then you're not going to get deeper than 
>10 or 12 levels with a reasonably balanced tree.
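>
>As a rough back-of-the-envelope check (my own numbers, not anything from the 
>implementation):
>
>// With a minimum leaf fraction f and a roughly balanced binary tree, each
>// level halves the node size, so the depth is bounded by about log2(1 / f).
>val minLeafFraction = 0.002  // 0.2% of the dataset
>val depthBound = math.log(1.0 / minLeafFraction) / math.log(2.0)
>// depthBound ~= 8.97, i.e. roughly 9-10 levels for a balanced tree, and a
>// couple more when the tree is somewhat unbalanced.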
>
>
>
>With respect to PLANET - our implementation is very much in the spirit of 
>PLANET, but has some key differences - good documentation on exactly what the 
>differences are is forthcoming, so I won't belabor them here. The differences 
>are designed to 1) avoid data shuffling, and 2) minimize the number of passes 
>over the training data. Of course, there are tradeoffs involved, and there is 
>at least one really good trick in the PLANET work that we should leverage but 
>aren't yet - namely, once the nodes get small enough for their data to fit 
>easily on a single machine, the data can be shuffled and then the remainder of 
>the tree can be trained in parallel from each lower node on a single machine. 
>This would actually help with the memory overheads in model training when 
>trees get deep - if someone wants to modify the current implementation of 
>trees in MLlib and contribute this optimization as a pull request, it would 
>be welcome!
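>
>To make the local-training trick concrete, here is a sketch only (with 
>hypothetical helper names, not the actual MLlib implementation): once a node's 
>rows fit comfortably on one machine, collect them and finish that subtree in 
>memory instead of doing further distributed level-by-level passes.
>
>import org.apache.spark.rdd.RDD
>import org.apache.spark.mllib.regression.LabeledPoint
>
>// `trainDistributedLevel` and `trainLocally` are hypothetical stand-ins for
>// the two training paths; the point is only the switch on node size.
>def trainSubtree(nodeData: RDD[LabeledPoint],
>                 localRowThreshold: Long,
>                 trainDistributedLevel: RDD[LabeledPoint] => Unit,
>                 trainLocally: Array[LabeledPoint] => Unit): Unit = {
>  if (nodeData.count() <= localRowThreshold) {
>    trainLocally(nodeData.collect())  // small enough: finish subtree in memory
>  } else {
>    trainDistributedLevel(nodeData)   // still large: another distributed pass
>  }
>}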
>
>
>At any rate, we'll take this feedback into account with respect to improving 
>the tree implementation, but if anyone can send over use cases or (even 
>better) datasets where really deep trees are necessary, that would be great!
>
>
>
>
>
>
>
>On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> 
>wrote:
>
>Well, if you read the original paper, 
>http://oz.berkeley.edu/~breiman/randomforest2001.pdf
>>"Grow the tree using CART methodology to maximum size and do not prune."
>>
>>
>>Now, The Elements of Statistical Learning, on page 598, says that you could 
>>potentially overfit a fully-grown regression random forest. However, this 
>>effect is very slight, and likely negligible for classification.
>>http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>>
>>
>>
>>In our experiments, however, if the pruning is "drastic", then the performance 
>>actually becomes much worse. This makes intuitive sense IMO, because a 
>>decision tree is a non-parametric model and the expressiveness of a tree 
>>depends on the number of nodes.
>>
>>
>>With a huge amount of data (millions or even billions of rows), we found that 
>>a depth of 10 is simply not adequate to build high-accuracy models.
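>>
>>As a quick sanity check on expressiveness (my own arithmetic, not from any 
>>paper): a binary tree of depth d has at most 2^d leaves, i.e. at most 2^d 
>>regions of the feature space.
>>
>>def maxLeaves(depth: Int): Long = 1L << depth
>>// maxLeaves(10) == 1024   -- fewer regions than "thousands of target categories"
>>// maxLeaves(30) ~= 1.07e9 -- on the order of the row counts discussed above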
>>
>>
>>
>>On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> 
>>wrote:
>>
>>Hmm... can you provide some pointers to examples where deep trees are 
>>helpful? 
>>>
>>>
>>>Typically with Decision Trees you limit depth (either directly or indirectly 
>>>with minimum node size and minimum improvement criteria) to avoid 
>>>overfitting. I agree with the assessment that forests are a variance 
>>>reduction technique, but I'd be a little surprised if a bunch of hugely deep 
>>>trees don't overfit to training data. I guess I view limiting tree depth as 
>>>an analogue to regularization in linear models.
>>>
>>>
>>>
>>>On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> 
>>>wrote:
>>>
>>>Evan,
>>>>
>>>>I actually haven't heard of 'shallow' random forests. I think that the only 
>>>>scenarios where shallow trees are useful are boosting scenarios.
>>>>
>>>>
>>>>AFAIK, Random Forest is a variance-reducing technique and doesn't do much 
>>>>about bias (although some people claim that it does have some bias-reducing 
>>>>effect). Because shallow trees typically have higher bias than fully-grown 
>>>>trees, people don't often use shallow trees with RF.
>>>>
>>>>
>>>>You can confirm this through some experiments with R's random forest 
>>>>implementation as well; it allows you to set limits on depth and/or 
>>>>pruning.
>>>>
>>>>
>>>>In contrast, boosting is a bias reduction technique (and increases 
>>>>variance), so people typically use shallow trees.
>>>>
>>>>
>>>>Our empirical experiments also confirmed that shallow trees resulted in 
>>>>drastically lower accuracy for random forests.
>>>>
>>>>
>>>>There are some papers that mix a boosting-like technique with bootstrap 
>>>>averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could 
>>>>potentially use shallow trees to build boosted learners, but then average 
>>>>the results of many boosted learners.
>>>>
>>>>
>>>>
>>>>On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> 
>>>>wrote:
>>>>
>>>>Multiclass classification, Gradient Boosting, and Random Forest support 
>>>>based on the recent Decision Tree implementation in MLlib. 
>>>>>
>>>>>
>>>>>Sung - I'd be curious to hear about your use of decision trees (and 
>>>>>forests) where you want to go to 100+ depth. My experience with random 
>>>>>forests has been that people typically build hundreds of shallow trees 
>>>>>(maybe depth 7 or 8), rather than a few (or many) really deep trees.
>>>>>
>>>>>
>>>>>Generally speaking, we save passes over the data by computing histograms 
>>>>>per variable per split at each *level* of a decision tree. This can blow 
>>>>>up as the tree gets deep, but I'd recommend a lot more memory than 2-4 GB 
>>>>>per worker for most big data workloads.
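>>>>>
>>>>>To give a feel for the blow-up (a rough illustration with made-up numbers, 
>>>>>not MLlib's exact memory layout): sufficient statistics are aggregated for 
>>>>>every (node at the current level, feature, bin) combination, and the node 
>>>>>count doubles with each level, so the aggregate grows exponentially with 
>>>>>depth.
>>>>>
>>>>>def approxAggregateBytes(level: Int, numFeatures: Int, numBins: Int,
>>>>>                         statsPerBin: Int, bytesPerStat: Int = 8): Long = {
>>>>>  val nodesAtLevel = 1L << level  // 2^level for a binary tree
>>>>>  nodesAtLevel * numFeatures * numBins * statsPerBin * bytesPerStat
>>>>>}
>>>>>// e.g. 1000 features, 32 bins, 3 stats per bin:
>>>>>// level 5 -> ~25 MB, level 10 -> ~0.8 GB, level 15 -> ~25 GB per aggregate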
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung 
>>>>><coded...@cs.stanford.edu> wrote:
>>>>>
>>>>>Debasish, we've tested the MLlib decision tree a bit and it eats up too 
>>>>>much memory for RF purposes.
>>>>>>Once the tree got to depth 8~9, it was easy to get heap exceptions, even 
>>>>>>with 2~4 GB of memory per worker.
>>>>>>
>>>>>>
>>>>>>With RF, it's very easy to get to a depth of 100+ with even only 100,000+ 
>>>>>>rows (because trees usually are not balanced). Additionally, the lack of 
>>>>>>multi-class classification limits its applicability.
>>>>>>
>>>>>>
>>>>>>Also, RF requires random features per tree node to be effective (not just 
>>>>>>bootstrap samples), and the MLlib decision tree doesn't support that.
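>>>>>>
>>>>>>For reference, the per-node feature subsampling I mean is roughly this 
>>>>>>(a sketch with hypothetical names, not MLlib code):
>>>>>>
>>>>>>import scala.util.Random
>>>>>>
>>>>>>// At every node, consider only a random subset of features for the split
>>>>>>// -- typically about sqrt(numFeatures) for classification. This per-node
>>>>>>// subsampling is what decorrelates the trees in the forest.
>>>>>>def featureSubsetForNode(numFeatures: Int, rng: Random): Array[Int] = {
>>>>>>  val mtry = math.max(1, math.sqrt(numFeatures).round.toInt)
>>>>>>  rng.shuffle((0 until numFeatures).toVector).take(mtry).toArray
>>>>>>}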
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> 
>>>>>>wrote:
>>>>>>
>>>>>>MLlib has a decision tree... there is an RF PR which is not active 
>>>>>>now... take that and swap its tree builder with the fast tree builder 
>>>>>>that's in MLlib... search for the Spark JIRA... the code is based on the 
>>>>>>Google PLANET paper.
>>>>>>>I am sure people on the dev list are already working on it... send an 
>>>>>>>email there to find out the status.
>>>>>>>There is also an RF in Cloudera Oryx, but we could not run it on our data 
>>>>>>>yet.
>>>>>>>Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, 
>>>>>>>but it does not scale.
>>>>>>>
>>>>>>>On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:
>>>>>>>
>>>>>>>Hi,
>>>>>>>>
>>>>>>>>For one of my applications, I want to use Random Forests (RF) on top of 
>>>>>>>>Spark. I see that currently MLlib does not have an implementation of RF. 
>>>>>>>>What other open-source RF implementations would be good to use with 
>>>>>>>>Spark in terms of speed?
>>>>>>>>
>>>>>>>>Regards,
>>>>>>>>Laeeq Ahmed,
>>>>>>>>KTH, Sweden.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
