Evan,

Wasn't the MLlib decision tree implemented using ideas from Google's PLANET paper? Does that paper also propose growing a shallow tree?
Thanks,
Deb

On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Additionally, the 'random features per node' parameter (mtry in R) is a very important feature for Random Forest. The variance reduction comes from the trees being decorrelated from each other, and the random features per node often do more for that decorrelation than the bootstrap samples. This is something that would have to be supported at the tree level.

On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."

Now, The Elements of Statistical Learning (page 598) says that you could potentially overfit a fully-grown regression random forest. However, this effect is very slight, and likely negligible for classification.
http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, if the pruning is "drastic", then performance actually becomes much worse. This makes intuitive sense IMO, because a decision tree is a non-parametric model, and the expressiveness of a tree depends on the number of nodes.

With a huge amount of data (millions or even billions of rows), we found that a depth of 10 is simply not adequate for building high-accuracy models.

On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Hmm... can you provide some pointers to examples where deep trees are helpful?

Typically with decision trees you limit depth (either directly, or indirectly via minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reduction technique, but I'd be a little surprised if a bunch of hugely deep trees didn't overfit to the training data.
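[Editor's note: the "random features per node" (mtry) mechanism discussed above can be sketched in a few lines. This is an illustrative sketch only, not MLlib or R code; the function names and the sqrt(p) default are assumptions based on the common Random Forest convention for classification.]

```python
import math
import random

def features_for_node(n_features, mtry, rng):
    """Sample the candidate feature subset for a single tree node (no replacement)."""
    return rng.sample(range(n_features), mtry)

def feature_sets_per_node(n_features, n_nodes, rng):
    """Illustrate that every node considers a *different* random feature subset.

    mtry defaults to sqrt(p), the usual heuristic for classification forests.
    Resampling at each node (rather than once per tree) is what decorrelates
    the trees, per the discussion above.
    """
    mtry = max(1, int(math.sqrt(n_features)))
    return [features_for_node(n_features, mtry, rng) for _ in range(n_nodes)]

rng = random.Random(42)
subsets = feature_sets_per_node(n_features=100, n_nodes=5, rng=rng)
for s in subsets:
    print(sorted(s))
```

With p = 100 features, each node evaluates splits on only 10 candidate features, and different nodes draw different subsets, so even trees grown on the same bootstrap sample diverge.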
I guess I view limiting tree depth as an analogue to regularization in linear models.

On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Evan,

I actually haven't heard of 'shallow' random forests. I think that the only scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance-reducing technique and doesn't do much about bias (although some people claim that it does have some bias-reducing effect). Because shallow trees typically have higher bias than fully-grown trees, people don't often use shallow trees with RF.

You can confirm this through some experiments with R's random forest implementation as well. It allows you to set limits on depth and/or pruning.

In contrast, boosting is a bias reduction technique (and increases variance), so people typically use shallow trees.

Our empirical experiments also confirmed that shallow trees resulted in drastically lower accuracy for random forests.

There are some papers that mix a boosting-like technique with bootstrap averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could potentially use shallow trees to build boosted learners, but then average the results of many boosted learners.

On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that people typically build hundreds of shallow trees (maybe depth 7 or 8), rather than a few (or many) really deep trees.
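[Editor's note: the variance-reduction point above has a standard quantitative form: for B identically distributed trees with variance sigma^2 and pairwise correlation rho, the ensemble average has variance rho*sigma^2 + (1-rho)*sigma^2/B, so decorrelation (low rho) matters more than adding trees once B is large. A small simulation, purely illustrative and not from the thread, bears this out.]

```python
import math
import random
import statistics

def ensemble_mean(rho, sigma, n_trees, rng):
    """Average of n_trees correlated 'tree predictions': one shared Gaussian
    component (inducing pairwise correlation rho) plus an independent
    component per tree."""
    shared = rng.gauss(0.0, sigma * math.sqrt(rho))
    return shared + statistics.fmean(
        rng.gauss(0.0, sigma * math.sqrt(1.0 - rho)) for _ in range(n_trees)
    )

rng = random.Random(0)
trials = 20000
# Theoretical variances for sigma=1, B=50:
#   rho=0.1 -> 0.1 + 0.9/50  = 0.118
#   rho=0.9 -> 0.9 + 0.1/50  = 0.902
var_decorrelated = statistics.pvariance(
    [ensemble_mean(0.1, 1.0, 50, rng) for _ in range(trials)])
var_correlated = statistics.pvariance(
    [ensemble_mean(0.9, 1.0, 50, rng) for _ in range(trials)])
print(round(var_decorrelated, 3), round(var_correlated, 3))
```

Averaging 50 highly correlated trees barely helps (variance stays near 0.9), while averaging 50 decorrelated trees cuts variance nearly to the independent-tree floor, which is exactly why per-node feature sampling is emphasized in the thread.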
Generally speaking, we save passes over the data by computing histograms per variable per split at each *level* of a decision tree. This can blow up as the decision tree gets deep, but I'd recommend a lot more memory than 2-4 GB per worker for most big data workloads.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker.

With RF, it's very easy to get 100+ depth with even only 100,000+ rows (because trees usually are not balanced). Additionally, the lack of multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just bootstrap samples), and the MLlib decision tree doesn't support that.

On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:

MLlib has a decision tree. There is an RF PR which is not active now; take that and swap in the fast tree builder that's in MLlib. Search for the Spark JIRA; the code is based on the Google PLANET paper.

I am sure people on the dev list are already working on it; send an email there to find out the status.

There is also an RF in Cloudera Oryx, but we could not run it on our data yet.

Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, but it does not scale.
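[Editor's note: Evan's point about level-wise histogram aggregation blowing up with depth can be made concrete with a back-of-the-envelope count. In PLANET-style level-wise training, one pass over the data fills a class-count histogram cell for every (node at the current level, feature, bin) triple, and a complete binary tree has 2^d nodes at depth d. The parameter values below (100 features, 32 bins, 2 classes, 8-byte counters) are illustrative assumptions, not MLlib's actual defaults.]

```python
def level_histogram_cells(depth, n_features, n_bins, n_classes):
    """Upper bound on histogram cells aggregated for ONE level of level-wise
    tree training: (nodes at this depth) x features x bins x classes."""
    n_nodes = 2 ** depth  # complete binary tree: 2^d nodes at depth d
    return n_nodes * n_features * n_bins * n_classes

for depth in (5, 8, 10):
    cells = level_histogram_cells(depth, n_features=100, n_bins=32, n_classes=2)
    mib = cells * 8 / 2**20  # assuming 8-byte counters
    print(f"depth {depth}: {cells:,} cells (~{mib:.1f} MiB)")
```

The cell count doubles with every level, and in practice these buffers are held (and shuffled) per partition during aggregation, which is consistent with heap pressure appearing around depth 8~9 on small workers and depth 100+ being out of reach for this strategy.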
On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:

Hi,

For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation for RF. What other open-source RF implementations would be good to use with Spark in terms of speed?

Regards,
Laeeq Ahmed,
KTH, Sweden.