Evan,

Wasn't the MLlib decision tree implemented using ideas from Google's PLANET paper? Does that paper also propose growing a shallow tree?
Thanks,
Deb

On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Additionally, the 'random features per node' parameter (mtry in R) is a very important feature for Random Forest. The variance reduction comes from the trees being decorrelated from each other, and the random features per node often do more for that decorrelation than the bootstrap samples. This is something that would have to be supported at the tree level.

On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."

Now, The Elements of Statistical Learning (page 598) says that you could potentially overfit a fully-grown regression random forest. However, this effect is very slight, and likely negligible for classification.
http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, if the pruning is "drastic", then performance actually becomes much worse. This makes intuitive sense IMO, because a decision tree is a non-parametric model, and the expressiveness of a tree depends on the number of nodes.

With a huge amount of data (millions or even billions of rows), we found that a depth of 10 is simply not adequate for building high-accuracy models.

On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Hmm... can you provide some pointers to examples where deep trees are helpful?

Typically with decision trees you limit depth (either directly, or indirectly via minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reduction technique, but I'd be a little surprised if a bunch of hugely deep trees didn't overfit to the training data.
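[Editor's note: the "random features per node" (mtry) mechanism discussed above can be sketched in a few lines. This is an illustrative sketch only, not MLlib or R code; the function names and the sqrt(p) default are assumptions based on the common Random Forest convention for classification.]

```python
import math
import random

def features_for_node(n_features, mtry, rng):
    """Sample the candidate feature subset for a single tree node (no replacement)."""
    return rng.sample(range(n_features), mtry)

def feature_sets_per_node(n_features, n_nodes, rng):
    """Illustrate that every node considers a *different* random feature subset.

    mtry defaults to sqrt(p), the usual heuristic for classification forests.
    Resampling at each node (rather than once per tree) is what decorrelates
    the trees, per the discussion above.
    """
    mtry = max(1, int(math.sqrt(n_features)))
    return [features_for_node(n_features, mtry, rng) for _ in range(n_nodes)]

rng = random.Random(42)
subsets = feature_sets_per_node(n_features=100, n_nodes=5, rng=rng)
for s in subsets:
    print(sorted(s))
```

With p = 100 features, each node evaluates splits on only 10 candidate features, and different nodes draw different subsets, so even trees grown on the same bootstrap sample diverge.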
I guess I view limiting tree depth as an analogue to regularization in linear models.

On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Evan,

I actually haven't heard of 'shallow' random forests. I think that the only scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance-reducing technique and doesn't do much about bias (although some people claim that it does have some bias-reducing effect). Because shallow trees typically have higher bias than fully-grown trees, people don't often use shallow trees with RF.

You can confirm this through some experiments with R's random forest implementation as well. It allows you to set limits on depth and/or pruning.

In contrast, boosting is a bias reduction technique (and increases variance), so people typically use shallow trees.

Our empirical experiments also confirmed that shallow trees resulted in drastically lower accuracy for random forests.

There are some papers that mix a boosting-like technique with bootstrap averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could potentially use shallow trees to build boosted learners, but then average the results of many boosted learners.

On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Multiclass classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that people typically build hundreds of shallow trees (maybe depth 7 or 8), rather than a few (or many) really deep trees.
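[Editor's note: the variance-reduction point above has a standard quantitative form: for B identically distributed trees with variance sigma^2 and pairwise correlation rho, the ensemble average has variance rho*sigma^2 + (1-rho)*sigma^2/B, so decorrelation (low rho) matters more than adding trees once B is large. A small simulation, purely illustrative and not from the thread, bears this out.]

```python
import math
import random
import statistics

def ensemble_mean(rho, sigma, n_trees, rng):
    """Average of n_trees correlated 'tree predictions': one shared Gaussian
    component (inducing pairwise correlation rho) plus an independent
    component per tree."""
    shared = rng.gauss(0.0, sigma * math.sqrt(rho))
    return shared + statistics.fmean(
        rng.gauss(0.0, sigma * math.sqrt(1.0 - rho)) for _ in range(n_trees)
    )

rng = random.Random(0)
trials = 20000
# Theoretical variances for sigma=1, B=50:
#   rho=0.1 -> 0.1 + 0.9/50  = 0.118
#   rho=0.9 -> 0.9 + 0.1/50  = 0.902
var_decorrelated = statistics.pvariance(
    [ensemble_mean(0.1, 1.0, 50, rng) for _ in range(trials)])
var_correlated = statistics.pvariance(
    [ensemble_mean(0.9, 1.0, 50, rng) for _ in range(trials)])
print(round(var_decorrelated, 3), round(var_correlated, 3))
```

Averaging 50 highly correlated trees barely helps (variance stays near 0.9), while averaging 50 decorrelated trees cuts variance nearly to the independent-tree floor, which is exactly why per-node feature sampling is emphasized in the thread.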
Generally speaking, we save passes over the data by computing histograms per variable per split at each *level* of a decision tree. This can blow up as the decision tree gets deep, but I'd recommend a lot more memory than 2-4 GB per worker for most big data workloads.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker.

With RF, it's very easy to get 100+ depth with even only 100,000+ rows (because trees usually are not balanced). Additionally, the lack of multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just bootstrap samples), and the MLlib decision tree doesn't support that.

On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:

MLlib has a decision tree. There is an RF PR which is not active now; take that and swap in the fast tree builder that's in MLlib. Search for the Spark JIRA; the code is based on the Google PLANET paper.

I am sure people on the dev list are already working on it; send an email there to find out the status.

There is also an RF in Cloudera Oryx, but we could not run it on our data yet.

Weka 3.7.10 has a multi-threaded RF that is good for some ad hoc runs, but it does not scale.
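[Editor's note: Evan's point about level-wise histogram aggregation blowing up with depth can be made concrete with a back-of-the-envelope count. In PLANET-style level-wise training, one pass over the data fills a class-count histogram cell for every (node at the current level, feature, bin) triple, and a complete binary tree has 2^d nodes at depth d. The parameter values below (100 features, 32 bins, 2 classes, 8-byte counters) are illustrative assumptions, not MLlib's actual defaults.]

```python
def level_histogram_cells(depth, n_features, n_bins, n_classes):
    """Upper bound on histogram cells aggregated for ONE level of level-wise
    tree training: (nodes at this depth) x features x bins x classes."""
    n_nodes = 2 ** depth  # complete binary tree: 2^d nodes at depth d
    return n_nodes * n_features * n_bins * n_classes

for depth in (5, 8, 10):
    cells = level_histogram_cells(depth, n_features=100, n_bins=32, n_classes=2)
    mib = cells * 8 / 2**20  # assuming 8-byte counters
    print(f"depth {depth}: {cells:,} cells (~{mib:.1f} MiB)")
```

The cell count doubles with every level, and in practice these buffers are held (and shuffled) per partition during aggregation, which is consistent with heap pressure appearing around depth 8~9 on small workers and depth 100+ being out of reach for this strategy.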
On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:

Hi,

For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation for RF. What other open-source RF implementations would be good to use with Spark in terms of speed?

Regards,
Laeeq Ahmed,
KTH, Sweden.