Hi,

Stratosphere does not have a real RF implementation yet. There is only a
prototype, developed by students in a university course, and it is far from
production-ready at this stage.

--sebastian

On 04/18/2014 10:31 AM, Sean Owen wrote:
Mahout RDF is fairly old code. If you try it, try to use 1.0-SNAPSHOT,
as you will almost certainly need this patch to make it run reasonably
fast: https://issues.apache.org/jira/browse/MAHOUT-1419

I have not tried Stratosphere here.

Since we are on the subject of RDF on Hadoop, possibly on M/R, I don't
feel too bad advertising this: oryx also does
classification/regression via RDF:
https://github.com/cloudera/oryx#classification--regression-example

This is a fairly different design choice from, say, what's in the
PLANET paper. The one big negative is that trees are built only over a
sub-sample of the data. But given that big simplifying assumption, a
lot of other things work well. It's not iterative, so it is not
handicapped by being M/R-based. It may be of interest if you're building or
benchmarking things on Hadoop.

Personally, going forward, I'm interested in something smarter (like
what I see going into the new Spark impl), although there really are
some big design tradeoffs here, yes.


--
Sean Owen | Director, Data Science | London


On Fri, Apr 18, 2014 at 9:21 AM, Laeeq Ahmed <laeeqsp...@yahoo.com> wrote:
Has anyone tried Mahout RF or Stratosphere RF with Spark? Any comments?

Regards,
Laeeq
On Friday, April 18, 2014 3:11 AM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:
Yes, it should be data-specific, and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and I believe our customers are):

1. large (hundreds of millions of rows)
2. multi-class classification - nowadays dozens of target categories are
common, and even thousands in some cases - you could imagine that this is a
big reason why we require more 'complex' models
3. high-dimensional data with thousands of descriptive and sort-of-independent
features

From the theoretical perspective, I would argue that it's usually in our
best interest to prune as little as possible. I believe that pruning
inherently increases the bias of an individual tree, which RF can't do
anything about; RF only decreases variance - which is what RF is for.

The default pruning criterion for R's reference implementation is a minimum
node size of 1 (meaning a fully-grown tree) for classification and 5 for
regression. I'd imagine they did at least some empirical testing to justify
these values at the time - although at a time of small datasets :).
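
To make the stopping rule concrete, here is a minimal, library-agnostic
sketch (regression-style, with a toy variance-based split search; none of
the names come from R or MLlib) showing where a minimum node size enters a
recursive tree builder - with minNodeSize = 1 the tree grows out fully,
while minNodeSize = 5 stops a little earlier:

  case class Row(features: Array[Double], label: Double)

  sealed trait Node
  case class Leaf(prediction: Double) extends Node
  case class Split(feature: Int, threshold: Double,
                   left: Node, right: Node) extends Node

  // Toy split search: minimize weighted label variance over all
  // feature/threshold pairs.
  def bestSplit(rows: Seq[Row]): Option[(Int, Double)] = {
    def variance(xs: Seq[Double]): Double = {
      val m = xs.sum / xs.size
      xs.map(x => (x - m) * (x - m)).sum / xs.size
    }
    val candidates = for {
      f <- 0 until rows.head.features.length
      t <- rows.map(_.features(f)).distinct.sorted.dropRight(1) // both sides non-empty
    } yield {
      val (l, r) = rows.partition(_.features(f) <= t)
      val score = (l.size * variance(l.map(_.label)) +
                   r.size * variance(r.map(_.label))) / rows.size
      (f, t, score)
    }
    if (candidates.isEmpty) None
    else { val (f, t, _) = candidates.minBy(_._3); Some((f, t)) }
  }

  def grow(rows: Seq[Row], minNodeSize: Int): Node = {
    val labels = rows.map(_.label)
    if (rows.size <= minNodeSize || labels.distinct.size == 1)
      Leaf(labels.sum / labels.size)              // stop: node too small or already pure
    else bestSplit(rows) match {
      case None => Leaf(labels.sum / labels.size) // no usable split left
      case Some((f, t)) =>
        val (l, r) = rows.partition(_.features(f) <= t)
        Split(f, t, grow(l, minNodeSize), grow(r, minNodeSize))
    }
  }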

FYI, we are also considering the MLlib decision tree for our Gradient
Boosting implementation; however, the memory requirement is still a bit too
steep (we were getting heap exceptions at a depth limit of 5 with 2GB per
worker and approximately 1000 features). Now, 2GB per worker is about what
we expect our typical customers would tolerate, and I don't think that it's
unreasonable for shallow trees.



On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks <evan.spa...@gmail.com>
wrote:

What kind of data are you training on? These effects are *highly* data
dependent, and while saying "the depth of 10 is simply not adequate to build
high-accuracy models" may be accurate for the particular problem you're
modeling, it is not true in general. From a statistical perspective, I
consider each node in each tree an additional degree of freedom for the
model, and all else equal I'd expect a model with fewer degrees of freedom
to generalize better. Regardless, if there are lots of use cases for really
deep trees, we'd like to hear about them so that we can decide how important
they are to support!

In the context of CART, pruning very specifically refers to a step *after*
a tree has been constructed to some depth, using cross-validation. This was a
variance reduction technique in the original tree work that is unnecessary
and computationally expensive in the context of forests. In the original
Random Forests paper, there are still stopping criteria - usually either
minimum leaf size or minimum split improvement (or both), so "training to
maximum depth" doesn't mean "train until you've completely divided your
dataset and there's one point per leaf." My point is that if you set minimum
leaf size to something like 0.2% of the dataset, then you're not going to
get deeper than 10 or 12 levels with a reasonably balanced tree.
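
A quick back-of-the-envelope check of that claim (assuming a roughly
balanced tree whose node populations halve at each level, and the 0.2%
figure from above):

  val minLeafFraction = 0.002
  val impliedDepth = math.log(1.0 / minLeafFraction) / math.log(2.0)
  println(f"levels until nodes reach the minimum leaf size: $impliedDepth%.1f")
  // ~9.0 for a perfectly balanced tree; real trees are somewhat unbalanced,
  // which is why 10-12 levels is a reasonable expectation.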

With respect to PLANET - our implementation is very much in the spirit of
PLANET, but has some key differences - good documentation on exactly what
the differences are is forthcoming, so I won't belabor them here. The
differences are designed to 1) avoid data shuffling, and 2) minimize the
number of passes over the training data. Of course, there are tradeoffs
involved, and there is at least one really good trick in the PLANET work
that we should leverage but aren't yet - namely, once the nodes get small
enough for their data to fit easily on a single machine, the data can be
shuffled and the remainder of the tree can be trained in parallel from each
lower node on a single machine. This would actually help with the memory
overheads in model training when trees get deep - if someone wants to modify
the current implementation of trees in MLlib and contribute this
optimization as a pull request, it would be welcome!
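
For what it's worth, a rough sketch of that hand-off could look like the
following. None of this is MLlib's actual code - the row/node types and the
two trainers are hypothetical placeholders passed in as functions:

  import org.apache.spark.rdd.RDD

  // Once a node's partition of the data is small enough to fit on one
  // machine, collect it and finish the whole subtree locally; otherwise
  // keep building level-by-level over the distributed data.
  def growNode[Row, Node](
      data: RDD[Row],
      depth: Int,
      maxLocalRows: Long,
      trainSubtreeInMemory: (Array[Row], Int) => Node,  // single-machine builder
      trainLevelDistributed: (RDD[Row], Int) => Node    // distributed, histogram-based
  ): Node = {
    if (data.count() <= maxLocalRows) trainSubtreeInMemory(data.collect(), depth)
    else trainLevelDistributed(data, depth)
  }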

At any rate, we'll take this feedback into account with respect to improving
the tree implementation, but if anyone can send over use cases or (even
better) datasets where really deep trees are necessary, that would be great!




On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu>
wrote:

Well, if you read the original paper
(http://oz.berkeley.edu/~breiman/randomforest2001.pdf), it says:
"Grow the tree using CART methodology to maximum size and do not prune."

Now, the Elements of Statistical Learning book, on page 598, says that you
could potentially overfit a fully-grown regression random forest. However,
this effect is very slight, and likely negligible for classification.
http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, if the pruning is "drastic", then the
performance actually becomes much worse. This makes intuitive sense IMO,
because a decision tree is a non-parametric model and the expressiveness of
a tree depends on the number of nodes.

With a huge amount of data (millions or even billions of rows), we found
that a depth of 10 is simply not adequate to build high-accuracy models.


On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com>
wrote:

Hmm... can you provide some pointers to examples where deep trees are
helpful?

Typically with Decision Trees you limit depth (either directly or indirectly
with minimum node size and minimum improvement criteria) to avoid
overfitting. I agree with the assessment that forests are a variance
reduction technique, but I'd be a little surprised if a bunch of hugely deep
trees didn't overfit to the training data. I guess I view limiting tree depth as
an analogue to regularization in linear models.


On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu>
wrote:

Evan,

I actually haven't heard of 'shallow' random forests. I think that the only
scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance-reducing technique and doesn't do much
about bias (although some people claim that it does have some bias-reducing
effect). Because shallow trees typically have higher bias than fully-grown
trees, people don't often use shallow trees with RF.

You can confirm this through some experiments with R's random forest
implementation as well. It allows you to set limits on depth and/or
pruning.

In contrast, boosting is a bias reduction technique (and increases
variance), so people typically use shallow trees.

Our empirical experiments also confirmed that shallow trees resulted in
drastically lower accuracy for random forests.

There are some papers that mix boosting-like techniques with bootstrap
averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could
potentially use shallow trees to build boosted learners, but then average
the results of many boosted learners.


On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com>
wrote:

Multiclass classification, Gradient Boosting, and Random Forest support,
based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests)
where you want to go to 100+ depth. My experience with random forests has
been that people typically build hundreds of shallow trees (maybe depth 7 or
8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms per
variable per split at each *level* of a decision tree. This can blow up as
the tree gets deep, but I'd recommend a lot more memory than 2-4GB per
worker for most big data workloads.
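
To illustrate why that blows up with depth - the counting below is an
assumption for illustration, not MLlib's measured footprint - at level d
there can be up to 2^d open nodes, each aggregating one statistic per
feature per candidate bin per class:

  def histogramBytes(depth: Int, numFeatures: Int,
                     numBins: Int, numClasses: Int): Long =
    (1L << depth) * numFeatures.toLong * numBins * numClasses * 8L

  // e.g. 1000 features, 32 bins, 2 classes:
  //   depth 5  -> ~16 MB of split statistics for the level
  //   depth 10 -> ~500 MB
  println(histogramBytes(10, 1000, 32, 2) / (1L << 20) + " MB")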





On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu>
wrote:

Debasish, we've tested the MLlib decision tree a bit and it eats up too much
memory for RF purposes.
Once the tree got to depth 8~9, it was easy to get heap exceptions, even with
2~4 GB of memory per worker.

With RF, it's very easy to get 100+ depth with even only 100,000+ rows
(because trees usually are not balanced). Additionally, the lack of
multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just
bootstrap samples), and the MLlib decision tree doesn't support that.
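
For clarity, here is a minimal sketch of the per-node feature subsampling
being referred to (names are illustrative, not MLlib's API; sqrt(p)
features per node is the common default for classification):

  import scala.util.Random

  // At every node, the split search only considers a random subset of
  // features, typically sqrt(p) of them, rather than all p features.
  def featuresToConsider(numFeatures: Int, rng: Random): Seq[Int] = {
    val k = math.max(1, math.sqrt(numFeatures.toDouble).round.toInt)
    rng.shuffle((0 until numFeatures).toVector).take(k)
  }

  // Inside the node-splitting routine you would then restrict the impurity
  // search to these features, e.g. (bestThresholdFor and rowsAtNode are
  // hypothetical):
  //   val candidates = featuresToConsider(numFeatures, rng)
  //   val best = candidates.map(f => bestThresholdFor(f, rowsAtNode)).minBy(_.impurity)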


On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com>
wrote:

MLlib has a decision tree... there is an RF PR which is not active now...
take that and swap its tree builder with the fast tree builder that's in
MLlib... search for the Spark JIRA... the code is based on the Google PLANET
paper...
I am sure people on the dev list are already working on it... send an email
to find out the status over there...
There is also an RF in Cloudera Oryx, but we could not run it on our data
yet...
Weka 3.7.10 has a multi-threaded RF that is good for some ad-hoc runs, but
it does not scale...
On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:

Hi,

For one of my applications, I want to use Random Forests (RF) on top of
Spark. I see that currently MLlib does not have an implementation of RF.
What other open-source RF implementations would be good to use with Spark in
terms of speed?

Regards,
Laeeq Ahmed,
KTH, Sweden.