Debasish,

Unfortunately, we are bound to YARN, at least for the time being, because that's what most of our customers will be using (unless all the Hadoop vendors start supporting standalone Spark - I think Cloudera might do that?).
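The per-container memory constraints discussed in this thread can be sanity-checked with a quick back-of-the-envelope sketch. This is an illustration only, not Spark code: the 8 GB cap and the overhead formula (a minimum of 384 MB or a small fraction of the heap, as in Spark-on-YARN deployments of this era) are assumptions - check your cluster's yarn-site.xml for the actual values.

```python
# Back-of-the-envelope check: does a requested executor heap fit under a
# YARN maximum-allocation cap? YARN sizes the container as executor heap
# plus an off-heap overhead. The 8192 MB cap, 7% overhead fraction, and
# 384 MB floor are illustrative assumptions, not authoritative defaults.

def fits_in_yarn_container(executor_mem_mb, yarn_max_allocation_mb=8192,
                           overhead_fraction=0.07, min_overhead_mb=384):
    overhead_mb = max(min_overhead_mb, int(executor_mem_mb * overhead_fraction))
    container_mb = executor_mem_mb + overhead_mb
    return container_mb <= yarn_max_allocation_mb

print(fits_in_yarn_container(7 * 1024))  # 7 GB heap + overhead fits: True
print(fits_in_yarn_container(8 * 1024))  # 8 GB heap + overhead exceeds the cap: False
```

Under these assumptions, asking for the full 8 GB as executor heap fails, because the container must also hold the off-heap overhead - which is one reason the "strict memory requirement per container" bites.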
On Fri, Apr 18, 2014 at 11:12 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Spark on YARN is a big pain due to the strict memory requirement per container...

If you are stress testing it, could you use a standalone cluster and see at which feature-count upper bound the per-worker RAM requirement reaches 16 GB or more? It is possible to get 16 GB instances on EC2 these days without much trouble...

Another way is to run a feature selection algorithm to shrink the feature space before running decision trees or their variants... There is a PR on entropy-based feature selection algorithms... you don't want to use them to decrease features, right?

A feature extraction algorithm like matrix factorization and its variants could be used to shrink the feature space as well...

On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Thanks for the info on the memory requirement.

I think that a lot of businesses would prefer to use Spark on top of YARN, since that's what they invest in - a large Hadoop cluster. And the default setting for YARN seems to cap memory per container at 8 GB - so ideally, we would like to use a lot less than that (rather than telling them, "no, change your YARN settings").

A convenient feature would be to automatically figure things out and adapt the algorithm to memory limits (e.g., process X nodes at a time, instead of all the nodes at the same level). Additionally, we noticed that the default 'Double' usage for LabeledPoint is very wasteful for a majority of data sets. Float would do most of the time, and in fact a lot of datasets could get away with using Short or even Byte. Or in your case, since you're transforming data to bins anyway, you could probably cache bin IDs (for which you could use Short or even Byte)?

On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Interesting, and thanks for the thoughts.

I think we're on the same page with 100s of millions of records. We've tested the tree implementation in MLlib on 1b rows and up to 100 features - though this isn't hitting the 1000s of features you mention.

Obviously multi-class support isn't there yet, but I can see your point about deeper trees for many-class problems. Will try them out on some image processing stuff with 1k classes we're doing in the lab once they are more developed, to get a sense for where the issues are.

If you're only allocating 2GB/worker, you're going to have a hard time getting the real advantages of Spark.

For your 1k features causing heap exceptions at depth 5 - are these categorical or continuous? The categorical vars create much smaller histograms.

If you're fitting all continuous features, the memory requirement is O(b*d*2^l), where b = number of histogram bins, d = number of features, and l = level of the tree. Even accounting for object overhead, with the default number of bins, the histograms at this depth should be on the order of 10s of MB, not 2GB - so I'm guessing your cached data is occupying a significant chunk of that 2GB? In the tree PR, Hirakendu Das tested down to depth 10 on 500m data points with 20 continuous features and was able to run without hitting memory issues (and the scaling properties got better as the depth grew). His worker memory was 7.5GB, and 30% of that was reserved for caching. If you wanted to go to 1000 features at depth 10, I'd estimate a couple of gigs of heap space would be necessary for the worker to compute/store the histograms, and I'd guess 2x that on the master to do the reduce.

Again, 2GB per worker is pretty tight, because there are overheads of just starting the JVM, launching a worker, loading libraries, etc.
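Evan's O(b*d*2^l) estimate can be instantiated numerically. The sketch below assumes 32 histogram bins and 8 bytes per bin statistic; both numbers, and the one-statistic-per-bin simplification, are assumptions for illustration (the real implementation keeps several statistics per bin, plus JVM object overhead).

```python
# Rough instantiation of the O(b * d * 2^l) histogram-memory estimate
# for fitting continuous features level-by-level.
# Assumptions: 32 bins, one 8-byte statistic per bin.

def histogram_bytes(bins=32, features=1000, level=5, bytes_per_bin=8):
    nodes_at_level = 2 ** level  # frontier of a balanced binary tree
    return bins * features * nodes_at_level * bytes_per_bin

# Depth 5, 1000 continuous features: ~8 MB of raw bin statistics --
# "10s of MB, not 2GB" even after accounting for object overhead.
print(histogram_bytes(level=5) / 1e6)   # 8.192 (MB)

# Depth 10: ~262 MB of raw statistics; with several statistics per bin
# plus JVM overhead, "a couple of gigs" of heap is plausible.
print(histogram_bytes(level=10) / 1e6)  # 262.144 (MB)
```

The 2^l factor is the key scaling term: each additional level doubles the histogram footprint, which is why deep trees blow up memory under this level-at-a-time strategy.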
- Evan

On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Yes, it should be data-specific, and perhaps we're biased toward the data sets that we are playing with. To put things in perspective, we're highly interested in (and I believe our customers are):

1. large data (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are common and even thousands in some cases - you could imagine that this is a big reason for us requiring more 'complex' models
3. high-dimensional data with thousands of descriptive and sort-of-independent features

From the theoretical perspective, I would argue that it's usually in our best interest to prune as little as possible. I believe that pruning inherently increases the bias of an individual tree, which RF can't do anything about, while decreasing variance - and decreasing variance is what RF is for.

The default pruning criterion for R's reference implementation is a minimum node size of 1 (meaning a fully-grown tree) for classification, and 5 for regression. I'd imagine they did at least some empirical testing to justify these values at the time - although at a time of small datasets :).

FYI, we are also considering the MLlib decision tree for our Gradient Boosting implementation; however, the memory requirement is still a bit too steep (we were getting heap exceptions at a depth limit of 5 with 2GB per worker and approximately 1000 features). Now, 2GB per worker is about what we expect our typical customers would tolerate, and I don't think that's unreasonable for shallow trees.

On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

What kind of data are you training on?
These effects are *highly* data dependent, and while saying "the depth of 10 is simply not adequate to build high-accuracy models" may be accurate for the particular problem you're modeling, it is not true in general. From a statistical perspective, I consider each node in each tree an additional degree of freedom for the model, and all else equal I'd expect a model with fewer degrees of freedom to generalize better. Regardless, if there are lots of use cases for really deep trees, we'd like to hear about them so that we can decide how important they are to support!

In the context of CART, pruning refers very specifically to a step *after* a tree has been constructed to some depth, using cross-validation. This was a variance reduction technique in the original tree work that is unnecessary and computationally expensive in the context of forests. In the original Random Forests paper, there are still stopping criteria - usually either minimum leaf size or minimum split improvement (or both) - so "training to maximum depth" doesn't mean "train until you've completely divided your dataset and there's one point per leaf." My point is that if you set the minimum leaf size to something like 0.2% of the dataset, then you're not going to get deeper than 10 or 12 levels with a reasonably balanced tree.

With respect to PLANET - our implementation is very much in the spirit of PLANET, but has some key differences. Good documentation on exactly what the differences are is forthcoming, so I won't belabor them here. The differences are designed to 1) avoid data shuffling, and 2) minimize the number of passes over the training data. Of course, there are tradeoffs involved, and there is at least one really good trick in the PLANET work that we should leverage but aren't yet - namely, once the nodes get small enough for the data to fit easily on a single machine, the data can be shuffled and then the remainder of the tree can be trained in parallel from each lower node on a single machine. This would actually help with the memory overheads in model training when trees get deep - if someone wants to modify the current implementation of trees in MLlib and contribute this optimization as a pull request, it would be welcome!

At any rate, we'll take this feedback into account with respect to improving the tree implementation, but if anyone can send over use cases or (even better) datasets where really deep trees are necessary, that would be great!

On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Well, if you read the original paper, http://oz.berkeley.edu/~breiman/randomforest2001.pdf: "Grow the tree using CART methodology to maximum size and do not prune."

Now, The Elements of Statistical Learning (page 598) says that you could potentially overfit a fully-grown regression random forest. However, this effect is very slight, and likely negligible for classification. http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, if the pruning is "drastic", then the performance actually becomes much worse. This makes intuitive sense IMO, because a decision tree is a non-parametric model, and the expressibility of a tree depends on the number of nodes.

With a huge amount of data (millions or even billions of rows), we found that a depth of 10 is simply not adequate to build high-accuracy models.
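Evan's claim above - that a minimum leaf size of 0.2% bounds a reasonably balanced tree to roughly 10-12 levels - is just logarithm arithmetic, sketched below. The balanced-halving assumption is an idealization; real trees split unevenly and so can go somewhat deeper, which is why the practical bound is a bit looser than the computed value.

```python
import math

# A balanced binary tree roughly halves the data at each split, so a node
# at depth k covers about n / 2^k points. Growth stops once that falls
# below the minimum leaf size, giving depth <= log2(1 / leaf_fraction).

def max_balanced_depth(min_leaf_fraction):
    return math.floor(math.log2(1.0 / min_leaf_fraction))

print(max_balanced_depth(0.002))  # min leaf = 0.2% of the data -> 8
```

With imbalance, depth can exceed this idealized bound, but not by much for a minimum leaf size this large - consistent with Evan's "10 or 12 levels", and with Sung's observation that unbalanced trees reach far greater depths only when the leaf-size floor is tiny.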
On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Hmm... can you provide some pointers to examples where deep trees are helpful?

Typically with decision trees you limit depth (either directly, or indirectly with minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reduction technique, but I'd be a little surprised if a bunch of hugely deep trees don't overfit to training data. I guess I view limiting tree depth as an analogue to regularization in linear models.

On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Evan,

I actually haven't heard of 'shallow' random forests. I think the only scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance reducing technique and doesn't do much about bias (although some people claim that it does have some bias reducing effect). Because shallow trees typically have higher bias than fully-grown trees, people don't often use shallow trees with RF.

You can confirm this through some experiments with R's random forest implementation as well. It allows you to set some limits on depth and/or pruning.

In contrast, boosting is a bias reduction technique (and increases variance), so people typically use shallow trees.

Our empirical experiments also confirmed that shallow trees resulted in drastically lower accuracy for random forests.

There are some papers that mix a boosting-like technique with bootstrap averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could potentially use shallow trees to build boosted learners, but then average the results of many boosted learners.

On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Multi-class classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that people typically build hundreds of shallow trees (maybe depth 7 or 8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms per variable per split at each *level* of a decision tree. This can blow up as the level of the decision tree gets deep, but I'd recommend a lot more memory than 2-4GB per worker for most big data workloads.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker.

With RF, it's very easy to get 100+ depth with even only 100,000+ rows (because trees usually are not balanced). Additionally, the lack of multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just bootstrap samples), and the MLlib decision tree doesn't support that.
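Sung's last point - that RF needs per-node feature subsampling on top of bootstrapped rows - can be made concrete with a minimal sketch. The function names and the sqrt(d) subset size are illustrative conventions, not MLlib's API.

```python
import random

# Minimal sketch of Random Forest's two sources of randomness.
# Function names are illustrative, not MLlib's API.

def bootstrap_sample(n_rows, rng):
    """Sample row indices WITH replacement -- drawn once per tree.
    This alone gives bagged trees, not a Random Forest."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng):
    """Sample ~sqrt(d) feature indices WITHOUT replacement -- redrawn at
    EVERY tree node. This per-node decorrelation is the ingredient that
    plain bagging lacks."""
    k = max(1, int(n_features ** 0.5))
    return rng.sample(range(n_features), k)

rng = random.Random(42)
rows = bootstrap_sample(100_000, rng)   # per tree
feats = candidate_features(1000, rng)   # per node: 31 of 1000 features
print(len(rows), len(feats))            # 100000 31
```

Because only the feature subset changes from node to node, supporting this in a shared tree builder mostly means restricting the per-level histogram computation to each node's candidate features, rather than changing how the data is partitioned.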
On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:

MLlib has a decision tree... there is an RF PR which is not active now... take that and swap the tree builder with the fast tree builder that's in MLlib... search for the Spark JIRA... the code is based on the Google PLANET paper...

I am sure people on the dev list are already working on it... send an email to find out the status over there...

There is also an RF in Cloudera Oryx, but we could not run it on our data yet...

Weka 3.7.10 has a multi-threaded RF that is good for some ad-hoc runs, but it does not scale...

On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:

Hi,

For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation for RF. What other open-source RF implementations would be good to use with Spark in terms of speed?

Regards,
Laeeq Ahmed,
KTH, Sweden.