Debasish,

Unfortunately, we are bound to YARN, at least for the time being, because
that's what most of our customers would be using (unless all the Hadoop
vendors start supporting standalone Spark - I think Cloudera might do
that?).




On Fri, Apr 18, 2014 at 11:12 AM, Debasish Das <debasish.da...@gmail.com> wrote:

> Spark on YARN is a big pain due to the strict memory requirement per
> container...
>
> If you are stress testing it, could you use a standalone cluster and see
> at what feature-count upper bound the per-worker RAM requirement reaches 16 GB
> or more...it is possible to get 16 GB instances on EC2 these days without
> much trouble...
>
> Another way is to run a feature selection algorithm to shrink the feature
> space before running decision trees or their variants...There is a PR on
> entropy-based feature selection algorithms...you'd want to use those to
> reduce the feature count, right?
>
> A feature extraction algorithm like matrix factorization and its variants
> could also be used to shrink the feature space...
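>
> For illustration, a rough sketch of that idea (assuming a Spark build where
> MLlib's RowMatrix exposes computePrincipalComponents - if yours doesn't,
> treat this as a sketch only): project the rows onto k principal components
> to shrink the feature space before tree training.
>
>   import org.apache.spark.mllib.linalg.Vector
>   import org.apache.spark.mllib.linalg.distributed.RowMatrix
>   import org.apache.spark.rdd.RDD
>
>   // Project each row onto the top-k principal components.
>   def reduceFeatures(rows: RDD[Vector], k: Int): RDD[Vector] = {
>     val mat = new RowMatrix(rows)
>     val pc = mat.computePrincipalComponents(k) // d x k projection matrix
>     mat.multiply(pc).rows                      // each row now has k features
>   }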
>
>
>
> On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung <
> coded...@cs.stanford.edu> wrote:
>
>> Thanks for the info on mem requirement.
>>
>> I think that a lot of businesses would probably prefer to use Spark on
>> top of YARN, since that's what they invest in - a large Hadoop cluster. And
>> the default setting for YARN seems to cap memory per container at 8 GB - so
>> ideally, we would like to use a lot less than that (rather than telling
>> them, 'no, change your YARN settings').
>>
>> A convenient feature would be to automatically figure things out and
>> adapt the algorithm to memory limits (e.g., process X nodes at a
>> time, instead of all the nodes at the same level). Additionally, we noticed
>> that the default 'Double' usage for LabeledPoint is very wasteful for a
>> majority of data sets. Float would do most of the time, and in fact a lot of
>> datasets could get away with using Short or even Byte. Or in your case,
>> since you're transforming the data to bins anyway, you could probably cache
>> the bin IDs (for which Short or even Byte would suffice)?
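>>
>> To make the bin-ID idea concrete, a rough sketch (my own, not MLlib code;
>> it assumes at most 128 bins per feature so a signed Byte suffices, and a
>> hypothetical threshold-based binning helper):
>>
>>   // Binned features fit in an Array[Byte] - 8x smaller than the
>>   // Array[Double] inside a LabeledPoint.
>>   case class BinnedPoint(label: Float, bins: Array[Byte])
>>
>>   // thresholds(f) holds the sorted split candidates for feature f;
>>   // a feature's bin id is its insertion position among them.
>>   def toBinned(label: Double, features: Array[Double],
>>                thresholds: Array[Array[Double]]): BinnedPoint = {
>>     val bins = new Array[Byte](features.length)
>>     var f = 0
>>     while (f < features.length) {
>>       var b = 0
>>       while (b < thresholds(f).length && features(f) > thresholds(f)(b)) b += 1
>>       bins(f) = b.toByte
>>       f += 1
>>     }
>>     BinnedPoint(label.toFloat, bins)
>>   }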
>>
>>
>>
>> On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>
>>> Interesting, and thanks for the thoughts.
>>>
>>> I think we're on the same page with 100s of millions of records. We've
>>> tested the tree implementation in mllib on 1b rows and up to 100 features -
>>> though this isn't hitting the 1000s of features you mention.
>>>
>>> Obviously multi-class support isn't there yet, but I can see your point
>>> about deeper trees for many-class problems. Will try them out on some image
>>> processing stuff with 1k classes we're doing in the lab once they are more
>>> developed, to get a sense for where the issues are.
>>>
>>> If you're only allocating 2GB/worker you're going to have a hard time
>>> getting the real advantages of Spark.
>>>
>>> For your 1k features causing heap exceptions at depth 5 - are these
>>> categorical or continuous? Categorical vars create much smaller
>>> histograms.
>>>
>>> If you're fitting all continuous features, the memory requirement is
>>> O(b*d*2^l), where b = number of histogram bins, d = number of features, and
>>> l = level of the tree. Even accounting for object overhead, with the default
>>> number of bins the histograms at this depth should be on the order of tens
>>> of MB, not 2GB - so I'm guessing your cached data is occupying a significant
>>> chunk of that 2GB? In the tree PR, Hirakendu Das tested down to depth 10 on
>>> 500m data points with 20 continuous features and ran without memory issues
>>> (and the scaling properties got better as the depth grew). His worker memory
>>> was 7.5GB, with 30% of that reserved for caching. If you wanted to go to
>>> 1000 features at depth 10, I'd estimate a couple of gigs of heap space for
>>> the worker to compute/store the histograms, and I'd guess 2x that on the
>>> master to do the reduce.
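>>>
>>> A back-of-the-envelope version of that O(b*d*2^l) estimate, for anyone
>>> following along (the bin count and bytes-per-bin here are my assumptions,
>>> not MLlib's actual defaults):
>>>
>>>   val b = 100              // histogram bins per feature (assumed)
>>>   val d = 1000             // features
>>>   val nodes = 1 << 10      // 2^10 frontier nodes at depth l = 10
>>>   val bytesPerBin = 3 * 8L // ~3 aggregated stats of 8 bytes each (assumed)
>>>   val histBytes = b.toLong * d * nodes * bytesPerBin
>>>   // histBytes ~= 2.5e9, i.e. about 2.5 GB - consistent with the
>>>   // "couple of gigs" estimate for 1000 features at depth 10.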
>>>
>>> Again, 2GB per worker is pretty tight, because there are overheads from
>>> just starting the JVM, launching a worker, loading libraries, etc.
>>>
>>> - Evan
>>>
>>> On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung <coded...@cs.stanford.edu>
>>> wrote:
>>>
>>> Yes, it should be data-specific, and perhaps we're biased toward the data
>>> sets that we are playing with. To put things in perspective, we're highly
>>> interested in (and I believe our customers are):
>>>
>>> 1. large (hundreds of millions of rows)
>>> 2. multi-class classification - nowadays, dozens of target categories
>>> are common and even thousands in some cases - you could imagine that this
>>> is a big reason for us requiring more 'complex' models
>>> 3. high dimensional with thousands of descriptive and
>>> sort-of-independent features
>>>
>>> From a theoretical perspective, I would argue that it's usually in one's
>>> best interest to prune as little as possible. Pruning inherently increases
>>> the bias of an individual tree, and RF can't do anything about bias - it
>>> only decreases variance, which is what RF is for.
>>>
>>> The default stopping criterion (min node size) for R's reference
>>> implementation is 1 (meaning fully grown trees) for classification, and 5
>>> for regression. I'd imagine they did at least some empirical testing to
>>> justify these values at the time - although at a time of small datasets :).
>>>
>>> FYI, we are also considering the MLLib decision tree for our Gradient
>>> Boosting implementation; however, the memory requirement is still a bit too
>>> steep (we were getting heap exceptions at a depth limit of 5 with 2GB per
>>> worker and approximately 1000 features). Now, 2GB per worker is about what
>>> we expect our typical customers would tolerate, and I don't think that's
>>> unreasonable for shallow trees.
>>>
>>>
>>>
>>> On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks
>>> <evan.spa...@gmail.com> wrote:
>>>
>>>> What kind of data are you training on? These effects are *highly* data
>>>> dependent, and while saying "the depth of 10 is simply not adequate to
>>>> build high-accuracy models" may be accurate for the particular problem
>>>> you're modeling, it is not true in general. From a statistical perspective,
>>>> I consider each node in each tree an additional degree of freedom for the
>>>> model, and all else equal I'd expect a model with fewer degrees of freedom
>>>> to generalize better. Regardless, if there are lots of use cases for really
>>>> deep trees, we'd like to hear about them so that we can decide how
>>>> important they are to support!
>>>>
>>>> In the context of CART - pruning very specifically refers to a step
>>>> *after* a tree has been constructed to some depth using cross-validation.
>>>> This was a variance reduction technique in the original tree work that is
>>>> unnecessary and computationally expensive in the context of forests. In the
>>>> original Random Forests paper, there are still stopping criteria - usually
>>>> either minimum leaf size or minimum split improvement (or both), so
>>>> "training to maximum depth" doesn't mean "train until you've completely
>>>> divided your dataset and there's one point per leaf." My point is that if
>>>> you set minimum leaf size to something like 0.2% of the dataset, then
>>>> you're not going to get deeper than 10 or 12 levels with a reasonably
>>>> balanced tree.
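>>>>
>>>> The arithmetic behind that depth bound, as a quick sketch (my numbers,
>>>> just formalizing the claim above):
>>>>
>>>>   // A balanced node at depth d holds ~N / 2^d points, so a minimum
>>>>   // leaf size of 0.2% of the data forces 2^d <= 1 / 0.002 = 500.
>>>>   val minLeafFraction = 0.002
>>>>   val maxDepth = math.floor(math.log(1.0 / minLeafFraction) / math.log(2)).toInt
>>>>   // maxDepth == 8 for a perfectly balanced tree; mild imbalance stretches
>>>>   // this to roughly the 10-12 levels mentioned above.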
>>>>
>>>> With respect to PLANET - our implementation is very much in the spirit
>>>> of PLANET, but has some key differences. Good documentation on exactly
>>>> what the differences are is forthcoming, so I won't belabor them here.
>>>> The differences are designed to 1) avoid data shuffling, and 2) minimize
>>>> the number of passes over the training data. Of course, there are
>>>> tradeoffs involved, and there is at least one really good trick in the
>>>> PLANET work that we should leverage but aren't yet: once the nodes get
>>>> small enough for their data to fit easily on a single machine, the data
>>>> can be shuffled and then the remainder of the tree can be trained in
>>>> parallel from each lower node on a single machine. This would actually
>>>> help with the memory overheads in model training when trees get deep - if
>>>> someone wants to modify the current implementation of trees in MLlib and
>>>> contribute this optimization as a pull request, it would be welcome!
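>>>>
>>>> For anyone interested in picking that up, a hedged sketch of the hand-off
>>>> (the helper names here are hypothetical, not MLlib's API): once a frontier
>>>> node's slice of the data is small enough, pull it to one machine and
>>>> finish the subtree there instead of doing more distributed passes.
>>>>
>>>>   import org.apache.spark.SparkContext._
>>>>   import org.apache.spark.rdd.RDD
>>>>   import org.apache.spark.mllib.regression.LabeledPoint
>>>>
>>>>   // nodeIdOf assigns each point to its current frontier node;
>>>>   // trainSubtreeLocally grows the rest of a subtree on one machine.
>>>>   def finishSmallNodesLocally(
>>>>       data: RDD[LabeledPoint],
>>>>       nodeIdOf: LabeledPoint => Int,
>>>>       maxLocalPoints: Long,
>>>>       trainSubtreeLocally: Seq[LabeledPoint] => Unit): Unit = {
>>>>     val counts = data.map(p => (nodeIdOf(p), 1L)).reduceByKey(_ + _).collectAsMap()
>>>>     val small = counts.filter(_._2 <= maxLocalPoints).keySet
>>>>     data.map(p => (nodeIdOf(p), p))
>>>>         .filter { case (id, _) => small.contains(id) }
>>>>         .groupByKey()  // the one shuffle this strategy pays for
>>>>         .foreach { case (_, pts) => trainSubtreeLocally(pts.toSeq) }
>>>>   }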
>>>>
>>>> At any rate, we'll take this feedback into account with respect to
>>>> improving the tree implementation, but if anyone can send over use cases or
>>>> (even better) datasets where really deep trees are necessary, that would be
>>>> great!
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <
>>>> coded...@cs.stanford.edu> wrote:
>>>>
>>>>> Well, if you read the original paper,
>>>>> http://oz.berkeley.edu/~breiman/randomforest2001.pdf:
>>>>> "Grow the tree using CART methodology to maximum size and do not
>>>>> prune."
>>>>>
>>>>> Now, The Elements of Statistical Learning (page 598) says that you
>>>>> could potentially overfit a fully grown regression random forest.
>>>>> However, this effect is very slight, and likely negligible for
>>>>> classification.
>>>>> http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>>>>>
>>>>> In our experiments, however, drastic pruning makes performance much
>>>>> worse. This makes intuitive sense IMO, because a decision tree is a
>>>>> non-parametric model, and the expressiveness of a tree depends on its
>>>>> number of nodes.
>>>>>
>>>>> With a huge amount of data (millions or even billions of rows), we
>>>>> found that a depth of 10 is simply not adequate to build high-accuracy
>>>>> models.
>>>>>
>>>>>
>>>>> On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <
>>>>> evan.spa...@gmail.com> wrote:
>>>>>
>>>>>> Hmm... can you provide some pointers to examples where deep trees are
>>>>>> helpful?
>>>>>>
>>>>>> Typically with Decision Trees you limit depth (either directly or
>>>>>> indirectly with minimum node size and minimum improvement criteria) to
>>>>>> avoid overfitting. I agree with the assessment that forests are a 
>>>>>> variance
>>>>>> reduction technique, but I'd be a little surprised if a bunch of hugely
>>>>>> deep trees don't overfit to training data. I guess I view limiting tree
>>>>>> depth as an analogue to regularization in linear models.
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <
>>>>>> coded...@cs.stanford.edu> wrote:
>>>>>>
>>>>>>> Evan,
>>>>>>>
>>>>>>> I actually haven't heard of 'shallow' random forests. I think that
>>>>>>> the only scenarios where shallow trees are useful are boosting
>>>>>>> scenarios.
>>>>>>>
>>>>>>> AFAIK, Random Forest is a variance-reducing technique and doesn't do
>>>>>>> much about bias (although some people claim that it does have some
>>>>>>> bias-reducing effect). Because shallow trees typically have higher bias
>>>>>>> than fully grown trees, people don't often use shallow trees with RF.
>>>>>>>
>>>>>>> You can confirm this through some experiments with R's random forest
>>>>>>> implementation as well. It lets you set limits on depth and/or
>>>>>>> pruning.
>>>>>>>
>>>>>>> In contrast, boosting is a bias reduction technique (and increases
>>>>>>> variance), so people typically use shallow trees.
>>>>>>>
>>>>>>> Our empirical experiments also confirmed that shallow trees resulted
>>>>>>> in drastically lower accuracy for random forests.
>>>>>>>
>>>>>>> There are some papers that mix boosting-like techniques with
>>>>>>> bootstrap averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where
>>>>>>> you could potentially use shallow trees to build boosted learners and
>>>>>>> then average the results of many boosted learners.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <
>>>>>>> evan.spa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Multiclass classification, Gradient Boosting, and Random Forest
>>>>>>>> support, based on the recent Decision Tree implementation in MLlib.
>>>>>>>>
>>>>>>>> Sung - I'd be curious to hear about your use of decision trees (and
>>>>>>>> forests) where you want to go to 100+ depth. My experience with random
>>>>>>>> forests has been that people typically build hundreds of shallow trees
>>>>>>>> (maybe depth 7 or 8), rather than a few (or many) really deep trees.
>>>>>>>>
>>>>>>>> Generally speaking, we save passes over the data by computing
>>>>>>>> histograms per variable per split at each *level* of a decision tree.
>>>>>>>> This can blow up as the tree gets deep, so I'd recommend a lot more
>>>>>>>> memory than 2-4GB per worker for most big data workloads.
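>>>>>>>>
>>>>>>>> For illustration, a rough sketch of that level-wise pass (the binning
>>>>>>>> and node-assignment helpers are hypothetical, and real histograms
>>>>>>>> carry label statistics, not just counts):
>>>>>>>>
>>>>>>>>   import org.apache.spark.SparkContext._
>>>>>>>>   import org.apache.spark.rdd.RDD
>>>>>>>>   import org.apache.spark.mllib.regression.LabeledPoint
>>>>>>>>
>>>>>>>>   // One scan aggregates a (node, feature, bin) count for every
>>>>>>>>   // frontier node at the current level; the result grows as 2^level.
>>>>>>>>   def levelHistograms(
>>>>>>>>       data: RDD[LabeledPoint],
>>>>>>>>       nodeIdOf: LabeledPoint => Int,
>>>>>>>>       binOf: (LabeledPoint, Int) => Int, // (point, feature) -> bin id
>>>>>>>>       numFeatures: Int): scala.collection.Map[(Int, Int, Int), Long] =
>>>>>>>>     data.flatMap { p =>
>>>>>>>>       val node = nodeIdOf(p)
>>>>>>>>       (0 until numFeatures).map(f => ((node, f, binOf(p, f)), 1L))
>>>>>>>>     }.reduceByKey(_ + _).collectAsMap()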
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <
>>>>>>>> coded...@cs.stanford.edu> wrote:
>>>>>>>>
>>>>>>>>> Debasish, we've tested the MLLib decision tree a bit, and it eats
>>>>>>>>> up too much memory for RF purposes.
>>>>>>>>> Once the tree got to depth 8~9, it was easy to hit heap exceptions,
>>>>>>>>> even with 2~4 GB of memory per worker.
>>>>>>>>>
>>>>>>>>> With RF, it's very easy to reach 100+ depth with even only
>>>>>>>>> 100,000+ rows (because trees usually are not balanced). Additionally, the
>>>>>>>>> lack of multi-class classification limits its applicability.
>>>>>>>>>
>>>>>>>>> Also, RF requires random feature subsets per tree node to be effective
>>>>>>>>> (not just bootstrap samples), and the MLLib decision tree doesn't support
>>>>>>>>> that.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <
>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> MLlib has a decision tree....there is an RF PR which is not active
>>>>>>>>>> now....take that and swap its tree builder with the fast tree builder
>>>>>>>>>> that's in MLlib...search for the Spark JIRA...the code is based on the
>>>>>>>>>> Google PLANET paper...
>>>>>>>>>>
>>>>>>>>>> I am sure people on the dev list are already working on it...send an
>>>>>>>>>> email there to check on the status...
>>>>>>>>>>
>>>>>>>>>> There is also an RF in Cloudera Oryx, but we could not run it on
>>>>>>>>>> our data yet....
>>>>>>>>>>
>>>>>>>>>> Weka 3.7.10 has a multi-threaded RF that is good for some ad-hoc
>>>>>>>>>> runs, but it does not scale...
>>>>>>>>>>  On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> For one of my applications, I want to use Random Forests (RF) on
>>>>>>>>>>> top of Spark. I see that MLLib does not currently have an implementation
>>>>>>>>>>> of RF. What other open-source RF implementations would be good to use
>>>>>>>>>>> with Spark in terms of speed?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Laeeq Ahmed,
>>>>>>>>>>> KTH, Sweden.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
