Debasish,

Unfortunately, we are bound to YARN, at least for the time being, because that's what most of our customers will be using (unless all the Hadoop vendors start supporting standalone Spark - I think Cloudera might do that?).
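The per-container memory constraints discussed in this thread can be sanity-checked with a quick back-of-the-envelope sketch. This is an illustration only, not Spark code: the 8 GB cap and the overhead formula (a minimum of 384 MB or a small fraction of the heap, as in Spark-on-YARN deployments of this era) are assumptions - check your cluster's yarn-site.xml for the actual values.

```python
# Back-of-the-envelope check: does a requested executor heap fit under a
# YARN maximum-allocation cap? YARN sizes the container as executor heap
# plus an off-heap overhead. The 8192 MB cap, 7% overhead fraction, and
# 384 MB floor are illustrative assumptions, not authoritative defaults.

def fits_in_yarn_container(executor_mem_mb, yarn_max_allocation_mb=8192,
                           overhead_fraction=0.07, min_overhead_mb=384):
    overhead_mb = max(min_overhead_mb, int(executor_mem_mb * overhead_fraction))
    container_mb = executor_mem_mb + overhead_mb
    return container_mb <= yarn_max_allocation_mb

print(fits_in_yarn_container(7 * 1024))  # 7 GB heap + overhead fits: True
print(fits_in_yarn_container(8 * 1024))  # 8 GB heap + overhead exceeds the cap: False
```

Under these assumptions, asking for the full 8 GB as executor heap fails, because the container must also hold the off-heap overhead - which is one reason the "strict memory requirement per container" bites.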
On Fri, Apr 18, 2014 at 11:12 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Spark on YARN is a big pain due to the strict memory requirement per container...

If you are stress testing it, could you use a standalone cluster and see at which feature-count upper bound the per-worker RAM requirement reaches 16 GB or more? It is possible to get 16 GB instances on EC2 these days without much trouble...

Another way is to run a feature selection algorithm to shrink the feature space before running decision trees or their variants... There is a PR on entropy-based feature selection algorithms... you don't want to use them to decrease features, right?

A feature extraction algorithm like matrix factorization and its variants could be used to shrink the feature space as well...

On Fri, Apr 18, 2014 at 10:53 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Thanks for the info on the memory requirement.

I think that a lot of businesses would prefer to use Spark on top of YARN, since that's what they invest in - a large Hadoop cluster. And the default setting for YARN seems to cap memory per container at 8 GB - so ideally, we would like to use a lot less than that (rather than telling them, "no, change your YARN settings").

A convenient feature would be to automatically figure things out and adapt the algorithm to memory limits (e.g., process X nodes at a time, instead of all the nodes at the same level). Additionally, we noticed that the default 'Double' usage for LabeledPoint is very wasteful for a majority of data sets. Float would do most of the time, and in fact a lot of datasets could get away with using Short or even Byte. Or in your case, since you're transforming data to bins anyway, you could probably cache bin IDs (for which you could use Short or even Byte)?

On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Interesting, and thanks for the thoughts.

I think we're on the same page with 100s of millions of records. We've tested the tree implementation in MLlib on 1b rows and up to 100 features - though this isn't hitting the 1000s of features you mention.

Obviously multi-class support isn't there yet, but I can see your point about deeper trees for many-class problems. Will try them out on some image processing stuff with 1k classes we're doing in the lab once they are more developed, to get a sense for where the issues are.

If you're only allocating 2GB/worker, you're going to have a hard time getting the real advantages of Spark.

For your 1k features causing heap exceptions at depth 5 - are these categorical or continuous? The categorical vars create much smaller histograms.

If you're fitting all continuous features, the memory requirement is O(b*d*2^l), where b = number of histogram bins, d = number of features, and l = level of the tree. Even accounting for object overhead, with the default number of bins, the histograms at this depth should be on the order of 10s of MB, not 2GB - so I'm guessing your cached data is occupying a significant chunk of that 2GB? In the tree PR, Hirakendu Das tested down to depth 10 on 500m data points with 20 continuous features and was able to run without hitting memory issues (and the scaling properties got better as the depth grew). His worker memory was 7.5GB, and 30% of that was reserved for caching. If you wanted to go to 1000 features at depth 10, I'd estimate a couple of gigs of heap space would be necessary for the worker to compute/store the histograms, and I'd guess 2x that on the master to do the reduce.

Again, 2GB per worker is pretty tight, because there are overheads of just starting the JVM, launching a worker, loading libraries, etc.
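Evan's O(b*d*2^l) estimate can be instantiated numerically. The sketch below assumes 32 histogram bins and 8 bytes per bin statistic; both numbers, and the one-statistic-per-bin simplification, are assumptions for illustration (the real implementation keeps several statistics per bin, plus JVM object overhead).

```python
# Rough instantiation of the O(b * d * 2^l) histogram-memory estimate
# for fitting continuous features level-by-level.
# Assumptions: 32 bins, one 8-byte statistic per bin.

def histogram_bytes(bins=32, features=1000, level=5, bytes_per_bin=8):
    nodes_at_level = 2 ** level  # frontier of a balanced binary tree
    return bins * features * nodes_at_level * bytes_per_bin

# Depth 5, 1000 continuous features: ~8 MB of raw bin statistics --
# "10s of MB, not 2GB" even after accounting for object overhead.
print(histogram_bytes(level=5) / 1e6)   # 8.192 (MB)

# Depth 10: ~262 MB of raw statistics; with several statistics per bin
# plus JVM overhead, "a couple of gigs" of heap is plausible.
print(histogram_bytes(level=10) / 1e6)  # 262.144 (MB)
```

The 2^l factor is the key scaling term: each additional level doubles the histogram footprint, which is why deep trees blow up memory under this level-at-a-time strategy.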
- Evan

On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Yes, it should be data-specific, and perhaps we're biased toward the data sets that we are playing with. To put things in perspective, we're highly interested in (and I believe our customers are):

1. large data (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target categories are common and even thousands in some cases - you could imagine that this is a big reason for us requiring more 'complex' models
3. high-dimensional data with thousands of descriptive and sort-of-independent features

From the theoretical perspective, I would argue that it's usually in our best interest to prune as little as possible. I believe that pruning inherently increases the bias of an individual tree, which RF can't do anything about, while decreasing variance - and decreasing variance is what RF is for.

The default pruning criterion for R's reference implementation is a minimum node size of 1 (meaning a fully-grown tree) for classification, and 5 for regression. I'd imagine they did at least some empirical testing to justify these values at the time - although at a time of small datasets :).

FYI, we are also considering the MLlib decision tree for our Gradient Boosting implementation; however, the memory requirement is still a bit too steep (we were getting heap exceptions at a depth limit of 5 with 2GB per worker and approximately 1000 features). Now, 2GB per worker is about what we expect our typical customers would tolerate, and I don't think that's unreasonable for shallow trees.

On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

What kind of data are you training on?
These effects are *highly* data dependent, and while saying "the depth of 10 is simply not adequate to build high-accuracy models" may be accurate for the particular problem you're modeling, it is not true in general. From a statistical perspective, I consider each node in each tree an additional degree of freedom for the model, and all else equal I'd expect a model with fewer degrees of freedom to generalize better. Regardless, if there are lots of use cases for really deep trees, we'd like to hear about them so that we can decide how important they are to support!

In the context of CART, pruning refers very specifically to a step *after* a tree has been constructed to some depth, using cross-validation. This was a variance reduction technique in the original tree work that is unnecessary and computationally expensive in the context of forests. In the original Random Forests paper, there are still stopping criteria - usually either minimum leaf size or minimum split improvement (or both) - so "training to maximum depth" doesn't mean "train until you've completely divided your dataset and there's one point per leaf." My point is that if you set the minimum leaf size to something like 0.2% of the dataset, then you're not going to get deeper than 10 or 12 levels with a reasonably balanced tree.

With respect to PLANET - our implementation is very much in the spirit of PLANET, but has some key differences. Good documentation on exactly what the differences are is forthcoming, so I won't belabor them here. The differences are designed to 1) avoid data shuffling, and 2) minimize the number of passes over the training data. Of course, there are tradeoffs involved, and there is at least one really good trick in the PLANET work that we should leverage but aren't yet - namely, once the nodes get small enough for the data to fit easily on a single machine, the data can be shuffled and then the remainder of the tree can be trained in parallel from each lower node on a single machine. This would actually help with the memory overheads in model training when trees get deep - if someone wants to modify the current implementation of trees in MLlib and contribute this optimization as a pull request, it would be welcome!

At any rate, we'll take this feedback into account with respect to improving the tree implementation, but if anyone can send over use cases or (even better) datasets where really deep trees are necessary, that would be great!

On Thu, Apr 17, 2014 at 1:43 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Well, if you read the original paper, http://oz.berkeley.edu/~breiman/randomforest2001.pdf: "Grow the tree using CART methodology to maximum size and do not prune."

Now, The Elements of Statistical Learning (page 598) says that you could potentially overfit a fully-grown regression random forest. However, this effect is very slight, and likely negligible for classification. http://www.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf

In our experiments, however, if the pruning is "drastic", then the performance actually becomes much worse. This makes intuitive sense IMO, because a decision tree is a non-parametric model, and the expressibility of a tree depends on the number of nodes.

With a huge amount of data (millions or even billions of rows), we found that a depth of 10 is simply not adequate to build high-accuracy models.
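Evan's claim above - that a minimum leaf size of 0.2% bounds a reasonably balanced tree to roughly 10-12 levels - is just logarithm arithmetic, sketched below. The balanced-halving assumption is an idealization; real trees split unevenly and so can go somewhat deeper, which is why the practical bound is a bit looser than the computed value.

```python
import math

# A balanced binary tree roughly halves the data at each split, so a node
# at depth k covers about n / 2^k points. Growth stops once that falls
# below the minimum leaf size, giving depth <= log2(1 / leaf_fraction).

def max_balanced_depth(min_leaf_fraction):
    return math.floor(math.log2(1.0 / min_leaf_fraction))

print(max_balanced_depth(0.002))  # min leaf = 0.2% of the data -> 8
```

With imbalance, depth can exceed this idealized bound, but not by much for a minimum leaf size this large - consistent with Evan's "10 or 12 levels", and with Sung's observation that unbalanced trees reach far greater depths only when the leaf-size floor is tiny.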
On Thu, Apr 17, 2014 at 12:30 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Hmm... can you provide some pointers to examples where deep trees are helpful?

Typically with decision trees you limit depth (either directly, or indirectly with minimum node size and minimum improvement criteria) to avoid overfitting. I agree with the assessment that forests are a variance reduction technique, but I'd be a little surprised if a bunch of hugely deep trees don't overfit to training data. I guess I view limiting tree depth as an analogue to regularization in linear models.

On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Evan,

I actually haven't heard of 'shallow' random forests. I think the only scenarios where shallow trees are useful are boosting scenarios.

AFAIK, Random Forest is a variance reducing technique and doesn't do much about bias (although some people claim that it does have some bias reducing effect). Because shallow trees typically have higher bias than fully-grown trees, people don't often use shallow trees with RF.

You can confirm this through some experiments with R's random forest implementation as well. It allows you to set some limits on depth and/or pruning.

In contrast, boosting is a bias reduction technique (and increases variance), so people typically use shallow trees.

Our empirical experiments also confirmed that shallow trees resulted in drastically lower accuracy for random forests.

There are some papers that mix a boosting-like technique with bootstrap averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf), where you could potentially use shallow trees to build boosted learners, but then average the results of many boosted learners.

On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

Multi-class classification, Gradient Boosting, and Random Forest support based on the recent Decision Tree implementation in MLlib.

Sung - I'd be curious to hear about your use of decision trees (and forests) where you want to go to 100+ depth. My experience with random forests has been that people typically build hundreds of shallow trees (maybe depth 7 or 8), rather than a few (or many) really deep trees.

Generally speaking, we save passes over the data by computing histograms per variable per split at each *level* of a decision tree. This can blow up as the level of the decision tree gets deep, but I'd recommend a lot more memory than 2-4GB per worker for most big data workloads.

On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

Debasish, we've tested the MLlib decision tree a bit and it eats up too much memory for RF purposes. Once the tree got to depth 8~9, it was easy to get a heap exception, even with 2~4 GB of memory per worker.

With RF, it's very easy to get 100+ depth with even only 100,000+ rows (because trees usually are not balanced). Additionally, the lack of multi-class classification limits its applicability.

Also, RF requires random features per tree node to be effective (not just bootstrap samples), and the MLlib decision tree doesn't support that.
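Sung's last point - that RF needs per-node feature subsampling on top of bootstrapped rows - can be made concrete with a minimal sketch. The function names and the sqrt(d) subset size are illustrative conventions, not MLlib's API.

```python
import random

# Minimal sketch of Random Forest's two sources of randomness.
# Function names are illustrative, not MLlib's API.

def bootstrap_sample(n_rows, rng):
    """Sample row indices WITH replacement -- drawn once per tree.
    This alone gives bagged trees, not a Random Forest."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng):
    """Sample ~sqrt(d) feature indices WITHOUT replacement -- redrawn at
    EVERY tree node. This per-node decorrelation is the ingredient that
    plain bagging lacks."""
    k = max(1, int(n_features ** 0.5))
    return rng.sample(range(n_features), k)

rng = random.Random(42)
rows = bootstrap_sample(100_000, rng)   # per tree
feats = candidate_features(1000, rng)   # per node: 31 of 1000 features
print(len(rows), len(feats))            # 100000 31
```

Because only the feature subset changes from node to node, supporting this in a shared tree builder mostly means restricting the per-level histogram computation to each node's candidate features, rather than changing how the data is partitioned.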
On Thu, Apr 17, 2014 at 10:27 AM, Debasish Das <debasish.da...@gmail.com> wrote:

MLlib has a decision tree... there is an RF PR which is not active now... take that and swap the tree builder with the fast tree builder that's in MLlib... search for the Spark JIRA... the code is based on the Google PLANET paper...

I am sure people on the dev list are already working on it... send an email to find out the status over there...

There is also an RF in Cloudera Oryx, but we could not run it on our data yet...

Weka 3.7.10 has a multi-threaded RF that is good for some ad-hoc runs, but it does not scale...

On Apr 17, 2014 2:45 AM, "Laeeq Ahmed" <laeeqsp...@yahoo.com> wrote:

Hi,

For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation for RF. What other open-source RF implementations would be good to use with Spark in terms of speed?

Regards,
Laeeq Ahmed,
KTH, Sweden.