Sorry, that was incomplete information. I think Spark's compression helped (not sure how much, though), so the actual memory requirement may have been smaller.
On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

> I would argue that memory in clusters is still a limited resource, and it's
> still beneficial to use memory as economically as possible. Let's say that
> you are training a gradient boosted model in Spark, which could conceivably
> take several hours to build hundreds to thousands of trees. You do not want
> to be occupying a significant portion of the cluster memory such that
> nobody else can run anything of significance.
>
> We have a dataset that's only ~10GB CSV in the file system, but once we
> cached the whole thing in Spark, it ballooned to 64 GB or so in memory, and
> so we had to use a lot more workers with memory just so that we could cache
> the whole thing - this was because, although all the features were
> byte-sized, MLlib defaults to Double.
>
> On Fri, Apr 18, 2014 at 1:39 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> I don't think the YARN default of a max 8GB container size is a good
>> justification for limiting memory per worker. This is a somewhat arbitrary
>> number that came from an era when MapReduce was the main YARN application
>> and machines generally had less memory. I expect to see it configured
>> much higher in practice on most clusters running Spark.
>>
>> YARN integration is actually complete in CDH5.0. We support it as well
>> as standalone mode.
>>
>> On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>>> <coded...@cs.stanford.edu> wrote:
>>> > Debasish,
>>> >
>>> > Unfortunately, we are bound to YARN, at least for the time being,
>>> > because that's what most of our customers would be using (unless all
>>> > the Hadoop vendors start supporting standalone Spark - I think
>>> > Cloudera might do that?).
>>>
>>> Yes, the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
>>> YARN integration is still being worked on.
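A side note on the 10 GB → ~64 GB blow-up discussed above: if byte-sized features are widened to 8-byte doubles, the raw payload alone grows 8x before any JVM object overhead is counted. A minimal Python sketch of just the per-element factor (illustrative only, not Spark/MLlib code; the element count `n` is a made-up example):

```python
import array

n = 1_000_000  # hypothetical number of feature values

# Byte-sized features stored compactly: 1 byte per element.
as_bytes = array.array('b', bytes(n))

# The same features widened to doubles: 8 bytes per element.
as_doubles = array.array('d', [0.0] * n)

bytes_size = as_bytes.itemsize * len(as_bytes)        # 1,000,000 bytes
doubles_size = as_doubles.itemsize * len(as_doubles)  # 8,000,000 bytes
print(doubles_size / bytes_size)  # 8.0 -- before any JVM/object overhead
```

In Spark itself, serialized storage levels (e.g. `MEMORY_ONLY_SER`) and Kryo serialization can shrink the cached footprint at some CPU cost, though they don't change the underlying Double representation in MLlib.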