Sorry, that was incomplete information. I think Spark's compression helped (not sure how much, though), so the actual memory requirement may have been smaller.
On Fri, Apr 18, 2014 at 3:16 PM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:

> I would argue that memory in clusters is still a limited resource, and it's
> still beneficial to use memory as economically as possible. Let's say that
> you are training a gradient boosted model in Spark, which could conceivably
> take several hours to build hundreds to thousands of trees. You do not want
> to be occupying a significant portion of the cluster memory such that
> nobody else can run anything of significance.
>
> We have a dataset that's only ~10GB CSV in the file system, but once we
> cached the whole thing in Spark, it ballooned to 64 GB or so in memory, and
> so we had to use a lot more workers with memory just so that we could cache
> the whole thing - this was because, although all the features were
> byte-sized, MLlib defaults to Double.
>
> On Fri, Apr 18, 2014 at 1:39 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> I don't think the YARN default of a max 8GB container size is a good
>> justification for limiting memory per worker. This is a somewhat arbitrary
>> number that came from an era when MapReduce was the main YARN application
>> and machines generally had less memory. I expect to see it configured
>> much higher in practice on most clusters running Spark.
>>
>> YARN integration is actually complete in CDH5.0. We support it as well
>> as standalone mode.
>>
>> On Fri, Apr 18, 2014 at 11:49 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> On Fri, Apr 18, 2014 at 7:31 PM, Sung Hwan Chung
>>> <coded...@cs.stanford.edu> wrote:
>>> > Debasish,
>>> >
>>> > Unfortunately, we are bound to YARN, at least for the time being,
>>> > because that's what most of our customers would be using (unless all
>>> > the Hadoop vendors start supporting standalone Spark - I think
>>> > Cloudera might do that?).
>>>
>>> Yes, the CDH5.0.0 distro just runs Spark in stand-alone mode. Using the
>>> YARN integration is still being worked on.
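A side note on the 10 GB → ~64 GB blow-up discussed above: if byte-sized features are widened to 8-byte doubles, the raw payload alone grows 8x before any JVM object overhead is counted. A minimal Python sketch of just the per-element factor (illustrative only, not Spark/MLlib code; the element count `n` is a made-up example):

```python
import array

n = 1_000_000  # hypothetical number of feature values

# Byte-sized features stored compactly: 1 byte per element.
as_bytes = array.array('b', bytes(n))

# The same features widened to doubles: 8 bytes per element.
as_doubles = array.array('d', [0.0] * n)

bytes_size = as_bytes.itemsize * len(as_bytes)        # 1,000,000 bytes
doubles_size = as_doubles.itemsize * len(as_doubles)  # 8,000,000 bytes
print(doubles_size / bytes_size)  # 8.0 -- before any JVM/object overhead
```

In Spark itself, serialized storage levels (e.g. `MEMORY_ONLY_SER`) and Kryo serialization can shrink the cached footprint at some CPU cost, though they don't change the underlying Double representation in MLlib.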