That's exactly what I'm saying -- I specify the memory options via Spark options, but they are not reflected in how the JVM is created. No matter which memory settings I specify, the driver JVM is always launched with 512 MB of heap. So I'm not sure whether this is a feature or a bug?
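If it helps, here's a minimal sketch of what I think is happening (my
assumption, not a confirmed diagnosis): in yarn-client mode the driver
JVM is started by spark-submit before any of my Python code runs, so
heap settings made in SparkConf come too late, and spark.yarn.am.memory
only sizes the YARN application master container, not the local driver:

    from pyspark import SparkConf, SparkContext

    # Assumption: in yarn-client mode the driver JVM (the one started
    # with -Xms512m -Xmx512m) already exists when this runs, so a heap
    # size set here can no longer take effect.
    conf = (SparkConf()
            .set("spark.driver.memory", "5g")    # too late in client mode
            .set("spark.yarn.am.memory", "5g")   # sizes the YARN AM container,
            .set("spark.yarn.am.memoryOverhead", "2000"))  # not the driver JVM
    sc = SparkContext(conf=conf)

If that's the case, the memory has to be requested at launch, before the
JVM starts, along the lines Xiangrui suggested (script name hypothetical):

    spark-submit --master yarn-client --driver-memory 5g my_script.py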
rok

On Mon, Apr 27, 2015 at 6:54 PM, Xiangrui Meng <men...@gmail.com> wrote:
> You might need to specify driver memory in spark-submit instead of
> passing JVM options. spark-submit is designed to handle different
> deployments correctly. -Xiangrui
>
> On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar <rokros...@gmail.com> wrote:
> > ok yes, I think I have narrowed it down to being a problem with driver
> > memory settings. It looks like the application master/driver is not
> > being launched with the settings specified:
> >
> > For the driver process on the main node I see "-XX:MaxPermSize=128m
> > -Xms512m -Xmx512m" as options used to start the JVM, even though I
> > specified
> >
> > 'spark.yarn.am.memory', '5g'
> > 'spark.yarn.am.memoryOverhead', '2000'
> >
> > The info shows that these options were read:
> >
> > 15/04/23 13:47:47 INFO yarn.Client: Will allocate AM container, with
> > 7120 MB memory including 2000 MB overhead
> >
> > Is there some reason why these options are being ignored and the
> > driver is instead started with just 512 MB of heap?
> >
> > On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar <rokros...@gmail.com> wrote:
> >>
> >> the feature dimension is 800k.
> >>
> >> yes, I believe the driver memory is likely the problem since it
> >> doesn't crash until the very last part of the tree aggregation.
> >>
> >> I'm running it via pyspark through YARN -- I have to run in client
> >> mode so I can't set spark.driver.memory -- I've tried setting the
> >> spark.yarn.am.memory and overhead parameters but it doesn't seem to
> >> have an effect.
> >>
> >> Thanks,
> >>
> >> Rok
> >>
> >> On Apr 23, 2015, at 7:47 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> > What is the feature dimension? Did you set the driver memory?
> >> > -Xiangrui
> >> >
> >> > On Tue, Apr 21, 2015 at 6:59 AM, rok <rokros...@gmail.com> wrote:
> >> >> I'm trying to use the StandardScaler in pyspark on a relatively
> >> >> small (a few hundred MB) dataset of sparse vectors with 800k
> >> >> features. The fit method of StandardScaler crashes with Java heap
> >> >> space or Direct buffer memory errors. There should be plenty of
> >> >> memory around -- 10 executors with 2 cores each and 8 GB per core.
> >> >> I'm giving the executors 9g of memory and have also tried lots of
> >> >> overhead (3g), thinking it might be the array creation in the
> >> >> aggregators that's causing issues.
> >> >>
> >> >> The bizarre thing is that this isn't always reproducible --
> >> >> sometimes it actually works without problems. Should I be setting
> >> >> up executors differently?
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Rok
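P.S. for the archives, a minimal sketch of the call pattern discussed
above (the toy data is illustrative, not my actual dataset; assumes a
live SparkContext sc from the pyspark shell):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    # Toy stand-in for the real input: an RDD of sparse vectors with the
    # 800k feature dimension mentioned above (indices/values made up).
    data = sc.parallelize([
        Vectors.sparse(800000, {0: 1.0, 12345: 3.0}),
        Vectors.sparse(800000, {7: 2.0, 99999: 4.0}),
    ])

    # As I understand it, fit() combines per-feature statistics through
    # a tree aggregation whose final merge lands on the driver as dense
    # arrays of length numFeatures -- consistent with the crash showing
    # up only at the very end.
    scaler = StandardScaler(withMean=False, withStd=True).fit(data)
    scaled = scaler.transform(data)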