That's exactly what I'm saying -- I specify the memory options via Spark options, but they are not reflected in how the JVM is created. No matter which memory settings I specify, the driver JVM is always launched with 512 MB of heap. So I'm not sure whether this is a feature or a bug?
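If it helps, here's a minimal sketch of what I think is happening (my
assumption, not a confirmed diagnosis): in yarn-client mode the driver
JVM is started by spark-submit before any of my Python code runs, so
heap settings made in SparkConf come too late, and spark.yarn.am.memory
only sizes the YARN application master container, not the local driver:

    from pyspark import SparkConf, SparkContext

    # Assumption: in yarn-client mode the driver JVM (the one started
    # with -Xms512m -Xmx512m) already exists when this runs, so a heap
    # size set here can no longer take effect.
    conf = (SparkConf()
            .set("spark.driver.memory", "5g")    # too late in client mode
            .set("spark.yarn.am.memory", "5g")   # sizes the YARN AM container,
            .set("spark.yarn.am.memoryOverhead", "2000"))  # not the driver JVM
    sc = SparkContext(conf=conf)

If that's the case, the memory has to be requested at launch, before the
JVM starts, along the lines Xiangrui suggested (script name hypothetical):

    spark-submit --master yarn-client --driver-memory 5g my_script.py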
rok

On Mon, Apr 27, 2015 at 6:54 PM, Xiangrui Meng <men...@gmail.com> wrote:
> You might need to specify driver memory in spark-submit instead of
> passing JVM options. spark-submit is designed to handle different
> deployments correctly. -Xiangrui
>
> On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar <rokros...@gmail.com> wrote:
> > ok yes, I think I have narrowed it down to being a problem with driver
> > memory settings. It looks like the application master/driver is not
> > being launched with the settings specified:
> >
> > For the driver process on the main node I see "-XX:MaxPermSize=128m
> > -Xms512m -Xmx512m" as options used to start the JVM, even though I
> > specified
> >
> > 'spark.yarn.am.memory', '5g'
> > 'spark.yarn.am.memoryOverhead', '2000'
> >
> > The info shows that these options were read:
> >
> > 15/04/23 13:47:47 INFO yarn.Client: Will allocate AM container, with
> > 7120 MB memory including 2000 MB overhead
> >
> > Is there some reason why these options are being ignored and the
> > driver is instead started with just 512 MB of heap?
> >
> > On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar <rokros...@gmail.com> wrote:
> >>
> >> the feature dimension is 800k.
> >>
> >> yes, I believe the driver memory is likely the problem since it
> >> doesn't crash until the very last part of the tree aggregation.
> >>
> >> I'm running it via pyspark through YARN -- I have to run in client
> >> mode so I can't set spark.driver.memory -- I've tried setting the
> >> spark.yarn.am.memory and overhead parameters but it doesn't seem to
> >> have an effect.
> >>
> >> Thanks,
> >>
> >> Rok
> >>
> >> On Apr 23, 2015, at 7:47 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> > What is the feature dimension? Did you set the driver memory?
> >> > -Xiangrui
> >> >
> >> > On Tue, Apr 21, 2015 at 6:59 AM, rok <rokros...@gmail.com> wrote:
> >> >> I'm trying to use the StandardScaler in pyspark on a relatively
> >> >> small (a few hundred MB) dataset of sparse vectors with 800k
> >> >> features. The fit method of StandardScaler crashes with Java heap
> >> >> space or Direct buffer memory errors. There should be plenty of
> >> >> memory around -- 10 executors with 2 cores each and 8 GB per core.
> >> >> I'm giving the executors 9g of memory and have also tried lots of
> >> >> overhead (3g), thinking it might be the array creation in the
> >> >> aggregators that's causing issues.
> >> >>
> >> >> The bizarre thing is that this isn't always reproducible --
> >> >> sometimes it actually works without problems. Should I be setting
> >> >> up executors differently?
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Rok
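P.S. for the archives, a minimal sketch of the call pattern discussed
above (the toy data is illustrative, not my actual dataset; assumes a
live SparkContext sc from the pyspark shell):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    # Toy stand-in for the real input: an RDD of sparse vectors with the
    # 800k feature dimension mentioned above (indices/values made up).
    data = sc.parallelize([
        Vectors.sparse(800000, {0: 1.0, 12345: 3.0}),
        Vectors.sparse(800000, {7: 2.0, 99999: 4.0}),
    ])

    # As I understand it, fit() combines per-feature statistics through
    # a tree aggregation whose final merge lands on the driver as dense
    # arrays of length numFeatures -- consistent with the crash showing
    # up only at the very end.
    scaler = StandardScaler(withMean=False, withStd=True).fit(data)
    scaled = scaler.transform(data)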