Re: StandardScaler failing with OOM errors in PySpark
AFAIK, there are two places where you can specify the driver memory: one is via spark-submit --driver-memory, and the other is via spark.driver.memory in spark-defaults.conf. Please try these approaches and see whether they work. You can find detailed instructions at http://spark.apache.org/docs/latest/configuration.html and http://spark.apache.org/docs/latest/submitting-applications.html. -Xiangrui
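For example, a minimal sketch of the two approaches (the application file name and memory size here are just placeholders):

    # 1) on the command line when launching the application
    spark-submit --driver-memory 8g my_app.py

    # 2) as a line in conf/spark-defaults.conf
    spark.driver.memory  8g

Both are read before the driver JVM is launched, which is why they work where setting the value from inside an already-running driver cannot.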
Re: StandardScaler failing with OOM errors in PySpark
That's exactly what I'm saying -- I specify the memory options using Spark options, but this is not reflected in how the JVM is created. No matter which memory settings I specify, the driver JVM is always started with 512 MB of heap. So I'm not sure whether this is a feature or a bug.

rok
Re: StandardScaler failing with OOM errors in PySpark
You might need to specify the driver memory in spark-submit instead of passing JVM options; spark-submit is designed to handle the different deployment modes correctly. -Xiangrui
Re: StandardScaler failing with OOM errors in PySpark
ok yes, I think I have narrowed it down to being a problem with the driver memory settings. It looks like the application master/driver is not being launched with the settings I specified.

For the driver process on the main node I see "-XX:MaxPermSize=128m -Xms512m -Xmx512m" among the options used to start the JVM, even though I specified

    'spark.yarn.am.memory', '5g'
    'spark.yarn.am.memoryOverhead', '2000'

The log shows that these options were read:

    15/04/23 13:47:47 INFO yarn.Client: Will allocate AM container, with 7120 MB memory including 2000 MB overhead

Is there some reason why these options are being ignored, so that the driver is started with just 512 MB of heap?
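For reference, a minimal sketch of how settings like these are typically applied from PySpark (the thread doesn't show the actual script, so this assumes they were set on the SparkConf before creating the context):

    from pyspark import SparkConf, SparkContext

    # the two YARN application-master settings quoted above
    conf = (SparkConf()
            .set('spark.yarn.am.memory', '5g')
            .set('spark.yarn.am.memoryOverhead', '2000'))
    sc = SparkContext(conf=conf)

Note that in yarn-client mode these keys size the YARN application master container, not the local driver JVM -- which would explain an AM container of 7120 MB alongside a driver heap still at the 512 MB default.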
Re: StandardScaler failing with OOM errors in PySpark
the feature dimension is 800k.

yes, I believe the driver memory is likely the problem, since it doesn't crash until the very last part of the tree aggregation.

I'm running it via pyspark through YARN -- I have to run in client mode, so I can't set spark.driver.memory from within the application -- I've tried setting the spark.yarn.am.memory and overhead parameters, but they don't seem to have an effect.

Thanks,

Rok
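Since in client mode the driver JVM is already running by the time the application's SparkConf is read, the driver heap has to be set at launch time instead. A sketch of what that could look like (the master string and memory size are placeholders):

    pyspark --master yarn-client --driver-memory 8g

or, equivalently, a spark.driver.memory line in spark-defaults.conf.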
Re: StandardScaler failing with OOM errors in PySpark
What is the feature dimension? Did you set the driver memory? -Xiangrui
StandardScaler failing with OOM errors in PySpark
I'm trying to use the StandardScaler in pyspark on a relatively small (a few hundred MB) dataset of sparse vectors with 800k features. The fit method of StandardScaler crashes with Java heap space or Direct buffer memory errors. There should be plenty of memory around -- 10 executors with 2 cores each and 8 GB per core. I'm giving the executors 9g of memory and have also tried a large overhead (3g), thinking it might be the array creation in the aggregators that's causing issues.

The bizarre thing is that this isn't always reproducible -- sometimes it actually works without problems. Should I be setting up executors differently?

Thanks,

Rok
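For context, a minimal sketch of the usage described above (the data is a toy stand-in, and sc is assumed to be an existing SparkContext, e.g. from the pyspark shell):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    # stand-in for the real data: sparse vectors with 800k features
    data = sc.parallelize([
        Vectors.sparse(800000, {0: 1.0, 12345: 2.0}),
        Vectors.sparse(800000, {7: 3.0, 99999: 4.0}),
    ])

    # fit() aggregates per-column statistics across the cluster; the
    # summaries are dense in the 800k feature dimension, so this final
    # aggregation step is where memory pressure tends to appear
    scaler = StandardScaler(withMean=False, withStd=True).fit(data)
    scaled = scaler.transform(data)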