Re: StandardScaler failing with OOM errors in PySpark

2015-05-17 Thread Xiangrui Meng
AFAIK, there are two places where you can specify the driver memory.
One is via spark-submit --driver-memory and the other is via
spark.driver.memory in spark-defaults.conf. Please try these
approaches and see whether they work. You can find detailed
instructions at http://spark.apache.org/docs/latest/configuration.html
and http://spark.apache.org/docs/latest/submitting-applications.html.
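
For example, either of the following should do it (the memory value and
application script are just placeholders here):

    # on the command line, when submitting in YARN client mode
    spark-submit --master yarn-client --driver-memory 8g your_app.py

    # or set it once in conf/spark-defaults.conf
    spark.driver.memory    8g
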
-Xiangrui




Re: StandardScaler failing with OOM errors in PySpark

2015-04-28 Thread Rok Roskar
That's exactly what I'm saying -- I specify the memory options using Spark
options, but this is not reflected in how the JVM is created. No matter
which memory settings I specify, the driver JVM is always launched with
512 MB of heap. So I'm not sure whether this is a feature or a bug?
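
(Concretely, the options are going in via SparkConf on the Python side,
roughly like this -- a trimmed sketch, not the exact code:)

    # sketch of applying the YARN AM memory settings via SparkConf
    # (not the exact application code)
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set('spark.yarn.am.memory', '5g')
            .set('spark.yarn.am.memoryOverhead', '2000'))
    sc = SparkContext(conf=conf)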

rok



Re: StandardScaler failing with OOM errors in PySpark

2015-04-27 Thread Xiangrui Meng
You might need to specify driver memory in spark-submit instead of
passing JVM options. spark-submit is designed to handle different
deployments correctly. -Xiangrui




Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
OK, yes -- I think I've narrowed it down to a problem with the driver
memory settings. It looks like the application master/driver is not being
launched with the settings I specified:

For the driver process on the main node I see "-XX:MaxPermSize=128m
-Xms512m -Xmx512m" as options used to start the JVM, even though I
specified

'spark.yarn.am.memory', '5g'
'spark.yarn.am.memoryOverhead', '2000'

The INFO log shows that these options were read:

15/04/23 13:47:47 INFO yarn.Client: Will allocate AM container, with 7120
MB memory including 2000 MB overhead

Is there some reason why these options are being ignored and the driver
started with just 512 MB of heap instead?
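
(A quick way to confirm what the driver JVM actually got, from inside the
PySpark shell -- sc._jvm is an internal py4j handle, so this is only a
diagnostic sketch:)

    # report the running driver JVM's maximum heap; sc._jvm goes through
    # py4j internals, so treat this as a quick diagnostic hack only
    heap_bytes = sc._jvm.java.lang.Runtime.getRuntime().maxMemory()
    print("driver max heap: %.0f MB" % (heap_bytes / (1024.0 * 1024.0)))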



Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Rok Roskar
the feature dimension is 800k.

yes, I believe the driver memory is likely the problem since it doesn't crash 
until the very last part of the tree aggregation. 
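
(Roughly, the shape of the computation is the sketch below -- not the actual
MLlib code, just to illustrate why the last step lands on the driver: the
per-partition summaries are merged in a tree, and the final merged arrays,
one entry per feature, end up in the driver process.)

    # toy sketch of a tree aggregation over 800k-dimensional vectors (not the
    # real StandardScaler code); the final combine of the partial results is
    # done on the driver, which is why the driver heap matters here
    import numpy as np
    from pyspark.mllib.linalg import Vectors

    n_features = 800000
    vectors = sc.parallelize([
        Vectors.sparse(n_features, [0, 17], [1.0, 2.0]),
        Vectors.sparse(n_features, [5, 42], [3.0, 4.0]),
    ])
    totals = vectors.treeAggregate(
        np.zeros(n_features),
        lambda acc, v: acc + v.toArray(),   # fold one vector into the summary
        lambda a, b: a + b,                 # merge two partial summaries
        depth=2)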

I'm running it via PySpark on YARN -- I have to run in client mode, so I
can't set spark.driver.memory -- I've tried setting the spark.yarn.am.memory
and overhead parameters, but they don't seem to have any effect.

Thanks,

Rok




Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Xiangrui Meng
What is the feature dimension? Did you set the driver memory? -Xiangrui




StandardScaler failing with OOM errors in PySpark

2015-04-21 Thread rok
I'm trying to use the StandardScaler in PySpark on a relatively small (a few
hundred MB) dataset of sparse vectors with 800k features. The fit method of
StandardScaler crashes with Java heap space or Direct buffer memory errors.
There should be plenty of memory around -- 10 executors with 2 cores each
and 8 GB per core. I'm giving the executors 9g of memory and have also tried
lots of overhead (3g), thinking it might be the array creation in the
aggregators that's causing issues.
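
For reference, the call in question is the plain StandardScaler fit --
roughly like the sketch below (toy data standing in for the real input):

    # minimal sketch of the failing pattern; toy vectors stand in for the
    # real few-hundred-MB dataset of SparseVectors with 800k features
    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    n_features = 800000
    vectors = sc.parallelize([
        Vectors.sparse(n_features, [0, 17, 42], [1.0, 3.0, 5.0]),
        Vectors.sparse(n_features, [5, 100, 9999], [2.0, 4.0, 6.0]),
    ])

    scaler = StandardScaler(withMean=False, withStd=True)
    model = scaler.fit(vectors)            # this is the step that OOMs
    scaled = model.transform(vectors)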

The bizarre thing is that this isn't always reproducible -- sometimes it
actually works without problems. Should I be setting up executors
differently? 

Thanks,

Rok




