I'm trying to use StandardScaler in PySpark on a relatively small (a few
hundred MB) dataset of sparse vectors with 800k features. The fit method of
StandardScaler crashes with Java heap space or Direct buffer memory errors.
There should be plenty of memory available: 10 executors with 2 cores each
and 8 GB per core. I'm giving the executors 9g of memory and have also tried
a large memory overhead (3g), thinking it might be the array creation in the
aggregators that's causing the problem.
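
For concreteness, the fit call is essentially the following (a simplified
sketch rather than my actual job: the parser, input path, and variable names
are placeholders, the scaler flags shown are just the defaults, and `sc` is
the pyspark shell's SparkContext):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import SparseVector

    NUM_FEATURES = 800000

    def to_sparse(line):
        # Hypothetical parser: one vector per line, tokens like "index:value".
        pairs = sorted((int(i), float(v)) for i, v in
                       (tok.split(":") for tok in line.split()))
        return SparseVector(NUM_FEATURES, pairs)

    # A few hundred MB of input, parsed into sparse vectors with 800k features.
    vectors = sc.textFile("hdfs:///path/to/data").map(to_sparse).cache()

    scaler = StandardScaler(withMean=False, withStd=True)
    model = scaler.fit(vectors)   # this is the call that dies with heap / Direct buffer errors
    scaled = model.transform(vectors)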

The bizarre thing is that this isn't always reproducible; sometimes it
actually works without problems. Should I be setting up the executors
differently?
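
In case it matters, the executor setup described above amounts to roughly
the following (in reality I pass these via spark-submit; the
memoryOverhead key assumes a YARN deployment, and the app name is just a
placeholder):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("scaler-test")                           # placeholder name
            .set("spark.executor.instances", "10")               # 10 executors
            .set("spark.executor.cores", "2")                    # 2 cores each
            .set("spark.executor.memory", "9g")                  # 9g heap per executor
            .set("spark.yarn.executor.memoryOverhead", "3072"))  # ~3g overhead, in MB
    sc = SparkContext(conf=conf)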

Thanks,

Rok




