When you do groupBy(), Spark tries to hold all the data for each partition in
memory for best performance, so you should choose the number of partitions
carefully (more partitions means less data per partition).
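
For example, a minimal sketch of passing an explicit numPartitions to
groupByKey() (the RDD contents and the count of 200 are made up for
illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="groupby-partitions-sketch")

    # One million (key, value) pairs spread over 10,000 keys.
    pairs = sc.parallelize([(i % 10000, i) for i in range(1000000)])

    # The default partition count can be as low as the number of local
    # cores (e.g. 4); asking for more partitions keeps each partition's
    # share of the grouped data small enough to fit in memory.
    grouped = pairs.groupByKey(numPartitions=200)

    # Count values per key without materializing them as lists.
    print(grouped.mapValues(lambda vs: sum(1 for _ in vs)).take(5))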

In Spark master and the upcoming 1.1 release, PySpark can do an external
groupBy(): it spills the data to disk when there is not enough memory to hold
it all. That should also help in this case.
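
If you want to try the external groupBy(), the configuration is roughly the
sketch below; the setting names (spark.python.worker.memory as the spill
threshold for the Python worker, spark.shuffle.spill to allow spilling) are
from memory, so please double-check them against the 1.1 docs:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("external-groupby-sketch")
            # Allow spilling shuffle/aggregation data to disk.
            .set("spark.shuffle.spill", "true")
            # Soft memory limit per Python worker before it spills.
            .set("spark.python.worker.memory", "512m"))

    sc = SparkContext(conf=conf)

    # groupByKey() now spills to disk instead of failing when the grouped
    # data does not fit in memory.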

On Fri, Jul 18, 2014 at 1:56 AM, Roch Denis <rde...@exostatic.com> wrote:
> Well, for what it's worth, I found the issue after spending the whole night
> running experiments;).
>
> Basically, I needed to give a higher number of partitions to groupByKey.
> I was simply using the default, which generated only 4 partitions, and so
> the whole thing blew up.
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Last-step-of-processing-is-using-too-much-memory-tp10134p10147.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
