You can actually look at the relevant part of the code base here:

https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/python/pyspark/rdd.py#L1825



The _memory_limit function returns the amount of memory you set with
spark.python.worker.memory (512m by default). That limit is used during
Python-side aggregation (groupByKey and similar operations): when a worker's
memory usage goes above it, the partially aggregated data is spilled to
disk, which is what the _spill() you found in shuffle.py does. That would
also explain the discontinuity you see, since run time drops once the limit
is high enough that nothing has to spill.
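For example, in a PySpark job you could raise the limit like this (just a
rough sketch; the 1g value, the input path, and the word-count pipeline are
placeholders for your own settings):

    from pyspark import SparkConf, SparkContext

    # Raise the per-Python-worker aggregation limit from the 512m default.
    # If memory use during aggregation exceeds this, PySpark spills the
    # partial results to local disk (the _spill() path in pyspark/shuffle.py).
    conf = (SparkConf()
            .setAppName("wordcount")
            .set("spark.python.worker.memory", "1g"))
    sc = SparkContext(conf=conf)

    counts = (sc.textFile("hdfs:///path/to/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .groupByKey()           # aggregates on the Python side
                .mapValues(sum))        # count occurrences per word
    print(counts.take(10))

groupByKey holds all of a key's values before combining, so it tends to hit
the spill threshold sooner than reduceByKey, which combines values as it
goes.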


Thanks
Best Regards

On Fri, Oct 23, 2015 at 11:46 PM, Connor Zanin <cnnr...@udel.edu> wrote:

> Hi all,
>
> I am running a simple word count job on a cluster of 4 nodes (24 cores per
> node). I am varying two parameters in the configuration,
> spark.python.worker.memory and the number of partitions in the RDD. My job
> is written in Python.
>
> I am observing a discontinuity in the run time of the job when the
> spark.python.worker.memory is increased past a threshold. Unfortunately, I
> am having trouble understanding exactly what this parameter is doing to
> Spark internally and how it changes Spark's behavior to create this
> discontinuity.
>
> The documentation describes this parameter as "Amount of memory to use
> per python worker process during aggregation," but I find this vague (or
> I do not know enough Spark terminology to know what it means).
>
> I have been pointed to the source code in the past, specifically the
> shuffle.py file where _spill() appears.
>
> Can anyone explain how this parameter behaves or point me to more
> descriptive documentation? Thanks!
>
> --
> Regards,
>
> Connor Zanin
> Computer Science
> University of Delaware
>
