Hi,

As the documentation says:

    spark.python.worker.memory: Amount of memory to use per python worker
    process during aggregation, in the same format as JVM memory strings
    (e.g. 512m, 2g). If the memory used during aggregation goes above this
    amount, it will spill the data into disks.
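For concreteness, here is a minimal sketch of the setup I am asking about (the app name, memory value, and data sizes are just placeholders I picked for illustration):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("python-worker-memory-test")     # placeholder name
            .set("spark.python.worker.memory", "512m"))  # the option in question
    sc = SparkContext(conf=conf)

    # A Python-side aggregation: groupByKey runs in the Python worker and,
    # as I understand it, goes through ExternalMerger, which spills to disk
    # once memory usage passes spark.python.worker.memory.
    pairs = sc.parallelize(range(1000000)).map(lambda i: (i % 100, i))
    counts = pairs.groupByKey().mapValues(len).collect()
    print(counts[:5])
    sc.stop()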
I searched for this config in the Spark source code, and only rdd.py uses it. That suggests the option only takes effect in Python-side operations such as rdd.groupByKey() or rdd.sortByKey(): the Python ExternalSorter / ExternalMerger spill data to disk once memory usage reaches the spark.python.worker.memory limit, as in the sketch above.

My questions: when PythonRunner forks a Python worker subprocess, what is the memory limit for each Python worker? Does spark.python.worker.memory actually limit the memory of a Python worker process, or does it only control when aggregation spills to disk?

--
Best & Regards
Cyanny LIANG
email: lgrcya...@gmail.com