Xinrong Meng created SPARK-40281:
------------------------------------

             Summary: Memory Profiler on Executors
                 Key: SPARK-40281
                 URL: https://issues.apache.org/jira/browse/SPARK-40281
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng


Profiling is critical to performance engineering. Memory consumption is a key 
indicator of how efficient a PySpark program is. There is an existing effort on 
memory profiling of Python programs, Memory Profiler 
(https://pypi.org/project/memory-profiler/).

PySpark applications run as independent sets of processes on a cluster, 
coordinated by the SparkContext object in the driver program. On the driver 
side, PySpark is a regular Python process, thus, we can profile it as a normal 
Python program using Memory Profiler.
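As a self-contained illustration of the kind of driver-side measurement described above, the sketch below uses the standard-library `tracemalloc` module as a stand-in for the third-party Memory Profiler package (which may not be installed). The workload function `build_rows` and the helper `profile_memory` are hypothetical names introduced here for the example, not part of Spark or Memory Profiler.

```python
# Sketch: measuring peak memory of a driver-side Python function.
# Uses stdlib tracemalloc as a stand-in for the Memory Profiler package.
import tracemalloc

def build_rows(n):
    # Hypothetical driver-side workload: materialize n small dicts.
    return [{"id": i, "val": i * 2} for i in range(n)]

def profile_memory(fn, *args):
    """Return (result, peak_traced_bytes) for a single function call."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

rows, peak = profile_memory(build_rows, 10_000)
```

Because the driver is an ordinary Python process, any such tool can attach to it directly; the difficulty this issue targets is on the executor side.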

However, on the executor side, such a memory profiler is missing. Since 
executors are distributed across different nodes in the cluster, we need to 
aggregate their profiles. Furthermore, Python worker processes are spawned per 
executor for Python/Pandas UDF execution, which makes the memory profiling 
more intricate.
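One conceivable shape for the aggregation step mentioned above is sketched below. It assumes each Python worker reports a dict mapping a (UDF id, line number) pair to the peak bytes observed, and that the driver merges reports by keeping the maximum per key. All names and the report format are hypothetical assumptions for illustration, not an actual PySpark API.

```python
# Sketch (hypothetical): merging per-worker memory reports on the driver.
# Each report maps (udf_id, line_no) -> peak bytes seen on that worker.
def merge_profiles(profiles):
    merged = {}
    for report in profiles:
        for key, peak in report.items():
            # Keep the worst (largest) peak observed across all workers.
            merged[key] = max(merged.get(key, 0), peak)
    return merged

# Two hypothetical worker reports for the same UDF.
worker_a = {("udf_1", 10): 2048, ("udf_1", 12): 512}
worker_b = {("udf_1", 10): 4096}
merged = merge_profiles([worker_a, worker_b])
```

A real implementation would also have to decide how reports travel from workers to the driver and how per-invocation peaks are combined, which is part of what this umbrella would need to design.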

This umbrella proposes implementing a Memory Profiler on Executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
