mcdull_zhang created SPARK-43911:
------------------------------------

             Summary: Directly use Set to consume iterator data to deduplicate, 
thereby reducing memory usage
                 Key: SPARK-43911
                 URL: https://issues.apache.org/jira/browse/SPARK-43911
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: mcdull_zhang


When SubqueryBroadcastExec reuses the keys of a broadcast HashedRelation for 
dynamic partition pruning, it first collects all of the keys into an Array and 
then calls distinct on that Array to remove the duplicates.

In general, a broadcast HashedRelation may contain many rows, and its keys are 
often highly repetitive. The intermediate Array therefore occupies a large 
amount of memory (memory that is not managed by MemoryManager), which may 
trigger an OOM. Consuming the key iterator directly into a Set, as the summary 
proposes, would keep only the distinct keys in memory.
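
To illustrate the difference described above, here is a minimal sketch of the 
two deduplication patterns. It is not the actual SubqueryBroadcastExec code: 
the object name, the method names, and the use of Long keys are all 
hypothetical, and a plain Iterator[Long] stands in for the key iterator of the 
broadcast HashedRelation.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the issue's scenario; none of these names are Spark APIs.
object DedupSketch {
  // Pattern described in the issue: materialize every key, then deduplicate.
  // The intermediate Array holds one entry per key, duplicates included.
  def viaArrayDistinct(keys: Iterator[Long]): Array[Long] =
    keys.toArray.distinct

  // Proposed pattern: feed the iterator straight into a Set.
  // Only the distinct keys are ever held at once.
  def viaSet(keys: Iterator[Long]): Array[Long] = {
    val seen = mutable.HashSet.empty[Long]
    keys.foreach(k => seen += k)
    seen.toArray
  }
}
```

Both methods return the same set of keys, but the Set variant does not 
preserve first-seen order; this sketch assumes that ordering does not matter 
to the caller.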



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
