Gengliang Wang created SPARK-30964:
--------------------------------------

             Summary: Accelerate InMemoryStore with a new index
                 Key: SPARK-30964
                 URL: https://issues.apache.org/jira/browse/SPARK-30964
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, Web UI
    Affects Versions: 3.1.0
            Reporter: Gengliang Wang
            Assignee: Gengliang Wang


Spark uses the class `InMemoryStore` as the KV storage for the live UI and the 
history server (by default, if no LevelDB file path is provided).
In `InMemoryStore`, all the task data of one application is stored in a 
hashmap whose key is the task ID and whose value is the task data. This is fine 
for getting or deleting an entry by task ID.
However, the Spark stage UI always shows all the task data of one stage, and 
the current implementation finds it by scanning all the values in the hashmap. 
The time complexity is O(numOfTasks). 
Also, when there are too many stages (>spark.ui.retainedStages), Spark 
linearly scans all the task data again to find the tasks of the stages to be 
deleted.
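
To make the cost concrete, here is a minimal Python sketch of the access pattern described above. It is not Spark's code: `TaskData`, `get_task` and `tasks_of_stage` are hypothetical names, and the store is reduced to a plain dict keyed by task ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskData:
    # Hypothetical stand-in for the per-task record kept by the store.
    task_id: int
    stage_id: int


# All tasks of the application live in one map keyed by task ID.
tasks = {t.task_id: t for t in (TaskData(i, i % 3) for i in range(9))}


def get_task(task_id):
    # Lookup by task ID is a single hash access.
    return tasks[task_id]


def tasks_of_stage(stage_id):
    # Lookup by stage has no index to use, so it visits every task
    # in the application: O(numOfTasks) per call.
    return [t for t in tasks.values() if t.stage_id == stage_id]
```

Rendering one stage page or evicting one stage both go through the second path, so the work grows with the total number of tasks in the application rather than with the number of tasks in that stage.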

This can be very slow for a large application with many stages and tasks. We 
can improve it by giving the natural key of an entity a real parent index, so 
that on each lookup with a parent node provided, Spark can first look up all 
the natural keys (in our case, the task IDs) under that parent and then fetch 
the data for those natural keys from the hashmap.
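
The proposal can be sketched in Python as a second map from parent (stage ID) to the set of natural keys (task IDs). This is only an illustration of the idea under simplifying assumptions, not the actual `InMemoryStore` change: `IndexedTaskStore` and all of its methods are hypothetical names.

```python
from collections import defaultdict


class IndexedTaskStore:
    """Hypothetical store: a primary hashmap plus a parent index."""

    def __init__(self):
        self._by_id = {}                   # task ID -> (stage ID, payload)
        self._children = defaultdict(set)  # stage ID -> task IDs

    def put(self, task_id, stage_id, payload):
        old = self._by_id.get(task_id)
        if old is not None and old[0] != stage_id:
            # The task moved to another parent; drop the stale index entry.
            self._unlink(task_id, old[0])
        self._by_id[task_id] = (stage_id, payload)
        self._children[stage_id].add(task_id)

    def _unlink(self, task_id, stage_id):
        ids = self._children.get(stage_id)
        if ids is not None:
            ids.discard(task_id)
            if not ids:
                del self._children[stage_id]

    def get(self, task_id):
        return self._by_id[task_id][1]

    def delete(self, task_id):
        stage_id, _ = self._by_id.pop(task_id)
        self._unlink(task_id, stage_id)

    def tasks_of_stage(self, stage_id):
        # Read the natural keys from the parent index first, then fetch
        # each value from the primary hashmap. Sorting is only for a
        # stable order in this sketch.
        ids = sorted(self._children.get(stage_id, ()))
        return [self._by_id[i][1] for i in ids]

    def delete_stage(self, stage_id):
        # Evicting a stage touches only that stage's tasks.
        ids = self._children.pop(stage_id, set())
        for i in ids:
            del self._by_id[i]
        return len(ids)
```

With this layout, listing or evicting a stage costs time proportional to the number of tasks in that stage instead of the number of tasks in the application. The price is one extra set update on every write and delete.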





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

