Attila Zsolt Piros created SPARK-33711:
------------------------------------------

             Summary:  Race condition in Spark k8s Pod lifecycle management 
that leads to shutdowns
                 Key: SPARK-33711
                 URL: https://issues.apache.org/jira/browse/SPARK-33711
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.1, 3.0.0, 2.4.7
            Reporter: Attila Zsolt Piros


Watching pods (ExecutorPodsWatchSnapshotSource) reports changes to a single pod 
at a time. The executor pod lifecycle manager can wrongly interpret such an 
update as missing pods (pods known by the scheduler backend but missing from 
the pod snapshots).

A key indicator of this is the following log message:

"The executor with ID [some_id] was not found in the cluster but we didn't get 
a reason why. Marking the executor as failed. The executor may have been 
deleted but the driver missed the deletion event."

So one problem is that missing-pod detection runs even when only a single pod 
has changed, without a full, consistent snapshot of all the pods (see 
ExecutorPodsPollingSnapshotSource). The other could be a race between the 
executor pod lifecycle manager and the scheduler backend.
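
To make the first problem concrete, here is a minimal Python sketch of the 
failure mode. It is a simplified model, not Spark's actual code: the Snapshot 
class, the find_missing function and the require_full flag are hypothetical 
names, and the sketch assumes that a watch event yields a snapshot containing 
only the changed pod while a polling pass yields one listing every pod.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a pod snapshot: executor ID -> pod phase."""
    pods: dict
    full: bool  # True only when the snapshot lists every executor pod


def find_missing(known_executors, snapshot, require_full):
    """Return executor IDs known to the scheduler but absent from the snapshot.

    With require_full=True, detection is skipped for partial snapshots
    (an assumed guard, shown only to illustrate the idea).
    """
    if require_full and not snapshot.full:
        return set()
    return set(known_executors) - set(snapshot.pods)


# Executors the (modelled) scheduler backend believes are alive.
known = {"1", "2", "3"}

# A watch event reports a change to a single pod only.
watch_update = Snapshot(pods={"2": "Running"}, full=False)

# A polling pass lists every pod.
poll_result = Snapshot(
    pods={"1": "Running", "2": "Running", "3": "Running"}, full=True
)

# Running detection on the partial update wrongly flags executors 1 and 3.
naive = find_missing(known, watch_update, require_full=False)

# Restricting detection to full snapshots avoids the false positives.
guarded = find_missing(known, watch_update, require_full=True)
polled = find_missing(known, poll_result, require_full=True)
```

In this model, naive contains executors 1 and 3 even though both pods are 
still running, while guarded and polled are empty. The sketch does not cover 
the second possible cause, the race between the lifecycle manager and the 
scheduler backend.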




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

