Weiwei Yang created YUNIKORN-176:
------------------------------------

Summary: schedulerCache might become inconsistent sometimes depending on the ordering of the events
Key: YUNIKORN-176
URL: https://issues.apache.org/jira/browse/YUNIKORN-176
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Weiwei Yang
Assignee: Weiwei Yang

When working with the auto-scaler, we sometimes found newly added nodes stuck in a pending state because some daemon set pods on them could not be scheduled. The root cause is:

# the auto-scaler scales up a node
# the daemon set controller creates a pod for e.g. fluentd (it sets pod.spec.nodeName="newly-added-host")
# YK is notified by the pod informer: add pod
# the pod is added to the cache (schedulerCache); because {{pod.spec.nodeName}} is not nil, the cache adds a new {{nodeInfo}} for that node
# the node informer is notified: add node
# the node is added to the scheduler cache; because an entry for the node already exists, the call to SetNode is skipped
# the scheduler tries to allocate the pod to the node
# predicates fail: NodeUnknownCondition (node x doesn't exist in schedulerCache)
# the allocation always fails and the pod stays pending
# since the daemon set pod cannot be started, the node status remains NotReady
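
The event ordering above can be reproduced with a small toy model. This is only a sketch under stated assumptions: the types {{cache}}, {{nodeInfo}}, {{Node}} and {{Pod}}, the methods {{addPod}}, {{addNode}} and {{nodeKnown}}, and the {{alwaysSetNode}} switch are all hypothetical names made up for illustration and are not the YuniKorn shim's actual code. The "fixed" variant shows one possible remedy (always attaching the node object on add node), which is not necessarily the change adopted for this issue.

```go
package main

import "fmt"

// Node and Pod are hypothetical stand-ins for the Kubernetes objects.
type Node struct{ Name string }
type Pod struct{ Name, NodeName string }

// nodeInfo is a per-node cache entry. It may be created by a pod event
// before the node object itself has been seen.
type nodeInfo struct {
	node *Node
	pods []*Pod
}

type cache struct {
	nodes         map[string]*nodeInfo
	alwaysSetNode bool // false reproduces the reported behaviour (assumption)
}

func newCache(alwaysSetNode bool) *cache {
	return &cache{nodes: map[string]*nodeInfo{}, alwaysSetNode: alwaysSetNode}
}

// addPod creates a placeholder entry when the pod is already bound to a
// node that the cache has not seen yet.
func (c *cache) addPod(p *Pod) {
	if p.NodeName == "" {
		return
	}
	info, ok := c.nodes[p.NodeName]
	if !ok {
		info = &nodeInfo{}
		c.nodes[p.NodeName] = info
	}
	info.pods = append(info.pods, p)
}

// addNode attaches the node object to its entry. In the buggy mode it
// returns early when an entry already exists, so the placeholder created
// by addPod never receives its node.
func (c *cache) addNode(n *Node) {
	info, ok := c.nodes[n.Name]
	if !ok {
		info = &nodeInfo{}
		c.nodes[n.Name] = info
	} else if !c.alwaysSetNode {
		return
	}
	info.node = n
}

// nodeKnown models the check a predicate would make before allocating.
func (c *cache) nodeKnown(name string) bool {
	info, ok := c.nodes[name]
	return ok && info.node != nil
}

func main() {
	for _, fixed := range []bool{false, true} {
		c := newCache(fixed)
		// Pod event arrives before the node event, as in the report.
		c.addPod(&Pod{Name: "fluentd-abc", NodeName: "newly-added-host"})
		c.addNode(&Node{Name: "newly-added-host"})
		fmt.Printf("alwaysSetNode=%v nodeKnown=%v\n", fixed, c.nodeKnown("newly-added-host"))
	}
}
```

With the pod event delivered first, the skip-if-exists variant leaves the entry without a node object, so the predicate-style check fails on every attempt. Attaching the node unconditionally makes the result independent of event order.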