Jie Yu created MESOS-2047: ----------------------------- Summary: Isolator cleanup failures shouldn't cause TASK_LOST. Key: MESOS-2047 URL: https://issues.apache.org/jira/browse/MESOS-2047 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Reporter: Jie Yu
Right now, if isolator cleanup fails, we'll transition all pending tasks to TASK_LOST (even in the OOM case, we should have transitioned it to TASK_FAILED). The problematic code is here: {noformat} 1052 void MesosContainerizerProcess::___destroy( 1053 const ContainerID& containerId, 1054 const Future<Option<int>>& status, 1055 const Future<list<Future<Nothing>>>& cleanups) 1056 { 1057 // This should not occur because we only use the Future<list> to 1058 // facilitate chaining. 1059 CHECK_READY(cleanups); 1060 1061 // Check cleanup succeeded for all isolators. If not, we'll fail the 1062 // container termination and remove the 'destroying' flag but leave 1063 // all other state. The container is now in an inconsistent state. 1064 foreach (const Future<Nothing>& cleanup, cleanups.get()) { 1065 if (!cleanup.isReady()) { 1066 promises[containerId]->fail( 1067 "Failed to clean up an isolator when destroying container '" + 1068 stringify(containerId) + "' :" + 1069 (cleanup.isFailed() ? cleanup.failure() : "discarded future")); 1070 1071 destroying.erase(containerId); 1072 1073 return; 1074 } 1075 } {noformat} Since launcher->destroy already succeeds (all processes are killed), instead of failing the promises[containerId], we probably should just export the error through metrics (so that people can get alerted on that) and still set the termination appropriately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)