[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"
[ https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944592#comment-16944592 ]

Meng Zhu commented on MESOS-10006:
----------------------------------

Cross-posting from Slack: thanks for the ticket! Unfortunately, the log does not contain much useful information; alas, we did not print out the slaveID upon check failure.

I sent out a patch to print more info upon check failure: https://reviews.apache.org/r/71581. We should consider backporting it.

Also, some hunch diagnosis: such CHECK failures on sorter function input arguments are almost always bugs on the caller side; in this case, most likely some race/inconsistency between the master and the allocator during recovery.

> Crash in Sorter: "Check failed: resources.contains(slaveId)"
> -------------------------------------------------------------
>
>                 Key: MESOS-10006
>                 URL: https://issues.apache.org/jira/browse/MESOS-10006
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0, 1.4.1, 1.9.0
>         Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are from 1.9.0).
>            Reporter: Terra Field
>            Priority: Major
>         Attachments: mesos-master.log.gz
>
> We've hit a similar exception on 3 different versions of the Mesos master (the line #/file name changes but the Check failed is the same), usually when under very high load:
> {noformat}
> F1003 22:06:54.463502  8579 sorter.hpp:339] Check failed: resources.contains(slaveId)
> {noformat}
> This particular occurrence happened after the election of a new master that was then stuck doing framework update broadcasts, as documented in MESOS-10005.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
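For context on what "print more info upon check failure" can look like: Mesos uses glog, and glog's CHECK macro accepts streamed context that is appended to the failure message. The snippet below is only an illustrative sketch of that pattern, not the actual change in https://reviews.apache.org/r/71581; the map and ids are stand-ins for the sorter's `resources` and `slaveId`.

{noformat}
// Illustrative sketch only, not the patch under review: glog's CHECK
// macro lets callers stream extra context onto the failure message, so a
// failing check can name the offending agent instead of aborting with
// just the bare expression.
#include <map>
#include <string>

#include <glog/logging.h>

int main(int argc, char** argv)
{
  google::InitGoogleLogging(argv[0]);

  std::map<std::string, int> resources = {{"agent-1", 4}};
  std::string slaveId = "agent-2";

  // Aborts (deliberately) with a message along the lines of:
  //   Check failed: resources.count(slaveId) > 0 Sorter has no
  //   resources for agent agent-2
  CHECK(resources.count(slaveId) > 0)
    << "Sorter has no resources for agent " << slaveId;

  return 0;
}
{noformat}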
[jira] [Commented] (MESOS-9962) Mesos may report completed task as running in the state.
[ https://issues.apache.org/jira/browse/MESOS-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944641#comment-16944641 ]

Benjamin Bannier commented on MESOS-9962:
------------------------------------------

A related issue is the early exit for the case where the framework is not connected, https://github.com/apache/mesos/blob/f1789b0fe5cad221b79a0bc2adfe2036cce6f33d/src/slave/slave.cpp#L5803-L5810.

> Mesos may report completed task as running in the state.
> ---------------------------------------------------------
>
>                 Key: MESOS-9962
>                 URL: https://issues.apache.org/jira/browse/MESOS-9962
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Meng Zhu
>            Assignee: Benjamin Bannier
>            Priority: Major
>              Labels: foundations
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on executor_shutdown_grace_period.
> 3) The executor exits before all terminal status updates reach the agent. This is more likely if executor_shutdown_grace_period passes.
> This results in a completed executor with non-terminal tasks (according to status updates).
> This would produce a confusing report where completed tasks are still TASK_RUNNING.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
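For illustration only, here is a toy, self-contained model of the failure shape described above; none of this is Mesos code, and the `Agent`/`statusUpdate` names are invented. It just shows how an early return on "framework not connected" in an update path can leave bookkeeping stuck at a non-terminal state, which is how a completed task can keep showing up as TASK_RUNNING.

{noformat}
// Toy model, not Mesos code: if the update path bails out early while
// the framework is disconnected, the terminal update is never applied,
// so the recorded state stays TASK_RUNNING even though the work is done.
#include <iostream>
#include <map>
#include <string>

enum class TaskState { TASK_RUNNING, TASK_FINISHED };

struct Agent
{
  bool frameworkConnected = false;
  std::map<std::string, TaskState> tasks = {
    {"task-1", TaskState::TASK_RUNNING}};

  void statusUpdate(const std::string& taskId, TaskState state)
  {
    if (!frameworkConnected) {
      return;  // Early exit: the terminal update is silently dropped.
    }
    tasks[taskId] = state;
  }
};

int main()
{
  Agent agent;

  // The executor finishes and sends its terminal update while the
  // framework is disconnected.
  agent.statusUpdate("task-1", TaskState::TASK_FINISHED);

  std::cout << "task-1 reported as "
            << (agent.tasks.at("task-1") == TaskState::TASK_RUNNING
                  ? "TASK_RUNNING" : "TASK_FINISHED")
            << std::endl;

  return 0;
}
{noformat}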
[jira] [Created] (MESOS-10007) random "Failed to get exit status for Command" for short-lived commands
Charles created MESOS-10007:
--------------------------------

             Summary: random "Failed to get exit status for Command" for short-lived commands
                 Key: MESOS-10007
                 URL: https://issues.apache.org/jira/browse/MESOS-10007
             Project: Mesos
          Issue Type: Bug
          Components: executor
            Reporter: Charles
         Attachments: test_scheduler.py

Hi,

While testing Mesos to see if we could use it at work, I encountered a random bug which I believe happens when a command exits really quickly, when run via the command executor.

See the attached test case, but basically all it does is constantly start "exit 0" tasks. At some point, a task randomly fails with the error "Failed to get exit status for Command":

{noformat}
'state': 'TASK_FAILED',
'message': 'Failed to get exit status for Command',
'source': 'SOURCE_EXECUTOR',
{noformat}

I've had a look at the code, and I found something which could potentially explain it. It's the first time I've looked at the code, so apologies if I'm missing something.

We can see the error originates from `reaped`:
[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017]

{noformat}
  } else if (status_->isNone()) {
    taskState = TASK_FAILED;
    message = "Failed to get exit status for Command";
  } else {
{noformat}

Looking at the code, we can see that the `status_` future can be set to `None` in `ReaperProcess::reap`:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69]

{noformat}
Future<Option<int>> ReaperProcess::reap(pid_t pid)
{
  // Check to see if this pid exists.
  if (os::exists(pid)) {
    Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
    promises.put(pid, promise);
    return promise->future();
  } else {
    return None();
  }
}
{noformat}

So we could get this if the process has already been reaped (`kill -0` will fail).

Now, looking at the code path which spawns the process: `launchTaskSubprocess`
[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724]
calls `subprocess`:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315]

If we look at the bottom of that function, we can see the following:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]

{noformat}
  // We need to bind a copy of this Subprocess into the onAny callback
  // below to ensure that we don't close the file descriptors before
  // the subprocess has terminated (i.e., because the caller doesn't
  // keep a copy of this Subprocess around themselves).
  process::reap(process.data->pid)
    .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));

  return process;
{noformat}

So at this point we've already called `process::reap`.

And after that, the executor also calls `process::reap`:
[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]

{noformat}
  // Monitor this process.
  process::reap(pid.get())
    .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));
{noformat}

But if we look at the implementation of `process::reap`:
[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152]

{noformat}
Future<Option<int>> reap(pid_t pid)
{
  // The reaper process is instantiated in `process::initialize`.
  process::initialize();

  return dispatch(
      internal::reaper,
      &internal::ReaperProcess::reap,
      pid);
}
{noformat}

we can see that `ReaperProcess::reap` is going to get called asynchronously.
Doesn't this mean that it's possible for the `reap` call set up by `subprocess` ([https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]) to get executed first, and, if the task has already exited by that time, for the child to get reaped before the `reap` call set up by the executor ([https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]) gets a chance to run?

In that case, when the executor's call runs,

{noformat}
if (os::exists(pid)) {
{noformat}

would return false, `reap` would set the future to `None`, and that would result in this error.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
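The `os::exists(pid)` guard is essentially a `kill(pid, 0)` probe (hence the "`kill -0` will fail" remark above), and once a child has been collected by `waitpid`, that probe fails with ESRCH. Below is a standalone sketch of just that mechanism, independent of libprocess; the direct `waitpid` call stands in for the child having already been reaped by the time the executor's registration is processed.

{noformat}
// Standalone illustration, not libprocess code: after a child has been
// reaped with waitpid(), a kill(pid, 0) probe (which is what an
// existence check like os::exists(pid) boils down to) fails with ESRCH,
// so a later reap registration for that pid would see "no such process".
#include <sys/types.h>
#include <sys/wait.h>

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main()
{
  pid_t pid = fork();
  if (pid == -1) {
    return 1;
  }

  if (pid == 0) {
    _exit(0);  // Child: exit immediately, like a short-lived "exit 0" task.
  }

  // Stand-in for the child having already been reaped before the
  // executor's reap registration runs.
  int status;
  waitpid(pid, &status, 0);

  // The existence probe now fails: the pid is gone.
  if (kill(pid, 0) == -1 && errno == ESRCH) {
    printf("pid %d no longer exists; an os::exists(pid) check here "
           "returns false\n", (int) pid);
  }

  return 0;
}
{noformat}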
[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"
[ https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944691#comment-16944691 ]

Meng Zhu commented on MESOS-10006:
----------------------------------

Debug patch landed in master, 1.9.x and 1.8.x (will be included in 1.9.1 and 1.8.2):

{noformat}
commit 3457771b42993c85e3da3c4550b233f61b14bc99 (origin/master, apache/master, master, check_slaveID)
Author: Meng Zhu
Date:   Fri Oct 4 10:48:40 2019 -0400

    Made `CHECK` in sorter print out more info upon failure.

    Review: https://reviews.apache.org/r/71581
{noformat}

> Crash in Sorter: "Check failed: resources.contains(slaveId)"
> -------------------------------------------------------------
>
>                 Key: MESOS-10006
>                 URL: https://issues.apache.org/jira/browse/MESOS-10006
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0, 1.4.1, 1.9.0
>         Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are from 1.9.0).
>            Reporter: Terra Field
>            Priority: Major
>         Attachments: mesos-master.log.gz
>
> We've hit a similar exception on 3 different versions of the Mesos master (the line #/file name changes but the Check failed is the same), usually when under very high load:
> {noformat}
> F1003 22:06:54.463502  8579 sorter.hpp:339] Check failed: resources.contains(slaveId)
> {noformat}
> This particular occurrence happened after the election of a new master that was then stuck doing framework update broadcasts, as documented in MESOS-10005.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)