[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"

2019-10-04 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944592#comment-16944592
 ] 

Meng Zhu commented on MESOS-10006:
--

Cross-posting from Slack:

Thanks for the ticket! Unfortunately, the log does not contain much useful
information, since we did not print out the slaveID upon check failure. I sent
out a patch to print more info upon check failure:
https://reviews.apache.org/r/71581
We should consider backporting it.

Also, a hunch: such CHECK failures on sorter function input arguments are
almost always bugs on the caller side; in this case, most likely some
race/inconsistency between the master and the allocator during recovery.



> Crash in Sorter: "Check failed: resources.contains(slaveId)"
> 
>
> Key: MESOS-10006
> URL: https://issues.apache.org/jira/browse/MESOS-10006
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0, 1.4.1, 1.9.0
> Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are 
> from 1.9.0).
>Reporter: Terra Field
>Priority: Major
> Attachments: mesos-master.log.gz
>
>
> We've hit a similar exception on 3 different versions of the Mesos master
> (the line number/file name changes but the "Check failed" message is the
> same), usually under very high load:
> {noformat}
> F1003 22:06:54.463502  8579 sorter.hpp:339] Check failed: 
> resources.contains(slaveId)
> {noformat}
> This particular occurrence happened after the election of a new master that 
> was then stuck doing framework update broadcasts, as documented in 
> MESOS-10005.
>  





[jira] [Commented] (MESOS-9962) Mesos may report completed task as running in the state.

2019-10-04 Thread Benjamin Bannier (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944641#comment-16944641
 ] 

Benjamin Bannier commented on MESOS-9962:
-

A related issue is the early exit for the case where the framework is not 
connected, 
https://github.com/apache/mesos/blob/f1789b0fe5cad221b79a0bc2adfe2036cce6f33d/src/slave/slave.cpp#L5803-L5810.
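
For context, the early exit being referenced has roughly this shape (a
paraphrased sketch of the pattern only; `framework->connected()` is a
hypothetical stand-in for the connectivity check the real code performs at
the linked lines):
{noformat}
// Sketch of the early-exit pattern referenced above; not the verbatim code.
// 'framework->connected()' stands in for the real connectivity check.
Framework* framework = getFramework(frameworkId);

if (framework == nullptr || !framework->connected()) {
  LOG(WARNING) << "Ignoring status update for disconnected framework "
               << frameworkId;
  return;  // Early exit: the rest of the handler never runs for this update.
}{noformat}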

> Mesos may report completed task as running in the state.
> 
>
> Key: MESOS-9962
> URL: https://issues.apache.org/jira/browse/MESOS-9962
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Meng Zhu
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: foundations
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> executor_shutdown_grace_period.
> 3) The executor exits before all terminal status updates reach the agent.
> This is more likely if executor_shutdown_grace_period passes.
> This results in a completed executor, with non-terminal tasks (according to 
> status updates).
> This would produce a confusing report where completed tasks are still 
> TASK_RUNNING.





[jira] [Created] (MESOS-10007) random "Failed to get exit status for Command" for short-lived commands

2019-10-04 Thread Charles (Jira)
Charles created MESOS-10007:
---

 Summary: random "Failed to get exit status for Command" for 
short-lived commands
 Key: MESOS-10007
 URL: https://issues.apache.org/jira/browse/MESOS-10007
 Project: Mesos
  Issue Type: Bug
  Components: executor
Reporter: Charles
 Attachments: test_scheduler.py

Hi,

While testing Mesos to see if we could use it at work, I encountered a random
bug which I believe happens when a command run via the command executor exits
very quickly.

See the attached test case, but basically all it does is constantly start "exit 
0" tasks.

At some point, a task randomly fails with the error "Failed to get exit status
for Command":
{noformat}
'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 'source': 'SOURCE_EXECUTOR',{noformat}
  

I've had a look at the code, and I found something which could potentially
explain it. It's the first time I've looked at the code, so apologies if I'm
missing something.

We can see the error originates from `reaped`:

[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017]
{noformat}
} else if (status_->isNone()) {
  taskState = TASK_FAILED;
  message = "Failed to get exit status for Command";
} else {{noformat}
 

Looking at the code, we can see that the `status_` future can be set to `None`
in `ReaperProcess::reap`:

[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69]
{noformat}
Future<Option<int>> ReaperProcess::reap(pid_t pid)
{
  // Check to see if this pid exists.
  if (os::exists(pid)) {
    Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
    promises.put(pid, promise);
    return promise->future();
  } else {
    return None();
  }
}{noformat}

So we could get `None` here if the process has already been reaped by the time
`ReaperProcess::reap` runs (the `kill(pid, 0)` probe in `os::exists` will fail).
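
For reference, that existence check boils down to a `kill(pid, 0)` probe; a
simplified, self-contained sketch of the logic (not the exact stout
`os::exists` implementation, and the helper name is just for the example)
would be:
{noformat}
// Simplified sketch of a pid-existence probe along the lines of what
// stout's os::exists(pid) does; not the exact implementation.
#include <cerrno>
#include <csignal>

#include <sys/types.h>

bool pid_exists(pid_t pid)
{
  if (::kill(pid, 0) == 0) {
    return true;  // The process (or its zombie entry) still exists.
  }

  // ESRCH means there is no such process, e.g. because it has already been
  // reaped; other errors (such as EPERM) still imply the process exists.
  return errno != ESRCH;
}{noformat}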

 

Now, looking at the code path which spawns the process: `launchTaskSubprocess`

[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724]

calls `subprocess`:

[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315]

If we look at the bottom of the function we can see the following:

[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462]
{noformat}
  // We need to bind a copy of this Subprocess into the onAny callback
  // below to ensure that we don't close the file descriptors before
  // the subprocess has terminated (i.e., because the caller doesn't
  // keep a copy of this Subprocess around themselves).
  process::reap(process.data->pid)
    .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));

  return process;{noformat}

So at this point we've already called `process::reap`.

And after that, the executor also calls `process::reap`:

[https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]
{noformat}
// Monitor this process.
process::reap(pid.get())
  .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));{noformat}

But if we look at the implementation of `process::reap`:

[https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152]
{noformat}
Future<Option<int>> reap(pid_t pid)
{
  // The reaper process is instantiated in `process::initialize`.
  process::initialize();

  return dispatch(
      internal::reaper,
      &internal::ReaperProcess::reap,
      pid);
}{noformat}
We can see that `ReaperProcess::reap` is going to get called asynchronously.

 

Doesn't this mean that it's possible that the first call to `reap` set up by
`subprocess`
([https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462])
will get executed first, and if the task has already exited by that time, the
child will get reaped before the call to `reap` set up by the executor
([https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801])
gets a chance to run?

 

In that case, when it runs
{noformat}
if (os::exists(pid)) {{noformat}
the check would return false, `reap` would set the future to `None`, and that
would result in this error.
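
To see the suspected ordering issue in isolation, here is a small,
self-contained POSIX sketch (plain fork/waitpid/kill, no Mesos or libprocess
code): once a first waiter has reaped the child, a later `kill(pid, 0)` probe
fails with ESRCH, which is exactly the condition under which
`ReaperProcess::reap` would return `None()`.
{noformat}
// Self-contained illustration of the underlying reaping semantics; this is
// not Mesos code, just the POSIX behavior the reasoning above relies on.
#include <cerrno>
#include <csignal>
#include <cstdio>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  pid_t pid = fork();
  if (pid == -1) {
    perror("fork");
    return 1;
  }

  if (pid == 0) {
    _exit(0);  // Short-lived child, like an `exit 0` task.
  }

  // First "reaper" (standing in for the reap registered by subprocess())
  // collects the exit status.
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    perror("waitpid");
    return 1;
  }

  // A later probe (standing in for os::exists() in the executor's reap
  // request) no longer finds the pid.
  if (kill(pid, 0) == -1 && errno == ESRCH) {
    printf("pid %d already reaped; a late reap() would return None()\n",
           (int) pid);
  }

  return 0;
}{noformat}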

 





[jira] [Commented] (MESOS-10006) Crash in Sorter: "Check failed: resources.contains(slaveId)"

2019-10-04 Thread Meng Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944691#comment-16944691
 ] 

Meng Zhu commented on MESOS-10006:
--

The debug patch landed in master, 1.9.x, and 1.8.x (it will be included in
1.9.1 and 1.8.2):
{noformat}
commit 3457771b42993c85e3da3c4550b233f61b14bc99 (origin/master, apache/master, 
master, check_slaveID)
Author: Meng Zhu 
Date:   Fri Oct 4 10:48:40 2019 -0400

Made `CHECK` in sorter print out more info upon failure.

Review: https://reviews.apache.org/r/71581
{noformat}
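
As an aside, the pattern here is simply to stream the relevant values into the
`CHECK` failure message. A minimal, self-contained sketch of that pattern
(glog only, with plain std types standing in for the Mesos `Resources` and
`SlaveID` types; the actual change is the one in the review linked above):
{noformat}
// Minimal glog sketch of a CHECK that prints context on failure; plain std
// types stand in for the Mesos types, and the message wording is invented.
// Running this aborts with the streamed message, by design.
#include <set>
#include <string>

#include <glog/logging.h>

int main(int argc, char** argv)
{
  google::InitGoogleLogging(argv[0]);

  const std::set<std::string> trackedAgents = {"agent-1", "agent-2"};
  const std::string slaveId = "agent-3";

  // Instead of a bare CHECK(...), stream the values that matter so the
  // crash log identifies which agent was missing.
  CHECK(trackedAgents.count(slaveId) > 0)
    << "Tracked agents do not contain '" << slaveId << "'";

  return 0;
}{noformat}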




