[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769727#comment-16769727
 ] 

Vinod Kone commented on MESOS-8750:
---

[~megha.sharma] [~xujyan] Why was this not backported to older versions?

> Check failed: !slaves.registered.contains(task->slave_id)
> -
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.6.0
> Reporter: Megha Sharma
> Assignee: Megha Sharma
> Priority: Critical
> Fix For: 1.6.0
>
>
> It appears that in certain circumstances an unreachable task is not cleaned 
> up from {{framework.unreachableTasks}} when its agent re-registers, which 
> leads to this check failure later, when the framework is being removed. When 
> an agent goes unreachable, the master adds the tasks from this agent to 
> {{framework.unreachableTasks}}. When such an agent re-registers, the master 
> removes from this data structure the tasks that the agent specifies during 
> re-registration. However, there could be tasks that the agent doesn't know 
> about, e.g. if the runTask message for them got dropped, and such tasks will 
> not get removed from {{unreachableTasks}}.
> {noformat}
> F0310 13:30:58.856665 62740 master.cpp:9671] Check failed: 
> !slaves.registered.contains(task->slave_id()) Unreachable task  of 
> framework 4f57975b-05dd-4118-8674-5b29a86c6a6c-0850 was found on registered 
> agent 683c4a92-b5a0-490c-998a-6113fc86d37a-S1428
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)

2018-05-03 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463379#comment-16463379
 ] 

Yan Xu commented on MESOS-8750:
---

{code:title=}
commit 520b729857223aeade345cbdf61209ec4f395ad9
Author: Megha Sharma 
Date:   Thu May 3 22:09:02 2018 -0700

Remove unknown unreachable tasks when agent reregisters.

A RunTaskMesssage could get dropped for an agent while it's
disconnected from the master and when such an agent goes unreachable
then this dropped task message gets added to the unreachable tasks.
When the agent reregisters, the master sends status updates for the
tasks that the agent reported when re-registering and these tasks are
also removed from the unreachableTasks on the framework but since the
agent doesn't know about the dropped task so it doesn't get removed
from the unreachableTasks leading to a check failure when
this inconsistency is detected during framework removal.

Review: https://reviews.apache.org/r/66644/
{code}
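
The bookkeeping described in the commit message can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual master code or the patch under review: the {{Master}} and {{Framework}} structs, the method names, and the {{removeUnknownTasks}} flag are all hypothetical and exist only to contrast the old and new behavior.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-ins for master state. These are illustrative types only,
// not the real Mesos classes.
struct Framework {
  // Task ID -> ID of the agent the task was launched on.
  std::map<std::string, std::string> unreachableTasks;
};

struct Master {
  std::set<std::string> registeredAgents;
  Framework framework;

  // The agent goes unreachable: every task the master tracks for it,
  // including tasks whose launch message was dropped, becomes unreachable.
  void markUnreachable(const std::string& agentId,
                       const std::set<std::string>& tasksKnownToMaster) {
    registeredAgents.erase(agentId);
    for (const std::string& taskId : tasksKnownToMaster) {
      framework.unreachableTasks[taskId] = agentId;
    }
  }

  // The agent reregisters and reports only the tasks it knows about.
  // With removeUnknownTasks == false (hypothetical flag modelling the old
  // behavior) only reported tasks are removed; with true, every unreachable
  // task that belonged to this agent is removed.
  void reregister(const std::string& agentId,
                  const std::set<std::string>& tasksReportedByAgent,
                  bool removeUnknownTasks) {
    registeredAgents.insert(agentId);
    auto it = framework.unreachableTasks.begin();
    while (it != framework.unreachableTasks.end()) {
      const bool onThisAgent = it->second == agentId;
      const bool reported = tasksReportedByAgent.count(it->first) > 0;
      if (onThisAgent && (reported || removeUnknownTasks)) {
        it = framework.unreachableTasks.erase(it);
      } else {
        ++it;
      }
    }
  }

  // Models the invariant behind the failed CHECK: no unreachable task
  // may refer to an agent that is currently registered.
  bool invariantHolds() const {
    for (const auto& entry : framework.unreachableTasks) {
      if (registeredAgents.count(entry.second) > 0) {
        return false;
      }
    }
    return true;
  }
};
```

In this model, marking an agent unreachable with tasks t1 and t2 and then reregistering it with only t1 reported leaves t2 behind when the flag is false, so {{invariantHolds()}} returns false. With the flag set to true the map ends up empty and the invariant holds.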

> Check failed: !slaves.registered.contains(task->slave_id)
> -
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
> Issue Type: Task
> Components: master
> Affects Versions: 1.6.0
> Reporter: Megha Sharma
> Assignee: Megha Sharma
> Priority: Critical





[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)

2018-04-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448618#comment-16448618
 ] 

ASF GitHub Bot commented on MESOS-8750:
---

Github user m9a closed the pull request at:

https://github.com/apache/mesos/pull/279


> Check failed: !slaves.registered.contains(task->slave_id)
> -
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
> Issue Type: Task
> Components: master
> Affects Versions: 1.6.0
> Reporter: Megha Sharma
> Assignee: Megha Sharma
> Priority: Critical





[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424306#comment-16424306
 ] 

ASF GitHub Bot commented on MESOS-8750:
---

Github user m9a commented on the issue:

https://github.com/apache/mesos/pull/279
  
The JIRA for this PR: https://issues.apache.org/jira/browse/MESOS-8750
Since @xujyan is shepherding it I intended to set him as the reviewer but 
it doesn't look like I can change those fields on the PR.


> Check failed: !slaves.registered.contains(task->slave_id)
> -
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
> Issue Type: Task
> Components: master
> Reporter: Megha Sharma
> Assignee: Megha Sharma
> Priority: Major
>
> It appears that in certain circumstances an unreachable task is not cleaned 
> up from framework.unreachableTasks when its agent re-registers, which leads 
> to this check failure later, when the framework is being removed. When an 
> agent goes unreachable, the master adds the tasks from this agent to 
> framework.unreachableTasks. When such an agent re-registers, the master 
> removes from this data structure the tasks that the agent specifies during 
> re-registration. However, there could be tasks that the agent doesn't know 
> about, e.g. if the runTask message for them got dropped, and such tasks will 
> not get removed from unreachableTasks.
> F0112 21:50:39.272985 44038 master.cpp:9617] Check failed: 
> !slaves.registered.contains(task->slave_id())
> Check failure stack trace: ***
>  @ 0x7fb7260692bd (unknown)
>  @ 0x7fb72606b04d (unknown)
>  @ 0x7fb726068e42 (unknown)
>  @ 0x7fb72606ba29 (unknown)
>  @ 0x7fb7251f5226 (unknown)
>  @ 0x7fb725120081 (unknown)
>  @ 0x7fb72519ca37 (unknown)
>  @ 0x7fb725fbb2fe (unknown)
>  @ 0x7fb724f75de9 (unknown)
>  @ 0x7fb725fb4fc2 (unknown)
>  @ 0x7fb725fc4a17 (unknown)
>  @ 0x7fb725fca276 (unknown)
>  @ 0x7fb72352d470 (unknown)
>  @ 0x7fb723784aa1 start_thread
>  @ 0x7fb722f47bcd clone
>  @ (nil) (unknown)
>  Aborted
>  


