[ 
https://issues.apache.org/jira/browse/MESOS-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839811#comment-17839811
 ] 

Benjamin Mahler commented on MESOS-7187:
----------------------------------------

Added a mitigation for the bug I commented on above: 
https://github.com/apache/mesos/pull/558
It does not fix the overall issue here, because there is no connection construct, 
but it prevents the agent from getting stuck sending TASK_DROPPED for all 
incoming tasks.

> Master can neglect to update agent metadata in a re-registration corner case.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7187
>                 URL: https://issues.apache.org/jira/browse/MESOS-7187
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Major
>              Labels: tech-debt
>
> If the agent is re-registering with the master for the first time, the master 
> will drop any re-registration messages that arrive while the registry 
> operation is in progress.
> These dropped messages can carry different metadata (e.g. version, 
> capabilities), which is then lost. Since the master doesn't distinguish 
> between different instances of the agent (both share the same UPID and there 
> is no instance identifying information), the master can't tell whether this 
> is a retry from the original instance of the agent or a re-registration from 
> a new instance of the agent.
> The following is an example:
> (1) Master restarts.
> (2) Agent re-registers with OLD_VERSION / OLD_CAPABILITIES.
> (3) While registry operation is in progress, agent is upgraded and 
> re-registers with NEW_VERSION / NEW_CAPABILITIES.
> (4) Registry operation completes; the new agent receives the re-registration 
> acknowledgement message and so does not retry.
> (5) Now the master's memory reflects OLD_VERSION / OLD_CAPABILITIES for the 
> agent, and it remains inconsistent until a later re-registration occurs.
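
The sequence in the example can be reproduced with a small toy model. This is only a sketch under stated assumptions, not the Mesos implementation: ToyMaster, ReregisterMessage, receive() and registry_operation_done() are hypothetical names made up for illustration, and the only behavior modeled is "drop re-registration messages while a registry operation for the same agent is in progress".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReregisterMessage:
    # Hypothetical message shape; note there is no instance identifier,
    # so two agent instances with the same agent_id look identical.
    agent_id: str
    version: str
    capabilities: frozenset


class ToyMaster:
    """Hypothetical model of the master's bookkeeping (an assumption)."""

    def __init__(self):
        self.agents = {}       # agent_id -> metadata held in master memory
        self.in_progress = {}  # agent_id -> message that started the registry op

    def receive(self, msg):
        # While a registry operation is in flight, the message is dropped,
        # together with whatever newer metadata it carried.
        if msg.agent_id in self.in_progress:
            return False
        self.in_progress[msg.agent_id] = msg
        return True

    def registry_operation_done(self, agent_id):
        # The master records the metadata from the message that started
        # the operation and acknowledges whichever instance is listening.
        msg = self.in_progress.pop(agent_id)
        self.agents[agent_id] = msg
        return msg


master = ToyMaster()  # (1) master restarts with empty in-memory state
old = ReregisterMessage("agent-1", "OLD_VERSION", frozenset({"OLD_CAPABILITIES"}))
new = ReregisterMessage("agent-1", "NEW_VERSION", frozenset({"NEW_CAPABILITIES"}))

accepted_old = master.receive(old)         # (2) old instance re-registers
accepted_new = master.receive(new)         # (3) upgraded instance is dropped
master.registry_operation_done("agent-1")  # (4) ack goes out, nobody retries
stored = master.agents["agent-1"]          # (5) master memory is stale
```

In this model the upgraded agent's message is rejected and the master ends up storing the old instance's metadata. Adding an instance identifier to the message (again, an assumption about one possible direction, not something this ticket specifies) would let receive() tell a retry apart from a new instance.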


