[ https://issues.apache.org/jira/browse/MESOS-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839811#comment-17839811 ]
Benjamin Mahler commented on MESOS-7187:
----------------------------------------

Added a mitigation of the bug I commented on above:

https://github.com/apache/mesos/pull/558

It does not fix the overall issue here due to a lack of a connection construct, but it prevents the agent from getting stuck sending TASK_DROPPED for all incoming tasks.

> Master can neglect to update agent metadata in a re-registration corner case.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7187
>                 URL: https://issues.apache.org/jira/browse/MESOS-7187
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Major
>              Labels: tech-debt
>
> If the agent is re-registering with the master for the first time, the master
> will drop any re-registration messages that arrive while the registry
> operation is in progress.
> These dropped messages can have different metadata (e.g. version,
> capabilities, etc) that gets dropped. Since the master doesn't distinguish
> between different instances of the agent (both share the same UPID and there
> is no instance identifying information), the master can't tell whether this
> is a retry from the original instance of the agent or a re-registration from
> a new instance of the agent.
> The following is an example:
> (1) Master restarts.
> (2) Agent re-registers with OLD_VERSION / OLD_CAPABILITIES.
> (3) While registry operation is in progress, agent is upgraded and
> re-registers with NEW_VERSION / NEW_CAPABILITIES.
> (4) Registry operation completes, new agent receives the re-registration
> acknowledgement message and so, does not retry.
> (5) Now, the master's memory reflects OLD_VERSION / OLD_CAPABILITIES for the
> agent which remains inconsistent until a later re-registration occurs.
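
For readers who want to step through the five-step example in the issue description, here is a minimal toy model of the race. It is only a sketch and not the Mesos implementation: the `Master` and `Reregister` classes, their fields, and `run_scenario` are hypothetical names invented for illustration, and the real master is C++ code built on libprocess. The only behavior taken from the issue is that re-registrations arriving during the registry operation are dropped, and that the master cannot tell a retry from a new agent instance.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reregister:
    """Hypothetical stand-in for an agent re-registration message."""
    agent_id: str
    version: str
    capabilities: frozenset


@dataclass
class Master:
    """Toy master that drops re-registrations while a registry write is pending."""
    pending: dict = field(default_factory=dict)  # agent_id -> message being admitted
    agents: dict = field(default_factory=dict)   # agent_id -> metadata held in memory
    dropped: list = field(default_factory=list)  # messages ignored during the write

    def reregister(self, msg):
        if msg.agent_id in self.pending:
            # No instance identifier: a retry and a new instance look identical,
            # so the message (and the metadata it carries) is dropped.
            self.dropped.append(msg)
            return
        self.pending[msg.agent_id] = msg

    def registry_operation_done(self, agent_id):
        # The stored metadata comes from the first message, not the latest one.
        msg = self.pending.pop(agent_id)
        self.agents[agent_id] = (msg.version, msg.capabilities)
        return "ack"  # received by whichever instance now owns the UPID


def run_scenario():
    master = Master()  # (1) master restarts with empty in-memory state
    old = Reregister("agent-1", "OLD_VERSION", frozenset({"OLD_CAPABILITIES"}))
    new = Reregister("agent-1", "NEW_VERSION", frozenset({"NEW_CAPABILITIES"}))
    master.reregister(old)                     # (2) first re-registration
    master.reregister(new)                     # (3) upgraded agent, dropped
    master.registry_operation_done("agent-1")  # (4) ack reaches the new agent
    return master                              # (5) memory holds stale metadata
```

Calling `run_scenario()` leaves `master.agents["agent-1"]` holding OLD_VERSION while the NEW_VERSION message sits in `master.dropped`. That is the inconsistency the issue describes, and it persists until a later re-registration occurs.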