IIRC you can avoid the issue by either using a different work_dir for the agent, or removing (and, possibly, re-creating) the existing work_dir.
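
A minimal, untested sketch of the "remove and re-create" option, in case it helps a configuration management hook decide when to do it. The function names (`parse_attributes`, `attributes_changed`, `reset_work_dir`) are hypothetical and not part of Mesos. It assumes attributes are given in the usual `key:value;key:value` form and treats their order as irrelevant, which may not match how the agent itself compares them. Wiping the work_dir discards all checkpointed state, so the agent will not recover anything from previous runs.

```python
import shutil
from pathlib import Path


def parse_attributes(text: str) -> dict:
    """Parse a 'key:value;key:value' string into a dict.

    Simplified on purpose: values are kept as raw strings, so ranges and
    sets are compared textually rather than semantically.
    """
    attrs = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = item.partition(":")
        attrs[key.strip()] = value.strip()
    return attrs


def attributes_changed(old: str, new: str) -> bool:
    """Return True if the two attribute strings differ (order-insensitive)."""
    return parse_attributes(old) != parse_attributes(new)


def reset_work_dir(work_dir: Path) -> None:
    """Remove and re-create work_dir, discarding all checkpointed agent state."""
    if work_dir.exists():
        shutil.rmtree(work_dir)
    work_dir.mkdir(parents=True)
```

The agent should be stopped before calling `reset_work_dir`, and the call only makes sense when `attributes_changed` reports a real difference; otherwise the existing state can be left alone.
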
I'm afraid I don't have a running instance of Mesos on this machine and can't test it out.

Also (and this is strictly my opinion :) I would consider a change of attribute a "material" change for the Agent and I would avoid trying to recover state from previous runs; but, again, there may be perfectly legitimate cases in which this is desirable.

--
*Marco Massenzio*
http://codetrips.com

On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <[email protected]> wrote:

> Hi,
>
> We recently discovered that updating attributes on Mesos agents is a very
> risky operation, and has the potential to send agent(s) into a crash loop if
> not done properly, with errors like "Failed to perform recovery:
> Incompatible slave info detected". This, combined with --recovery_timeout,
> made the situation even worse.
>
> In our setup, some of the attributes are generated by an automated
> configuration management system, so this opens the possibility that a "bad"
> configuration could be left on the machine and cause big trouble on the next
> agent upgrade, if the USR1 signal was not sent in time.
>
> Some questions:
>
> 1. Does anyone have a recommended good practice for managing these
> attributes safely?
> 2. Has Mesos considered falling back to the old metadata if it detects an
> incompatibility, so agents would keep running with old attributes instead
> of falling into a crash loop?
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li

