IIRC you can avoid the issue by either using a different work_dir for the
agent, or removing (and, possibly, re-creating) the existing one.

I'm afraid I don't have a running instance of Mesos on this machine and
can't test it out.
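
For what it's worth, here is a rough and equally untested sketch of the
second option in Python. It assumes the agent process has already been
stopped; the helper name reset_work_dir and the backup naming are made up
for illustration and are not part of Mesos. Note that starting from an
empty work_dir means the agent will not recover anything from previous runs.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def reset_work_dir(work_dir: str, keep_backup: bool = True) -> Optional[Path]:
    """Set aside (or delete) an agent work_dir and re-create it empty.

    Hypothetical helper: assumes the agent is already stopped.
    Returns the backup path, or None if nothing was backed up.
    """
    path = Path(work_dir)
    backup = None
    if path.exists():
        if keep_backup:
            # Move the old state next to the original, with a timestamp suffix.
            backup = path.with_name(f"{path.name}.bak-{int(time.time())}")
            path.rename(backup)
        else:
            # Discard the old state entirely.
            shutil.rmtree(path)
    # Re-create an empty directory for the agent to start from.
    path.mkdir(parents=True, exist_ok=True)
    return backup
```

You would call it with whatever path you pass to the agent as work_dir,
e.g. reset_work_dir("/path/to/work_dir"), and then start the agent again.
Keeping the backup by default is just a cautious choice so the old state
can be inspected or restored.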

Also (and this is strictly my opinion :) I would consider a change of
attribute a "material" change for the Agent and I would avoid trying to
recover state from previous runs; but, again, there may be perfectly
legitimate cases in which this is desirable.

-- 
*Marco Massenzio*
http://codetrips.com

On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <[email protected]> wrote:

> Hi,
>
> We recently discovered that updating attributes on Mesos agents is a very
> risky operation, and has the potential to send agent(s) into a crash loop
> if not done properly, with errors like "Failed to perform recovery:
> Incompatible slave info detected". This, combined with --recovery_timeout,
> made the situation even worse.
>
> In our setup, some of the attributes are generated by an automated
> configuration management system, so this opens the possibility that a
> "bad" configuration could be left on the machine and cause big trouble on
> the next agent upgrade, if the USR1 signal was not sent in time.
>
> Some questions:
>
> 1. Does anyone have a recommended practice for managing these
> attributes safely?
> 2. Has Mesos considered falling back to the old metadata if it detects an
> incompatibility, so agents would keep running with the old attributes
> instead of falling into a crash loop?
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li
>
