Hi,

We recently discovered that updating attributes on Mesos agents is a very
risky operation: if not done properly, it can send agent(s) into a crash
loop with errors like "Failed to perform recovery: Incompatible slave info
detected". Combined with --recovery_timeout, this made the situation
even worse.

In our setup, some of the attributes are generated by an automated
configuration management system, so it is possible for a "bad"
configuration to be left on the machine and cause big trouble on the next
agent upgrade if the USR1 signal is not sent in time.

Some questions:

1. Does anyone have a recommended practice for managing these
attributes safely?
2. Has Mesos considered falling back to the old metadata if it detects an
incompatibility, so that agents keep running with the old attributes
instead of falling into a crash loop?

Thanks.

-- 
Cheers,

Zhitao Li
