Hi,

We recently discovered that updating attributes on Mesos agents is a very risky operation: if not done properly, it can send agents into a crash loop with errors like "Failed to perform recovery: Incompatible slave info detected". Combined with --recovery_timeout, this makes the situation even worse.
In our setup, some of the attributes are generated by an automated configuration management system, so a "bad" configuration could be left on the machine and cause serious trouble on the next agent upgrade if the USR1 signal is not sent in time.

Some questions:

1. Does anyone have a recommended practice for managing these attributes safely?
2. Has Mesos considered falling back to the old metadata when it detects an incompatibility, so that agents keep running with the old attributes instead of falling into a crash loop?

Thanks.

--
Cheers,
Zhitao Li