Hey folks,

We got our Mesos cluster (0.28.1) into an interesting state during some
chaos monkey testing. I killed off 5 of our 16 agents to simulate an AZ
outage, and then accidentally killed off almost all running tasks (a little
more than 1,000 of our ~1,300 tasks -- not intentional but an interesting
test nonetheless!). During this time, we noticed:

- The time between our framework accepting an offer and the master
considering the task launched spiked to ~2 minutes (doubly problematic
given our 2-minute offer timeout)
- It would take up to 8 minutes for TASK_KILLED status updates from an
agent to be acknowledged by the master.
- The master logs contained tons of log lines mentioning "Performing
explicit task state reconciliation..."
- The killed agents took ~5 minutes to recover after I booted them back up.
- The whole time, resources were offered to the framework at a normal rate.

I understand that this is an exceptional situation, but does anyone have
any insight into exactly what's going on behind the scenes? It sounds like
all the status updates were backed up in a queue and the master was
processing them one at a time. Is there something we could have done
better in our framework <https://github.com/hubspot/singularity> to handle
this more gracefully? Is there any sort of monitoring of the master
backlog that we can take advantage of?
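For context, the closest thing we've found so far is polling the master's
/metrics/snapshot endpoint (a flat JSON map of metric name -> value) for the
event-queue counters. A minimal sketch of what we have in mind, with the
caveat that the metric names below are taken from the Mesos monitoring docs
and should be double-checked against 0.28.1:

```python
# Sketch: pull backlog-related counters from the Mesos master's
# /metrics/snapshot endpoint. Metric names are assumptions based on the
# Mesos monitoring docs -- verify them against your master's actual output.
import json
import urllib.request

BACKLOG_METRICS = [
    "master/event_queue_messages",    # messages waiting in the master's queue
    "master/event_queue_dispatches",  # internal dispatches waiting in the queue
    "master/messages_status_update",  # status updates received so far
]

def backlog_metrics(snapshot):
    """Extract backlog-related counters from a parsed /metrics/snapshot map."""
    return {name: snapshot[name] for name in BACKLOG_METRICS if name in snapshot}

def fetch_snapshot(master="http://localhost:5050"):
    """Fetch and parse the master's metrics snapshot."""
    with urllib.request.urlopen(master + "/metrics/snapshot") as resp:
        return json.loads(resp.read().decode())

if __name__ == "__main__":
    print(backlog_metrics(fetch_snapshot()))
```

If master/event_queue_messages really does reflect the backed-up status
updates, graphing it would presumably have shown the spike we saw.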

Happy to include master / agent / framework logs if necessary.

Thanks,
Tom
