Zhitao Li created MESOS-7648:
--------------------------------

             Summary: Mesos master should not return `/state` before finishing recovering agents from registry
                 Key: MESOS-7648
                 URL: https://issues.apache.org/jira/browse/MESOS-7648
             Project: Mesos
          Issue Type: Bug
            Reporter: Zhitao Li

We are working on relying on {{recovered_agents}} in MESOS-6177. However, we discovered that the master can start responding to the {{/state.json}} endpoint before it finishes processing the result of registry::recover.

The sequence appears to be:

registry was recovered -> /state query comes in -> agents recovered from registry

See the following logs:

{noformat}
I0608 22:29:57.147212 6407 master.cpp:2124] Elected as the leading master!
I0608 22:29:57.147274 6407 master.cpp:1646] Recovering from registrar
I0608 22:29:57.148114 6412 log.cpp:553] Attempting to start the writer
I0608 22:29:57.149339 6411 replica.cpp:495] Replica received implicit promise request from __req_res__(2)@10.162.9.54:5050 with proposal 105
I0608 22:29:57.149860 6411 replica.cpp:344] Persisted promised to 105
I0608 22:29:57.151495 6410 coordinator.cpp:238] Coordinator attempting to fill missing positions
I0608 22:29:57.151595 6412 log.cpp:569] Writer started with ending position 36816
I0608 22:29:58.111565 6423 registrar.cpp:362] Successfully fetched the registry (1200222B) in 934048us
I0608 22:29:58.214422 6423 registrar.cpp:461] Applied 1 operations in 25.893664ms; attempting to update the registry
I0608 22:29:58.300578 6421 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 36817
I0608 22:29:58.307567 6410 replica.cpp:539] Replica received write request for position 36817 from __req_res__(7)@10.162.9.54:5050
I0608 22:29:58.344857 6421 replica.cpp:693] Replica received learned notice for position 36817 from @0.0.0.0:0
I0608 22:29:58.378731 6408 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 36818
I0608 22:29:58.382043 6416 replica.cpp:539] Replica received write request for position 36818 from __req_res__(12)@10.162.9.54:5050
I0608 22:29:58.384946 6410 replica.cpp:693] Replica received learned notice for position 36818 from @0.0.0.0:0
I0608 22:29:59.507297 6423 registrar.cpp:506] Successfully updated the registry in 1.282937088secs
I0608 22:29:59.580960 6423 registrar.cpp:392] Successfully recovered registrar
I0608 22:29:59.940066 6415 http.cpp:420] HTTP GET for /master/state from 10.67.139.161:57197 with User-Agent='mesos-uns-bridge'
I0608 22:30:00.342932 6425 master.cpp:1762] Recovered 3549 agents from the registry (1200220B); allowing 15mins for agents to re-register
{noformat}

We found that the request on the second-to-last line above returned 0 registered or recovered agents, which misled the client into treating the cluster as empty.

[~anandmazumdar] [~vinodkone]
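
To make the ordering problem concrete, here is a minimal toy model in Python. It is only a sketch and not Mesos code: {{ToyMaster}}, {{finish_recovery}}, {{state_unguarded}} and {{state_guarded}} are hypothetical names, and answering with a 503 while recovery is in progress is an assumed behaviour, not necessarily how the master should or will handle it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for a master; this is not Mesos code."""

    recovered_agents: list = field(default_factory=list)
    # Set only once agents have been loaded from the registry.
    recovery_done: bool = False

    def finish_recovery(self, agents_from_registry):
        # Models the "Recovered N agents from the registry" step in the log.
        self.recovered_agents = list(agents_from_registry)
        self.recovery_done = True

    def state_unguarded(self):
        # Models the reported behaviour: answers regardless of recovery progress.
        return 200, {"recovered_agents": len(self.recovered_agents)}

    def state_guarded(self):
        # Assumed fix: refuse to answer until recovery has finished.
        if not self.recovery_done:
            return 503, {"error": "master is still recovering"}
        return 200, {"recovered_agents": len(self.recovered_agents)}
```

In this model, a query that arrives before {{finish_recovery}} gets a 200 with zero agents from the unguarded handler, which is indistinguishable from an empty cluster. The guarded handler returns an explicit error instead, so a client can retry rather than act on incomplete state.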