Zhitao Li created MESOS-7648:
--------------------------------

             Summary: Mesos master should not return `/state` before finishing 
recovering agents from registry
                 Key: MESOS-7648
                 URL: https://issues.apache.org/jira/browse/MESOS-7648
             Project: Mesos
          Issue Type: Bug
            Reporter: Zhitao Li


We want to rely on {{recovered_agents}} in MESOS-6177. However, we 
discovered that the master can start responding to the {{/state.json}} endpoint 
before it finishes processing the result of registry::recover.

The sequence appears to be: the registry is recovered -> a /state query comes 
in -> agents are recovered from the registry.
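
The ordering can be replayed with a toy model. This is only a sketch, not Mesos code: {{ToyMaster}}, its method names, and the optional 503 guard are all hypothetical and serve only to show why a query that lands between the two recovery steps sees zero agents.

```python
class ToyMaster:
    """Toy stand-in for the master; hypothetical, not the Mesos implementation."""

    def __init__(self):
        self.registry = None          # agents read from the registry
        self.recovered_agents = None  # None until the registry is applied

    def registrar_recovered(self, agents):
        # Step 1: the registry has been fetched, but not yet applied.
        self.registry = list(agents)

    def apply_recovery(self):
        # Step 3: the master turns registry entries into recovered agents.
        self.recovered_agents = list(self.registry)

    def state(self, guard=False):
        # Step 2 (when called between steps 1 and 3): a /state query.
        if self.recovered_agents is None:
            if guard:
                # Assumed mitigation: refuse to answer until recovery is done.
                return 503, None
            # Behaviour reported in this ticket: an "empty" cluster.
            return 200, {"recovered_agents": 0}
        return 200, {"recovered_agents": len(self.recovered_agents)}


master = ToyMaster()
master.registrar_recovered(["agent-1", "agent-2", "agent-3"])
early = master.state()              # query arrives before step 3
guarded = master.state(guard=True)  # same moment, with the assumed guard
master.apply_recovery()
late = master.state()               # query arrives after step 3
```

In this model {{early}} reports zero recovered agents even though the registry already holds three, while {{late}} reports all of them. The guarded variant is one possible way to make the intermediate state visible to clients; whether the master should reject, delay, or otherwise handle such requests is left open here.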

See the following logs:

{noformat}
I0608 22:29:57.147212  6407 master.cpp:2124] Elected as the leading master!
I0608 22:29:57.147274  6407 master.cpp:1646] Recovering from registrar
I0608 22:29:57.148114  6412 log.cpp:553] Attempting to start the writer
I0608 22:29:57.149339  6411 replica.cpp:495] Replica received implicit promise 
request from __req_res__(2)@10.162.9.54:5050 with proposal 105
I0608 22:29:57.149860  6411 replica.cpp:344] Persisted promised to 105
I0608 22:29:57.151495  6410 coordinator.cpp:238] Coordinator attempting to fill 
missing positions
I0608 22:29:57.151595  6412 log.cpp:569] Writer started with ending position 
36816
I0608 22:29:58.111565  6423 registrar.cpp:362] Successfully fetched the 
registry (1200222B) in 934048us
I0608 22:29:58.214422  6423 registrar.cpp:461] Applied 1 operations in 
25.893664ms; attempting to update the registry
I0608 22:29:58.300578  6421 coordinator.cpp:348] Coordinator attempting to 
write APPEND action at position 36817
I0608 22:29:58.307567  6410 replica.cpp:539] Replica received write request for 
position 36817 from __req_res__(7)@10.162.9.54:5050
I0608 22:29:58.344857  6421 replica.cpp:693] Replica received learned notice 
for position 36817 from @0.0.0.0:0
I0608 22:29:58.378731  6408 coordinator.cpp:348] Coordinator attempting to 
write TRUNCATE action at position 36818
I0608 22:29:58.382043  6416 replica.cpp:539] Replica received write request for 
position 36818 from __req_res__(12)@10.162.9.54:5050
I0608 22:29:58.384946  6410 replica.cpp:693] Replica received learned notice 
for position 36818 from @0.0.0.0:0
I0608 22:29:59.507297  6423 registrar.cpp:506] Successfully updated the 
registry in 1.282937088secs
I0608 22:29:59.580960  6423 registrar.cpp:392] Successfully recovered registrar
I0608 22:29:59.940066  6415 http.cpp:420] HTTP GET for /master/state from 
10.67.139.161:57197 with User-Agent='mesos-uns-bridge'
I0608 22:30:00.342932  6425 master.cpp:1762] Recovered 3549 agents from the 
registry (1200220B); allowing 15mins for agents to re-register
{noformat}

We found that the request in the second-to-last log entry above returned 0 
registered or recovered agents, which misled its client into thinking the 
cluster was empty.
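
Until the master side is fixed, a client could defend itself by not trusting a single empty response. The sketch below is a hypothetical client-side filter, not part of any Mesos client: the class name, the {{required}} threshold, and the idea of passing in pre-extracted counts are all assumptions made for illustration.

```python
class EmptyClusterFilter:
    """Accept an empty agent count only after it repeats `required` times.

    Hypothetical helper; the caller extracts the registered and recovered
    agent counts from each /state response and passes them in.
    """

    def __init__(self, required=3):
        self.required = required   # consecutive empty polls needed
        self.empty_streak = 0      # empty polls seen in a row so far
        self.last_trusted = None   # last agent total we believed

    def observe(self, registered, recovered):
        total = registered + recovered
        if total > 0:
            # Any non-empty answer is trusted immediately.
            self.empty_streak = 0
            self.last_trusted = total
            return total
        self.empty_streak += 1
        if self.empty_streak >= self.required:
            # The cluster has looked empty long enough to believe it.
            self.last_trusted = 0
        return self.last_trusted


flt = EmptyClusterFilter(required=2)
before_failover = flt.observe(3549, 0)  # normal poll
during_recovery = flt.observe(0, 0)     # the misleading response
after_recovery = flt.observe(0, 3549)   # agents show up as recovered
```

Here the single empty response is ignored and the previous total is kept. This only hides the symptom and delays detection of a cluster that really is empty, so it is a mitigation rather than a fix.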

[~anandmazumdar] [~vinodkone]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)