Alexey Serbin created KUDU-3530:
-----------------------------------

             Summary: Add guardrails to prevent inconsistencies on attempts to 
add multiple Kudu masters at once in a cluster
                 Key: KUDU-3530
                 URL: https://issues.apache.org/jira/browse/KUDU-3530
             Project: Kudu
          Issue Type: Improvement
          Components: master
            Reporter: Alexey Serbin


There have been a few reports of inconsistencies in the system catalog tablet's 
Raft configuration after attempts to add multiple new masters to a Kudu cluster 
at once.  The current implementation of the {{AddMaster}} RPC does not appear to 
be thread-safe: the Raft configuration of the system catalog tablet became 
corrupted after an attempt to add several extra masters simultaneously (i.e. by 
starting multiple to-be-added masters at the same time).  On its next restart, 
the original Kudu master reported an error like the following:

{noformat}
Invalid argument: RunMasterServer() failed: Unable to initialize catalog 
manager: Failed to initialize sys tables async: on-disk master list (:0) and 
provided master list (m1.my.org:7051, m2.my.org:7051, m3.my.org:7051) differ by 
more than one address. Their symmetric difference is: :0, m1.my.org:7051, 
m2.my.org:7051, m3.my.org:7051
{noformat}
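
To make the error above concrete, here is a minimal sketch of the kind of 
comparison it describes: taking the symmetric difference of the on-disk and 
provided master lists and accepting them only if they differ by at most one 
address.  The helper names {{SymmetricDifference}} and {{DiffersByAtMostOne}} 
are hypothetical and chosen for illustration; this is not Kudu's actual code.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not from the Kudu code base): returns the addresses
// present in exactly one of the two master lists, in sorted order.
std::vector<std::string> SymmetricDifference(
    const std::set<std::string>& on_disk,
    const std::set<std::string>& provided) {
  std::vector<std::string> diff;
  std::set_symmetric_difference(on_disk.begin(), on_disk.end(),
                                provided.begin(), provided.end(),
                                std::back_inserter(diff));
  return diff;
}

// Hypothetical helper: true if the two lists are identical or differ by a
// single address, i.e. at most one master was added or removed.
bool DiffersByAtMostOne(const std::set<std::string>& on_disk,
                        const std::set<std::string>& provided) {
  return SymmetricDifference(on_disk, provided).size() <= 1;
}
```

With the lists from the error message, the on-disk list {{:0}} and the provided 
list of three masters share no addresses, so the symmetric difference contains 
all four entries and the check fails.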

It would be great to have guardrails preventing such corruption.  
Essentially, we should enforce the one-new-master-at-a-time invariant, which the 
current implementation implicitly assumes but doesn't consistently enforce.
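
One possible shape for such a guardrail is sketched below, purely as an 
illustration: a small lock-protected gate that admits a single in-flight master 
addition and rejects any other until the first one finishes.  The class name 
{{AddMasterGuard}} and its methods are assumptions made for this example and do 
not correspond to existing Kudu code; an actual fix would more likely live in 
the handling of the {{AddMaster}} RPC or of the Raft configuration change 
itself.

```cpp
#include <mutex>
#include <string>

// Hypothetical guard (not part of Kudu): allows at most one master addition
// to be in progress at any time.
class AddMasterGuard {
 public:
  // Returns true and records 'addr' if no addition is pending; returns false
  // without changing state if another addition is already in progress.
  bool TryBegin(const std::string& addr) {
    std::lock_guard<std::mutex> l(lock_);
    if (has_pending_) {
      return false;
    }
    has_pending_ = true;
    pending_addr_ = addr;
    return true;
  }

  // Clears the pending addition, but only if 'addr' is the master that
  // currently holds the slot; calls for any other address are ignored.
  void Finish(const std::string& addr) {
    std::lock_guard<std::mutex> l(lock_);
    if (has_pending_ && pending_addr_ == addr) {
      has_pending_ = false;
      pending_addr_.clear();
    }
  }

 private:
  std::mutex lock_;
  bool has_pending_ = false;
  std::string pending_addr_;
};
```

In this sketch, a second master that starts while the first addition is still 
pending gets a rejection from {{TryBegin}} and would have to retry later, 
rather than racing with the first one to modify the Raft configuration.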



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
