Hi Ray,

It seems the problem is that the "kudu" user is not authorized to call UpdateConsensus on the other masters. Kudu only allows its internal ConsensusService RPCs from the service user (the OS user the daemon runs as) and from configured superusers, so a mismatch between the users the three masters run as would produce exactly these "Not authorized" errors. What user are the other two masters running as?
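For example, you could compare the effective user of the kudu-master process on each host, and check whether any ACL flags differ between the masters. This is only a rough sketch; the process name and the gflagfile path depend on how Kudu is installed and started on your hosts (packages, systemd, Cloudera Manager, etc.):

    # show which OS user each kudu-master process runs as (run on every master host)
    ps -o user=,cmd= -C kudu-master

    # check whether the masters disagree on any ACL-related flags
    # (path is a guess for a package install; adjust for your setup)
    grep -E 'superuser_acl|user_acl' /etc/kudu/conf/master.gflagfile

If the masters really do need to run as different users, the extra user would have to be whitelisted on all of them (for example via --superuser_acl) and the masters restarted.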
I wouldn't recommend wiping the master. It most likely wouldn't solve the problem, and Kudu can't automatically recover from a deleted master; you would need to recreate it manually (a rough sketch of that procedure is below, after your quoted message).

Attila

On Wed, Jul 15, 2020 at 06:41:34AM +0000, Ray Liu (rayliu) wrote:
> We have a Kudu cluster with 3 masters and 9 tablet servers.
> When we try to drop a table with more than a thousand tablets, the leader master crashed.
> The last logs for the crashed master are a bunch of
> W0715 04:00:57.330158 30337 catalog_manager.cc:3485] TS cd17b92888a84d39b2adcad1ca947037 (hdsj1kud005.webex.com:7050): delete failed for tablet 4250e813a29e4ca7a2633c6015c5530d because the tablet was not found. No further retry: Not found: Tablet not found: 4250e813a29e4ca7a2633c6015c5530d
>
> Before these delete-failed logs, there are many:
> W0715 03:59:40.047675 30336 connection.cc:361] RPC call timeout handler was delayed by 11.8487s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 3s for a response.
>
> So, when this leader master crashed, a new leader master was elected from the remaining two masters.
> But when I try to restart the crashed master, it has been stuck forever (2 hours so far).
> The logs are a repetition of these:
>
> I0715 06:30:36.438797 18042 raft_consensus.cc:465] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
> I0715 06:30:36.438868 18042 raft_consensus.cc:487] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "e8f90d84b4754a379ffedaa32b528fb4" member_type: VOTER last_known_addr { host: "master1" port: 7051 } } peers { permanent_uuid: "ba89996893e44391a10a2fc1f2c2ada3" member_type: VOTER last_known_addr { host: "master2" port: 7051 } } peers { permanent_uuid: "81338568ef854b10ac0acac1d9eeeb6c" member_type: VOTER last_known_addr { host: "master3" port: 7051 } }
> I0715 06:30:36.439072 18042 leader_election.cc:296] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: Requested pre-vote from peers e8f90d84b4754a379ffedaa32b528fb4 (master1:7051), ba89996893e44391a10a2fc1f2c2ada3 (master2:7051)
> W0715 06:30:36.439657 13256 leader_election.cc:341] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: RPC error from VoteRequest() call to peer ba89996893e44391a10a2fc1f2c2ada3 (master2:7051): Remote error: Not authorized: unauthorized access to method: RequestConsensusVote
> W0715 06:30:36.439787 13255 leader_election.cc:341] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: RPC error from VoteRequest() call to peer e8f90d84b4754a379ffedaa32b528fb4 (master1:7051): Remote error: Not authorized: unauthorized access to method: RequestConsensusVote
> I0715 06:30:36.439808 13255 leader_election.cc:310] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: Election decided. Result: candidate lost. Election summary: received 3 responses out of 3 voters: 1 yes votes; 2 no votes. yes voters: 81338568ef854b10ac0acac1d9eeeb6c; no voters: ba89996893e44391a10a2fc1f2c2ada3, e8f90d84b4754a379ffedaa32b528fb4
> I0715 06:30:36.439839 18042 raft_consensus.cc:2597] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Leader pre-election lost for term 5. Reason: could not achieve majority
> W0715 06:30:36.531898 13465 server_base.cc:587] Unauthorized access attempt to method kudu.consensus.ConsensusService.UpdateConsensus from {username='kudu'} at ip:port
>
> The master summary from ksck is:
>
> Master Summary
>                UUID               |        Address        |    Status
> ----------------------------------+-----------------------+--------------
>  ba89996893e44391a10a2fc1f2c2ada3 | master1               | HEALTHY
>  e8f90d84b4754a379ffedaa32b528fb4 | master2               | HEALTHY
>  81338568ef854b10ac0acac1d9eeeb6c | master3               | UNAUTHORIZED
>
> Error from master3: Remote error: could not fetch consensus info from master: Not authorized: unauthorized access to method: GetConsensusState (UNAUTHORIZED)
>
> All reported replicas are:
> A = ba89996893e44391a10a2fc1f2c2ada3
> B = e8f90d84b4754a379ffedaa32b528fb4
> C = 81338568ef854b10ac0acac1d9eeeb6c
>
> The consensus matrix is:
>  Config source |        Replicas        | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
>  A             | A*  B   C              | 4            | -1           | Yes
>  B             | A*  B   C              | 4            | -1           | Yes
>  C             | [config not available] |              |              |
>
> What can I do if these three masters can't achieve consensus forever?
> Is it safe to delete --fs_data_dirs / --fs_metadata_dir / --fs_wal_dir for the crashed master in order to get it back online without any data loss?
>
> Thanks
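And in case you do eventually have to rebuild master3's data manually (only after the authorization problem is fixed, and with the other two masters healthy), the general shape of the dead-master recovery is: stop master3, move its old WAL/data/metadata directories aside as a backup, re-initialize its filesystem with the same UUID, and copy the master's system catalog tablet (00000000000000000000000000000000) from the current leader master. The commands below are only a sketch, with placeholder paths and the UUID taken from your logs; please check them against the documentation for your Kudu version before running anything:

    # on master3, with the kudu-master process stopped;
    # <wal_dir>/<data_dirs> are placeholders for your actual directories
    mv <wal_dir> <wal_dir>.bak
    mv <data_dirs> <data_dirs>.bak

    # re-create the filesystem layout, reusing master3's existing UUID
    # (add --fs_metadata_dir as well if you configure one)
    kudu fs format --fs_wal_dir=<wal_dir> --fs_data_dirs=<data_dirs> \
        --uuid=81338568ef854b10ac0acac1d9eeeb6c

    # copy the system catalog tablet from the current leader master
    kudu local_replica copy_from_remote 00000000000000000000000000000000 \
        <leader_master_host>:7051 --fs_wal_dir=<wal_dir> --fs_data_dirs=<data_dirs>

    # then start kudu-master again and verify with 'kudu cluster ksck'

But again, I'd first sort out which user each master runs as; that alone may let master3 rejoin without rebuilding anything.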