Hi Ray,

It seems the problem is that the "kudu" user is not authorized to call UpdateConsensus on the other masters. Kudu only allows its internal ConsensusService RPCs from the service user (the OS user the daemon runs as) and from configured superusers, so a mismatch between the users the three masters run as would produce exactly these "Not authorized" errors. What user are the other two masters running as?
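For example, you could compare the effective user of the kudu-master process on each host, and check whether any ACL flags differ between the masters. This is only a rough sketch; the process name and the gflagfile path depend on how Kudu is installed and started on your hosts (packages, systemd, Cloudera Manager, etc.):

    # show which OS user each kudu-master process runs as (run on every master host)
    ps -o user=,cmd= -C kudu-master

    # check whether the masters disagree on any ACL-related flags
    # (path is a guess for a package install; adjust for your setup)
    grep -E 'superuser_acl|user_acl' /etc/kudu/conf/master.gflagfile

If the masters really do need to run as different users, the extra user would have to be whitelisted on all of them (for example via --superuser_acl) and the masters restarted.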
I wouldn't recommend wiping the master. It most likely wouldn't solve the problem, and Kudu can't automatically recover from a deleted master; you would need to recreate it manually (a rough sketch of that procedure is below, after your quoted message).

Attila

On Wed, Jul 15, 2020 at 06:41:34AM +0000, Ray Liu (rayliu) wrote:
> We have a Kudu cluster with 3 masters and 9 tablet servers.
> When we try to drop a table with more than a thousand tablets, the leader master crashed.
> The last logs for the crashed master are a bunch of
> W0715 04:00:57.330158 30337 catalog_manager.cc:3485] TS cd17b92888a84d39b2adcad1ca947037 (hdsj1kud005.webex.com:7050): delete failed for tablet 4250e813a29e4ca7a2633c6015c5530d because the tablet was not found. No further retry: Not found: Tablet not found: 4250e813a29e4ca7a2633c6015c5530d
>
> Before these delete-failed logs, there are many:
> W0715 03:59:40.047675 30336 connection.cc:361] RPC call timeout handler was delayed by 11.8487s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 3s for a response.
>
> So, when this leader master crashed, a new leader master was elected from the remaining two masters.
> But when I try to restart the crashed master, it has been stuck forever (2 hours so far).
> The logs are a repetition of these:
>
> I0715 06:30:36.438797 18042 raft_consensus.cc:465] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
> I0715 06:30:36.438868 18042 raft_consensus.cc:487] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "e8f90d84b4754a379ffedaa32b528fb4" member_type: VOTER last_known_addr { host: "master1" port: 7051 } } peers { permanent_uuid: "ba89996893e44391a10a2fc1f2c2ada3" member_type: VOTER last_known_addr { host: "master2" port: 7051 } } peers { permanent_uuid: "81338568ef854b10ac0acac1d9eeeb6c" member_type: VOTER last_known_addr { host: "master3" port: 7051 } }
> I0715 06:30:36.439072 18042 leader_election.cc:296] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: Requested pre-vote from peers e8f90d84b4754a379ffedaa32b528fb4 (master1:7051), ba89996893e44391a10a2fc1f2c2ada3 (master2:7051)
> W0715 06:30:36.439657 13256 leader_election.cc:341] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: RPC error from VoteRequest() call to peer ba89996893e44391a10a2fc1f2c2ada3 (master2:7051): Remote error: Not authorized: unauthorized access to method: RequestConsensusVote
> W0715 06:30:36.439787 13255 leader_election.cc:341] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: RPC error from VoteRequest() call to peer e8f90d84b4754a379ffedaa32b528fb4 (master1:7051): Remote error: Not authorized: unauthorized access to method: RequestConsensusVote
> I0715 06:30:36.439808 13255 leader_election.cc:310] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: Election decided. Result: candidate lost. Election summary: received 3 responses out of 3 voters: 1 yes votes; 2 no votes. yes voters: 81338568ef854b10ac0acac1d9eeeb6c; no voters: ba89996893e44391a10a2fc1f2c2ada3, e8f90d84b4754a379ffedaa32b528fb4
> I0715 06:30:36.439839 18042 raft_consensus.cc:2597] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Leader pre-election lost for term 5. Reason: could not achieve majority
> W0715 06:30:36.531898 13465 server_base.cc:587] Unauthorized access attempt to method kudu.consensus.ConsensusService.UpdateConsensus from {username='kudu'} at ip:port
>
> The master summary from ksck is:
>
> Master Summary
>                UUID               |        Address        |    Status
> ----------------------------------+-----------------------+--------------
>  ba89996893e44391a10a2fc1f2c2ada3 | master1               | HEALTHY
>  e8f90d84b4754a379ffedaa32b528fb4 | master2               | HEALTHY
>  81338568ef854b10ac0acac1d9eeeb6c | master3               | UNAUTHORIZED
>
> Error from master3: Remote error: could not fetch consensus info from master: Not authorized: unauthorized access to method: GetConsensusState (UNAUTHORIZED)
>
> All reported replicas are:
> A = ba89996893e44391a10a2fc1f2c2ada3
> B = e8f90d84b4754a379ffedaa32b528fb4
> C = 81338568ef854b10ac0acac1d9eeeb6c
>
> The consensus matrix is:
>  Config source |        Replicas        | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
>  A             | A*  B   C              | 4            | -1           | Yes
>  B             | A*  B   C              | 4            | -1           | Yes
>  C             | [config not available] |              |              |
>
> What can I do if these three masters can't achieve consensus forever?
> Is it safe to delete --fs_data_dirs / --fs_metadata_dir / --fs_wal_dir for the crashed master in order to get it back online without any data loss?
>
> Thanks
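And in case you do eventually have to rebuild master3's data manually (only after the authorization problem is fixed, and with the other two masters healthy), the general shape of the dead-master recovery is: stop master3, move its old WAL/data/metadata directories aside as a backup, re-initialize its filesystem with the same UUID, and copy the master's system catalog tablet (00000000000000000000000000000000) from the current leader master. The commands below are only a sketch, with placeholder paths and the UUID taken from your logs; please check them against the documentation for your Kudu version before running anything:

    # on master3, with the kudu-master process stopped;
    # <wal_dir>/<data_dirs> are placeholders for your actual directories
    mv <wal_dir> <wal_dir>.bak
    mv <data_dirs> <data_dirs>.bak

    # re-create the filesystem layout, reusing master3's existing UUID
    # (add --fs_metadata_dir as well if you configure one)
    kudu fs format --fs_wal_dir=<wal_dir> --fs_data_dirs=<data_dirs> \
        --uuid=81338568ef854b10ac0acac1d9eeeb6c

    # copy the system catalog tablet from the current leader master
    kudu local_replica copy_from_remote 00000000000000000000000000000000 \
        <leader_master_host>:7051 --fs_wal_dir=<wal_dir> --fs_data_dirs=<data_dirs>

    # then start kudu-master again and verify with 'kudu cluster ksck'

But again, I'd first sort out which user each master runs as; that alone may let master3 rejoin without rebuilding anything.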