[jira] [Commented] (KUDU-3325) When wal is deleted, fault recovery and load balancing are abnormal

Andrew Wong (Jira) Mon, 11 Oct 2021 13:55:13 -0700


    [ 
https://issues.apache.org/jira/browse/KUDU-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427361#comment-17427361
 ]


Andrew Wong commented on KUDU-3325:
-----------------------------------

I'm curious -- why was the WAL deleted in the first place? In general, Kudu 
never expects files to be deleted out from underneath it. Was this caused by 
a power failure? A disk loss? I think the best route forward would be to 
treat the tablet as failed, and re-replicate from another replica if available.

> When wal is deleted, fault recovery and load balancing are abnormal
> -------------------------------------------------------------------
>
>                 Key: KUDU-3325
>                 URL: https://issues.apache.org/jira/browse/KUDU-3325
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>            Reporter: yejiabao_h
>            Priority: Major
>         Attachments: image-2021-10-06-15-36-40-996.png, 
> image-2021-10-06-15-36-53-813.png, image-2021-10-06-15-37-09-520.png, 
> image-2021-10-06-15-37-24-776.png, image-2021-10-06-15-37-42-533.png, 
> image-2021-10-06-15-37-54-782.png, image-2021-10-06-15-38-06-575.png, 
> image-2021-10-06-15-38-17-388.png, image-2021-10-06-15-38-29-176.png, 
> image-2021-10-06-15-38-39-852.png, image-2021-10-06-15-38-53-343.png, 
> image-2021-10-06-15-39-03-296.png, image-2021-10-06-19-23-51-769.png
>
>
> h3. 1. Use kudu tablet leader_step_down to create multiple WAL messages
> ./kudu tablet leader_step_down  $MASTER_IP   1299f5a939d2453c83104a6db0cae3e7 
> h4. wal
> !image-2021-10-06-15-36-40-996.png!
> h4. cmeta
> !image-2021-10-06-15-36-53-813.png!
> h3. 2. Stop one of the tservers to start tablet recovery, so that the 
> opid_index is flushed to cmeta
> !image-2021-10-06-15-37-09-520.png!
> h4. wal
> !image-2021-10-06-15-37-24-776.png!
> h4. cmeta
> !image-2021-10-06-15-37-42-533.png!
> h3. 3. Stop all tservers and delete the tablet WAL
> !image-2021-10-06-15-37-54-782.png!
> h3. 4. Start all tservers
> The index in the WAL restarts counting from 1, but the opid_index 
> recorded in cmeta is still 20, the value from before the WAL was deleted.
>  
> h4. wal
> !image-2021-10-06-15-38-06-575.png!
>  
> h4. cmeta
> !image-2021-10-06-15-38-17-388.png!
>  
> h3. 5. Stop a tserver to trigger fault recovery
> !image-2021-10-06-15-38-29-176.png!
> When the leader recovers a replica and the master requests a Raft config 
> change to add the new replica, the change is ignored on the leader replica 
> because the op index is smaller than the opid_index in cmeta.
>  
> h3. 6. Delete all WALs
> !image-2021-10-06-15-38-39-852.png!
> h3. 7. Run kudu cluster rebalance
> ./kudu cluster rebalance $MASTER_IP
> !image-2021-10-06-15-38-53-343.png!
> !image-2021-10-06-15-39-03-296.png!
> The rebalance also fails when changing the Raft config.
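
The index mismatch described in steps 4 and 5 above can be sketched as a toy model. This is not Kudu's code: the class, the function names, and the acceptance rule (a config change only counts as newer if its index exceeds the opid_index stored in cmeta) are assumptions made for illustration. Only the values 1 and 20 come from the report; the "healthy" WAL index is made up for contrast.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    """Hypothetical model of one replica; not Kudu's actual data structures."""
    cmeta_opid_index: int  # index of the op that committed the current Raft config
    wal_last_index: int    # index of the last entry present in the WAL


def next_config_change_index(state: ReplicaState) -> int:
    # Assumption: a config change is appended to the WAL like any other op,
    # so it receives the next WAL index.
    return state.wal_last_index + 1


def accepts_config_change(state: ReplicaState) -> bool:
    # Assumed rule: the new config is only treated as newer if its index
    # is greater than the opid_index recorded in cmeta.
    return next_config_change_index(state) > state.cmeta_opid_index


# Before the WAL is deleted (WAL index chosen for illustration).
healthy = ReplicaState(cmeta_opid_index=20, wal_last_index=20)
# After step 4: the WAL restarts from 1 while cmeta still records 20.
after_wal_delete = ReplicaState(cmeta_opid_index=20, wal_last_index=1)

print(accepts_config_change(healthy))
print(accepts_config_change(after_wal_delete))
```

Under these assumptions the replica with the recreated WAL keeps rejecting config changes until its WAL index passes 20, which would match the stalled recovery and rebalance in steps 5 and 7.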



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
