[ 
https://issues.apache.org/jira/browse/KUDU-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823558#comment-15823558
 ] 

zhangsong commented on KUDU-1579:
---------------------------------

After experiencing several node failure cases (using kudu-tserver revision 
b906affcdee3ec814c9e96d35fea715fdbb4c330-dirty), I found these two facts:
1. When multiple kudu-tserver nodes (say, 5) crash at around the same time 
(not necessarily at exactly the same moment), some tablets end up failed. The 
causes of the failed tablets should be those described in KUDU-1449. Also, in 
the kudu-master UI I can see a lot of addServer/removeServer tasks hanging, 
and there is no sign that they will recover automatically.
2. When multiple nodes crash, if I stop kudu-master until the whole cluster is 
stable (no more node crashes) and then restart kudu-master, no failed tablets 
are found after all crashed kudu-tserver nodes have been recovered.

So in my case, it seems kudu-master should freeze for some time when multiple 
nodes crash at around the same time (e.g. within some period of time). 
"Freeze" here means that it stops servicing addServer/removeServer RPCs. 
These are just some thoughts for today; I may complete this later.
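
The freeze idea above could be sketched roughly as follows. This is only an 
illustration in Python, not Kudu code: the class SafeModeGate, its methods, 
and the threshold and window values are all hypothetical. It counts distinct 
tablet servers that crashed within a sliding time window and refuses 
addServer/removeServer work while that count is at or above a threshold.

```python
from collections import deque


class SafeModeGate:
    """Hypothetical gate a master could consult before re-replicating."""

    def __init__(self, crash_threshold=3, window_seconds=300.0):
        self.crash_threshold = crash_threshold  # assumed value, not from Kudu
        self.window_seconds = window_seconds    # assumed value, not from Kudu
        self._crashes = deque()                 # (timestamp, server_id) pairs

    def record_crash(self, server_id, now):
        """Note that a tablet server was declared dead at time `now`."""
        self._crashes.append((now, server_id))

    def _expire(self, now):
        # Drop crash records that have fallen out of the sliding window.
        while self._crashes and now - self._crashes[0][0] > self.window_seconds:
            self._crashes.popleft()

    def is_frozen(self, now):
        """True while enough distinct servers crashed within the window."""
        self._expire(now)
        distinct = {server_id for _, server_id in self._crashes}
        return len(distinct) >= self.crash_threshold

    def allow_config_change(self, now):
        """Whether addServer/removeServer work may be issued right now."""
        return not self.is_frozen(now)


gate = SafeModeGate(crash_threshold=3, window_seconds=300.0)
gate.record_crash("ts-1", now=0.0)
single_crash_allowed = gate.allow_config_change(now=1.0)   # one crash only
gate.record_crash("ts-2", now=10.0)
gate.record_crash("ts-3", now=20.0)
burst_allowed = gate.allow_config_change(now=21.0)         # three in window
quiet_allowed = gate.allow_config_change(now=400.0)        # window has passed
```

With these assumed settings, a single crash is still handled by normal 
re-replication, a burst of three crashes freezes config changes, and the gate 
reopens once the window passes with no new crashes.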

> into "safe mode"   when large number of node crash
> --------------------------------------------------
>
>                 Key: KUDU-1579
>                 URL: https://issues.apache.org/jira/browse/KUDU-1579
>             Project: Kudu
>          Issue Type: New Feature
>            Reporter: zhangsong
>
> Currently, replication happens whenever a node crashes.
> However, when a large number of nodes crash, this leads to a replication
> storm, which causes a mess and data loss.
> Replication should be prudent, and the cluster should enter a "safe mode" in 
> the node crash case described above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
