[ https://issues.apache.org/jira/browse/KUDU-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823558#comment-15823558 ]
zhangsong commented on KUDU-1579:
---------------------------------

After experiencing several node failure cases (using kudu-tserver revision b906affcdee3ec814c9e96d35fea715fdbb4c330-dirty), I found these two facts:

1. When multiple kudu-tserver nodes crash at around the same time (not exactly simultaneously; let's say 5 Kudu nodes), there will be failed tablets. The reasons for the failed tablets should be those described in KUDU-1449. Also, in the kudu-master UI I can see a lot of addServer/removeServer tasks hanging, and there is no sign that they will recover automatically.

2. When facing multiple node crashes, if I stop kudu-master until the whole cluster is stable (no more node crashes) and then restart kudu-master, no failed tablets are found after all the crashed kudu-tserver nodes have been recovered.

So for my case, it seems kudu-master should freeze for some time when multiple nodes crash at around the same time (e.g. within some period of time). "Freeze" here means it stops servicing addServer/removeServer RPCs.

Just some thoughts for today; I may complete this later.

> into "safe mode" when large number of node crash
> --------------------------------------------------
>
>                 Key: KUDU-1579
>                 URL: https://issues.apache.org/jira/browse/KUDU-1579
>             Project: Kudu
>          Issue Type: New Feature
>            Reporter: zhangsong
>
> Currently, replication happens whenever a node crashes.
> However, when a large number of nodes crash, this leads to a replication storm,
> which will cause a mess and data loss.
> Replication should be prudent, and the cluster should go into a "safe mode" in
> the node crash case described above.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)