[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135815#comment-16135815
 ] 

Tsz Wo Nicholas Sze commented on RATIS-100:
---

A large part of the latest patch is just code refactoring and improved log 
messages.  Let me separate them into another JIRA so that it is easier to review.

> Test multiple raft groups with a state machine
> --
>
> Key: RATIS-100
> URL: https://issues.apache.org/jira/browse/RATIS-100
> Project: Ratis
>  Issue Type: Test
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
> Attachments: r100_20170804.patch, r100_20170809b.patch, 
> r100_20170809c.patch, r100_20170809.patch, r100_20170810.patch, 
> r100_20170811.patch, r100_no_leader_case.log
>
>
> We propose to add a test similar to 
> ReinitializationBaseTest.runTestReinitializeMultiGroups(..) with a state 
> machine so that it can test if the states are recorded correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-10 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122890#comment-16122890
 ] 

Jing Zhao commented on RATIS-100:
-

Thanks for updating the patch, Nicholas! I will review it.



[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120873#comment-16120873
 ] 

Tsz Wo Nicholas Sze commented on RATIS-100:
---

> ... a server should not respond to another peer if it is not in current 
> conf. ...

Instead of using conf, it is better to check if they are in the same group.

r100_20170809c.patch:
- In RaftServerImpl.appendEntries, check whether the leader and the server are in 
the same group.
- In LeaderState.checkNewPeers(),
-* fail stagingState only if #no-progress-servers > senders.size()/2;
-* applyOldNewConf if caughtUp + attendVote > senders.size()/2.
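
The two checkNewPeers() conditions above reduce to integer comparisons against 
half the number of senders.  A minimal sketch, assuming plain counts as inputs; 
the class StagingDecision, the Action enum, the parameter names and the order in 
which the two conditions are tested are hypothetical and not taken from the patch:

```java
// Hypothetical sketch of the two thresholds; not the actual LeaderState code.
final class StagingDecision {
  enum Action { FAIL_STAGING, APPLY_OLD_NEW_CONF, KEEP_WAITING }

  /**
   * @param noProgress number of servers making no progress
   * @param caughtUp   number of servers whose logs have caught up
   * @param attendVote number of servers that attended a vote
   * @param senders    the value written as senders.size() in the comment
   */
  static Action decide(int noProgress, int caughtUp, int attendVote, int senders) {
    // Java int division: senders / 2 rounds down, so ">" means a strict majority.
    if (noProgress > senders / 2) {
      return Action.FAIL_STAGING;
    }
    if (caughtUp + attendVote > senders / 2) {
      return Action.APPLY_OLD_NEW_CONF;
    }
    return Action.KEEP_WAITING;
  }

  public static void main(String[] args) {
    System.out.println(decide(7, 2, 1, 12)); // 7 > 6
    System.out.println(decide(2, 5, 2, 12)); // 5 + 2 > 6
    System.out.println(decide(6, 3, 3, 12)); // neither threshold is exceeded
  }
}
```

With 12 senders, staging fails only when at least 7 servers make no progress, 
so a few slow servers alone do not abort the configuration change.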



[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120758#comment-16120758
 ] 

Tsz Wo Nicholas Sze commented on RATIS-100:
---

I found the bug: a server should not respond to another peer if it is not in 
current conf.  Will update the patch.



[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120697#comment-16120697
 ] 

Tsz Wo Nicholas Sze commented on RATIS-100:
---

TestMultiRaftGroup:
# start 5 three-node groups (i.e. 15 nodes in total)
# close 4 of the 5 three-node groups
# update all 15 nodes to 1 group by
#* calling setConf on the chosen group
#* calling reinitialize on each server in the other groups
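
The grouping and the final update step above can be sketched without any Ratis 
classes.  This is an illustration only: peers are plain strings, the class 
GroupPartitionSketch and its methods are hypothetical, and the "close" step is 
not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the test layout; peers are plain strings, not RaftPeer.
final class GroupPartitionSketch {
  /** Splits the peers into consecutive groups of groupSize. */
  static List<List<String>> partition(List<String> peers, int groupSize) {
    final List<List<String>> groups = new ArrayList<>();
    for (int i = 0; i < peers.size(); i += groupSize) {
      final int end = Math.min(i + groupSize, peers.size());
      groups.add(new ArrayList<>(peers.subList(i, end)));
    }
    return groups;
  }

  /** Lists the calls needed to move every peer into the chosen group. */
  static List<String> plan(List<List<String>> groups, int chosen) {
    final List<String> calls = new ArrayList<>();
    for (int i = 0; i < groups.size(); i++) {
      if (i == chosen) {
        // One setConf call to the chosen group.
        calls.add("setConf -> group " + i);
      } else {
        // One reinitialize call per server in every other group.
        for (String peer : groups.get(i)) {
          calls.add("reinitialize -> " + peer);
        }
      }
    }
    return calls;
  }

  public static void main(String[] args) {
    final List<String> peers = new ArrayList<>();
    for (int i = 0; i < 15; i++) {
      peers.add("s" + i);
    }
    final List<List<String>> groups = partition(peers, 3);
    for (String call : plan(groups, 0)) {
      System.out.println(call);
    }
  }
}
```

For 5 three-node groups this yields 1 setConf call and 12 reinitialize calls.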



[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120430#comment-16120430
 ] 

Tsz Wo Nicholas Sze commented on RATIS-100:
---

> If more than a quorum of nodes in the new conf are in the STARTING state and 
> do not join normal protocol, then the (old, new) conf entry cannot get 
> committed. ...

Yes, the setConf entry cannot get committed.  However, the old nodes will still 
keep retrying leader election with the new nodes.

Here is the deadlock:
- Old nodes: keep retrying leader election to get a majority from old + new nodes.
- New nodes: stay in STARTING state and refuse to vote.

It seems that the new nodes should vote even if they are in STARTING state.  Would 
it work?
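
The arithmetic behind the deadlock can be sketched as follows.  This is a 
simplified model, not Ratis code: it assumes a candidate needs a strict majority 
of old + new nodes combined, that every old node grants its vote, and that every 
new node is still in STARTING state.  The class StartingVoteSketch and its 
methods are hypothetical.

```java
// Hypothetical model of the deadlock; not Ratis code.
final class StartingVoteSketch {
  /** Best case for a candidate: every old node votes for it. */
  static int votesGranted(int oldNodes, int newNodes, boolean startingNodesVote) {
    return startingNodesVote ? oldNodes + newNodes : oldNodes;
  }

  /** Simplification: a candidate needs a strict majority of old + new nodes. */
  static boolean canElectLeader(int oldNodes, int newNodes, boolean startingNodesVote) {
    final int total = oldNodes + newNodes;
    return votesGranted(oldNodes, newNodes, startingNodesVote) > total / 2;
  }

  public static void main(String[] args) {
    // 3 old nodes plus 12 new nodes that are all still in STARTING state.
    System.out.println(canElectLeader(3, 12, false)); // false: 3 votes, 8 needed
    System.out.println(canElectLeader(3, 12, true));  // true: 15 votes, 8 needed
  }
}
```

With 3 old and 12 new nodes, the old nodes can never reach the 8 votes needed 
on their own, which is why letting STARTING nodes vote would break the cycle in 
this model.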



[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120390#comment-16120390
 ] 

Tsz Wo Nicholas Sze commented on RATIS-100:
---

> ... Is it possible that no new leader can be elected because we are using the 
> reinitialize API for old peers? For these peers we should call 
> setConfiguration instead of reinitialize.

We are already using setConfiguration for old peers and reinitialize for new 
peers.
{code}
//ReinitializationBaseTest.runTestReinitializeMultiGroups

// update chosen group to use all the peers
final RaftPeer[] array = allPeers.toArray(RaftPeer.EMPTY_PEERS);
for(int i = 0; i < groups.length; i++) {
  LOG.info(i + ") update " + cluster.printServers(groups[i].getGroupId()));
  if (i == chosen) {
    try (final RaftClient client = cluster.createClient(null, groups[i])) {
      client.setConfiguration(array);
    }
  } else {
    for(RaftPeer p : groups[i].getPeers()) {
      try (final RaftClient client = cluster.createClient(p.getId(), groups[i])) {
        client.reinitialize(array, p.getId());
      }
    }
  }
}
{code}




[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-08 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119302#comment-16119302
 ] 

Jing Zhao commented on RATIS-100:
-

bq. One or more nodes may start a leader election. However, there could be many 
new nodes still in STARTING state that refuse to vote. Then, no new leader can 
be elected.

If more than a quorum of nodes in the new conf are in the STARTING state and do 
not join the normal protocol, then the (old, new) conf entry cannot get committed. 
In that case, the leader election will still depend on the old configuration. 
Is it possible that no new leader can be elected because we are using the 
{{reinitialize}} API for old peers? For these peers we should call 
{{setConfiguration}} instead of {{reinitialize}}.



[jira] [Commented] (RATIS-100) Test multiple raft groups with a state machine

2017-08-08 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119242#comment-16119242
 ] 

Tsz Wo Nicholas Sze commented on RATIS-100:
---

> ... a RaftServer may stay in STARTING state forever ...

Here is why:
- When a new node is started, it is in STARTING state to catch up on logs.  In 
STARTING state, it does not vote until a leader sends an appendEntries call.
- If a large number of new nodes are added, then the old leader steps down.  
One or more nodes may start a leader election.  However, there could be many new 
nodes still in STARTING state that refuse to vote.
- Then, no new leader can be elected.
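
The per-node behaviour described above can be modelled as a two-state machine.  
This is a sketch under stated assumptions, not the RaftServerImpl 
implementation: the class StartingNodeSketch and the RUNNING state name are 
hypothetical, and real vote handling also depends on terms and log contents, 
which are left out here.

```java
// Hypothetical per-node model; not the RaftServerImpl implementation.
final class StartingNodeSketch {
  enum State { STARTING, RUNNING }

  private State state = State.STARTING;

  /** A node in STARTING state refuses to vote. */
  boolean requestVote() {
    return state == State.RUNNING;
  }

  /** The first appendEntries call from a leader ends the STARTING state. */
  void appendEntries() {
    state = State.RUNNING;
  }

  State getState() {
    return state;
  }

  public static void main(String[] args) {
    final StartingNodeSketch node = new StartingNodeSketch();
    System.out.println(node.requestVote()); // false: still STARTING
    node.appendEntries();
    System.out.println(node.requestVote()); // true: a leader has contacted it
  }
}
```

Only appendEntries moves the node out of STARTING, and only a leader sends 
appendEntries, so a cluster without a leader leaves such nodes in STARTING 
indefinitely.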
