[jira] [Updated] (RATIS-982) RaftServerImpl failed to change from RUNNING to RUNNING

2020-06-19 Thread runzhiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated RATIS-982:
-
Description: 
This happens in test, but it maybe also happen in production.

For example, leader is s3 and follower is s4.
1. kill s4, and restart s4.

{code:java}
2020-06-19 07:03:18,095 [Thread-6194] INFO  ratis.MiniRaftCluster 
(MiniRaftCluster.java:killServer(458)) - killServer s4
2020-06-19 07:03:18,095 [Thread-6194] INFO  ratis.MiniRaftCluster 
(MiniRaftCluster.java:newRaftServer(330)) - newRaftServer: s4, 
group-5BD7E8A01610:[s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, 
s1:0.0.0.0:33783, s2:0.0.0.0:40473], format? false
{code}

2. s4 start and set configuration from storage at 
[setRaftConf(raftConf.getLogEntryIndex(), raftConf) 
|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L170]
 and s4 will change to RUNNING at 
[lifeCycle.transition(RUNNING)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L213]


{code:java}
2020-06-19 07:03:18,127 [pool-16-thread-1] INFO  impl.RaftServerImpl 
(ServerState.java:setRaftConf(356)) - s4@group-5BD7E8A01610: set configuration 
0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, 
s2:0.0.0.0:40473], old=null at 0
2020-06-19 07:03:18,153 [Thread-6194] INFO  impl.RaftServerImpl 
(RaftServerImpl.java:start(185)) - s4@group-5BD7E8A01610: start as a follower, 
conf=0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, 
s1:0.0.0.0:33783, s2:0.0.0.0:40473], old=null
2020-06-19 07:03:18,153 [Thread-6194] INFO  impl.RaftServerImpl 
(RaftServerImpl.java:setRole(174)) - s4@group-5BD7E8A01610: changes role from   
   null to FOLLOWER at term 1 for startAsFollower
{code}


3. s3 send append entry request to s4, and s4 change to RUNNING at 
[lifeCycle.compareAndTransition(STARTING, 
RUNNING)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1003]


{code:java}
2020-06-19 07:03:18,162 [nioEventLoopGroup-59-1] DEBUG impl.RaftServerImpl 
(RaftServerImpl.java:logAppendEntries(918)) - s4@group-5BD7E8A01610: receive 
appendEntries(s3, 1, (t:1, i:0), 0, false, commits[s3:c0, s4:c0, s0:c0, s1:c0, 
s2:c0], entries: (t:1, i:1), STATEMACHINELOGENTRY, client-9414EC4E73DA, cid=3000
{code}



4. If change to RUNNING in step3 happens before step2, then step2 will throw 
exception.

{code:java}

2020-06-19 07:03:18,169 [Thread-6194] INFO  impl.RoleInfo 
(RoleInfo.java:updateAndGet(143)) - s4: start FollowerState
2020-06-19 07:03:18,174 [Thread-6194] ERROR netty.TestRaftWithNetty 
(ExitUtils.java:terminate(133)) - Terminating with exit status -1: Failed to 
kill/restart server: s4
2020-06-19T07:03:18.1918474Z java.lang.IllegalStateException: ILLEGAL 
TRANSITION: In s4, RUNNING -> RUNNING
2020-06-19T07:03:18.1918899Zat 
org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:63)
2020-06-19T07:03:18.1919240Zat 
org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:115)
2020-06-19T07:03:18.1919558Zat 
org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:155)
2020-06-19T07:03:18.1919878Zat 
org.apache.ratis.server.impl.RaftServerImpl.startAsFollower(RaftServerImpl.java:214)
2020-06-19T07:03:18.1920206Zat 
org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:186)
2020-06-19T07:03:18.1920520Zat 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
2020-06-19T07:03:18.1920839Zat 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
2020-06-19T07:03:18.1921330Zat 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
2020-06-19T07:03:18.1921639Zat 
java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
2020-06-19T07:03:18.1921951Zat 
java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
2020-06-19T07:03:18.1922261Zat 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
2020-06-19T07:03:18.1922575Zat 
java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
2020-06-19T07:03:18.1922885Zat 
java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
2020-06-19T07:03:18.1925464Zat 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
2020-06-19T07:03:18.1940816Zat 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
2020-06-19T07:03:18.1953283Zat 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
2020-06-19T07:03:18.1967610Zat 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
2020-06-19T07:03:18.1980549Zat 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
2020-06-19T07:03:18.1991620Zat 

[jira] [Updated] (RATIS-982) RaftServerImpl failed to change from RUNNING to RUNNING

2020-06-19 Thread runzhiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated RATIS-982:
-
Description: 
This happens in test, but it maybe also happen in production.

For example, leader is s3 and follower is s4.
1. kill s4, and restart s4.

{code:java}
2020-06-19T07:03:18.1000860Z 2020-06-19 07:03:18,095 [Thread-6194] INFO  
ratis.MiniRaftCluster (MiniRaftCluster.java:killServer(458)) - killServer s4
2020-06-19T07:03:18.1001826Z 2020-06-19 07:03:18,095 [Thread-6194] INFO  
ratis.MiniRaftCluster (MiniRaftCluster.java:newRaftServer(330)) - 
newRaftServer: s4, group-5BD7E8A01610:[s3:0.0.0.0:43375, s4:0.0.0.0:33719, 
s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], format? false
{code}

2. s4 start and set configuration from storage at 
[setRaftConf(raftConf.getLogEntryIndex(), raftConf) 
|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L170]
 and s4 will change to RUNNING at 
[lifeCycle.transition(RUNNING)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L213]


{code:java}
2020-06-19T07:03:18.1345896Z 2020-06-19 07:03:18,127 [pool-16-thread-1] INFO  
impl.RaftServerImpl (ServerState.java:setRaftConf(356)) - 
s4@group-5BD7E8A01610: set configuration 0: [s3:0.0.0.0:43375, 
s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], 
old=null at 0
{code}


3. s3 send append entry request to s4, and s4 change to RUNNING at 
[lifeCycle.compareAndTransition(STARTING, 
RUNNING)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1003]


{code:java}
2020-06-19T07:03:18.1773893Z 2020-06-19 07:03:18,162 [nioEventLoopGroup-59-1] 
DEBUG impl.RaftServerImpl (RaftServerImpl.java:logAppendEntries(918)) - 
s4@group-5BD7E8A01610: receive appendEntries(s3, 1, (t:1, i:0), 0, false, 
commits[s3:c0, s4:c0, s0:c0, s1:c0, s2:c0], entries: (t:1, i:1), 
STATEMACHINELOGENTRY, client-9414EC4E73DA, cid=3000
{code}



4. If change to RUNNING in step3 happens before step2, then step2 will throw 
exception.

{code:java}

2020-06-19T07:03:18.1821952Z 2020-06-19 07:03:18,169 [Thread-6194] INFO  
impl.RoleInfo (RoleInfo.java:updateAndGet(143)) - s4: start FollowerState
2020-06-19T07:03:18.1901820Z 2020-06-19 07:03:18,174 [Thread-6194] ERROR 
netty.TestRaftWithNetty (ExitUtils.java:terminate(133)) - Terminating with exit 
status -1: Failed to kill/restart server: s4
2020-06-19T07:03:18.1918474Z java.lang.IllegalStateException: ILLEGAL 
TRANSITION: In s4, RUNNING -> RUNNING
2020-06-19T07:03:18.1918899Zat 
org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:63)
2020-06-19T07:03:18.1919240Zat 
org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:115)
2020-06-19T07:03:18.1919558Zat 
org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:155)
2020-06-19T07:03:18.1919878Zat 
org.apache.ratis.server.impl.RaftServerImpl.startAsFollower(RaftServerImpl.java:214)
2020-06-19T07:03:18.1920206Zat 
org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:186)
2020-06-19T07:03:18.1920520Zat 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
2020-06-19T07:03:18.1920839Zat 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
2020-06-19T07:03:18.1921330Zat 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
2020-06-19T07:03:18.1921639Zat 
java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
2020-06-19T07:03:18.1921951Zat 
java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
2020-06-19T07:03:18.1922261Zat 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
2020-06-19T07:03:18.1922575Zat 
java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
2020-06-19T07:03:18.1922885Zat 
java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
2020-06-19T07:03:18.1925464Zat 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
2020-06-19T07:03:18.1940816Zat 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
2020-06-19T07:03:18.1953283Zat 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
2020-06-19T07:03:18.1967610Zat 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
2020-06-19T07:03:18.1980549Zat 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
2020-06-19T07:03:18.1991620Zat 
org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:301)
2020-06-19T07:03:18.1991958Zat 
org.apache.ratis.MiniRaftCluster.restartServer(MiniRaftCluster.java:312)
2020-06-19T07:03:18.1992275Zat 

[jira] [Updated] (RATIS-982) RaftServerImpl failed to change from RUNNING to RUNNING

2020-06-19 Thread runzhiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated RATIS-982:
-
Description: 
This happens in test, but it maybe also happen in production.

For example, leader is s3 and follower is s4.
1. kill s4, and restart s4.

{code:java}
2020-06-19T07:03:18.1000860Z 2020-06-19 07:03:18,095 [Thread-6194] INFO  
ratis.MiniRaftCluster (MiniRaftCluster.java:killServer(458)) - killServer s4
2020-06-19T07:03:18.1001826Z 2020-06-19 07:03:18,095 [Thread-6194] INFO  
ratis.MiniRaftCluster (MiniRaftCluster.java:newRaftServer(330)) - 
newRaftServer: s4, group-5BD7E8A01610:[s3:0.0.0.0:43375, s4:0.0.0.0:33719, 
s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], format? false
{code}

2. s4 set configuration from storage at 
setRaftConf(raftConf.getLogEntryIndex(), raftConf) 
2020-06-19T07:03:18.1345896Z 2020-06-19 07:03:18,127 [pool-16-thread-1] INFO  
impl.RaftServerImpl (ServerState.java:setRaftConf(356)) - 
s4@group-5BD7E8A01610: set configuration 0: [s3:0.0.0.0:43375, 
s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], 
old=null at 0



> RaftServerImpl failed to change from RUNNING to RUNNING
> ---
>
> Key: RATIS-982
> URL: https://issues.apache.org/jira/browse/RATIS-982
> Project: Ratis
>  Issue Type: Bug
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
>
> This happens in test, but it maybe also happen in production.
> For example, leader is s3 and follower is s4.
> 1. kill s4, and restart s4.
> {code:java}
> 2020-06-19T07:03:18.1000860Z 2020-06-19 07:03:18,095 [Thread-6194] INFO  
> ratis.MiniRaftCluster (MiniRaftCluster.java:killServer(458)) - killServer s4
> 2020-06-19T07:03:18.1001826Z 2020-06-19 07:03:18,095 [Thread-6194] INFO  
> ratis.MiniRaftCluster (MiniRaftCluster.java:newRaftServer(330)) - 
> newRaftServer: s4, group-5BD7E8A01610:[s3:0.0.0.0:43375, s4:0.0.0.0:33719, 
> s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], format? false
> {code}
> 2. s4 set configuration from storage at 
> setRaftConf(raftConf.getLogEntryIndex(), raftConf) 
> 2020-06-19T07:03:18.1345896Z 2020-06-19 07:03:18,127 [pool-16-thread-1] INFO  
> impl.RaftServerImpl (ServerState.java:setRaftConf(356)) - 
> s4@group-5BD7E8A01610: set configuration 0: [s3:0.0.0.0:43375, 
> s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], 
> old=null at 0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)