[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385588#comment-16385588 ]
Mukul Kumar Singh commented on RATIS-209: ----------------------------------------- Some more debug info, the following run was started with 3 nodes 172.26.32.228:9858, 172.26.32.230:9858, 172.26.32.231:9858. And the problem happens on 172.26.32.230 {code} 2018-03-04 17:25:35,713 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.230_9858 changes role from CANDIDATE to FOLLOWER at term 4 for changeToFollower 2018-03-04 17:25:35,882 INFO org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.230_9858: appendEntries completed 2018-03-04 17:25:36,462 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.230_9858: change Leader from null to 172.26.32.228_9858 at term 5 for appendEntries 2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908 2018-03-04 17:25:38,290 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10931 2018-03-04 17:25:38,398 ERROR org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit status 2: StateMachineUpdater-172.26.32.230_9858: the StateMachineUpdater hits Throwable java.lang.NullPointerException at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:265) at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1051) at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154) at java.lang.Thread.run(Thread.java:745) 2018-03-04 17:25:38,403 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: {code} on 172.26.32.228:9858, the following entries are seen {code} 2018-03-04 12:25:32,771 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.228_9858-RaftLogWorker index to:10906 2018-03-04 12:25:35,473 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.228_9858: change Leader from 172.26.32.228_9858 to null at term 4 for updateCurrentTerm 2018-03-04 12:25:35,474 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.228_9858 changes role from LEADER to FOLLOWER at term 4 for changeToFollower 2018-03-04 12:25:35,475 INFO org.apache.ratis.server.impl.PendingRequests: 172.26.32.228_9858 sends responses before shutting down PendingRequestsHandler 2018-03-04 12:25:35,842 INFO org.apache.ratis.server.impl.LogAppender: 172.26.32.228_9858 stops appending log entries to follower 172.26.32.231_9858(next=10931, match=10925, attendVote=true, lastRpcSendTi me=422478ms, lastRpcResponseTime=422910ms) 2018-03-04 12:25:35,883 INFO org.apache.ratis.server.impl.LogAppender: 172.26.32.228_9858 stops appending log entries to follower 172.26.32.230_9858(next=10911, match=10732, attendVote=true, lastRpcSendTi me=422580ms, lastRpcResponseTime=422541ms) 2018-03-04 12:25:36,423 INFO org.apache.ratis.server.impl.FollowerState: 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:944, electionTimeout:944ms 2018-03-04 12:25:36,424 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 4 for changeToCandidate 2018-03-04 12:25:36,439 INFO org.apache.ratis.server.impl.LeaderElection: 172.26.32.228_9858: begin an election in Term 5 2018-03-04 12:25:36,455 INFO org.apache.ratis.server.impl.LeaderElection: 172.26.32.228_9858: Election PASSED; received 1 response(s) [172.26.32.228_9858<-172.26.32.231_9858#0:OK-t5] and 0 exception(s); 1 {code} and on 172.26.32.231 {code} 2018-03-04 17:25:33,141 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10906 2018-03-04 17:25:35,499 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.231_9858: change Leader from 172.26.32.228_9858 to null at term 4 for updateCurrentTerm 2018-03-04 17:25:35,841 INFO org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.231_9858: appendEntries completed 2018-03-04 17:25:36,461 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.231_9858: change Leader from null to 172.26.32.228_9858 at term 5 for appendEntries 2018-03-04 17:25:36,686 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10931 {code} The only visible difference from the log file is the index which is being rolled over. Only on 172.26.32.230 the rolled over index is 10908. However on other nodes is 10906. {code} 2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908 {code} > StateMachine updater may miss writeLog Request after a new Leader is chosen. > ---------------------------------------------------------------------------- > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug > Affects Versions: 0.2.0-alpha > Reporter: Mukul Kumar Singh > Assignee: Mukul Kumar Singh > Priority: Major > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > /************************************************************ > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > ************************************************************/ > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)