[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385588#comment-16385588
 ] 

Mukul Kumar Singh commented on RATIS-209:
-----------------------------------------

Some more debug info, the following run was started with 3 nodes 
172.26.32.228:9858, 172.26.32.230:9858, 172.26.32.231:9858.

And the problem happens on 172.26.32.230
{code}
2018-03-04 17:25:35,713 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.230_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
changeToFollower
2018-03-04 17:25:35,882 INFO 
org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.230_9858: 
appendEntries completed
2018-03-04 17:25:36,462 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.230_9858: change Leader from null to 172.26.32.228_9858 at term 5 for 
appendEntries
2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908
2018-03-04 17:25:38,290 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10931
2018-03-04 17:25:38,398 ERROR org.apache.ratis.server.impl.StateMachineUpdater: 
Terminating with exit status 2: StateMachineUpdater-172.26.32.230_9858: the 
StateMachineUpdater hits Throwable
java.lang.NullPointerException
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:265)
        at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1051)
        at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154)
        at java.lang.Thread.run(Thread.java:745)
2018-03-04 17:25:38,403 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
SHUTDOWN_MSG: 
{code}

on 172.26.32.228:9858, the following entries are seen
{code}
2018-03-04 12:25:32,771 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.228_9858-RaftLogWorker index to:10906
2018-03-04 12:25:35,473 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.228_9858: change Leader from 172.26.32.228_9858 to null at term 4 for 
updateCurrentTerm
2018-03-04 12:25:35,474 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.228_9858 changes role from LEADER to FOLLOWER at term 4 for 
changeToFollower
2018-03-04 12:25:35,475 INFO org.apache.ratis.server.impl.PendingRequests: 
172.26.32.228_9858 sends responses before shutting down PendingRequestsHandler
2018-03-04 12:25:35,842 INFO org.apache.ratis.server.impl.LogAppender: 
172.26.32.228_9858 stops appending log entries to follower 
172.26.32.231_9858(next=10931, match=10925, attendVote=true, lastRpcSendTi
me=422478ms, lastRpcResponseTime=422910ms)
2018-03-04 12:25:35,883 INFO org.apache.ratis.server.impl.LogAppender: 
172.26.32.228_9858 stops appending log entries to follower 
172.26.32.230_9858(next=10911, match=10732, attendVote=true, lastRpcSendTi
me=422580ms, lastRpcResponseTime=422541ms)
2018-03-04 12:25:36,423 INFO org.apache.ratis.server.impl.FollowerState: 
172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:944, electionTimeout:944ms
2018-03-04 12:25:36,424 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 4 for 
changeToCandidate
2018-03-04 12:25:36,439 INFO org.apache.ratis.server.impl.LeaderElection: 
172.26.32.228_9858: begin an election in Term 5
2018-03-04 12:25:36,455 INFO org.apache.ratis.server.impl.LeaderElection: 
172.26.32.228_9858: Election PASSED; received 1 response(s) 
[172.26.32.228_9858<-172.26.32.231_9858#0:OK-t5] and 0 exception(s); 1
{code}

and on 172.26.32.231
{code}
2018-03-04 17:25:33,141 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10906
2018-03-04 17:25:35,499 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.231_9858: change Leader from 172.26.32.228_9858 to null at term 4 for 
updateCurrentTerm
2018-03-04 17:25:35,841 INFO 
org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.231_9858: 
appendEntries completed
2018-03-04 17:25:36,461 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.231_9858: change Leader from null to 172.26.32.228_9858 at term 5 for 
appendEntries
2018-03-04 17:25:36,686 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10931
{code}


The only visible difference from the log file is the index which is being 
rolled over.
Only on 172.26.32.230 the rolled over index is 10908. However on other nodes is 
10906.

{code}
2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908
{code}

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> ----------------------------------------------------------------------------
>
>                 Key: RATIS-209
>                 URL: https://issues.apache.org/jira/browse/RATIS-209
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 0.2.0-alpha
>            Reporter: Mukul Kumar Singh
>            Assignee: Mukul Kumar Singh
>            Priority: Major
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
>         at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
>         at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> ************************************************************/
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to