[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415037#comment-16415037 ] Hadoop QA commented on RATIS-209: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 25s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 12s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 15s{color} | {color:orange} root: The patch generated 1 new + 79 unchanged - 0 fixed = 80 total (was 79) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 9m 17s{color} | {color:red} root in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 8s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 16m 23s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | ratis.hadooprpc.TestLeaderElectionWithHadoopRpc | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-03-27 | | JIRA Issue | RATIS-209 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12916334/RATIS-209.004.patch | | Optional Tests | asflicense javac javadoc unit findbugs checkstyle compile | | uname | Linux 1af052732619 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh | | git revision | master / 288ea4f | | Default Java | 1.8.0_151 | | checkstyle | https://builds.apache.org/job/PreCommit-RATIS-Build/155/artifact/out/diff-checkstyle-root.txt | | unit | https://builds.apache.org/job/PreCommit-RATIS-Build/155/artifact/out/patch-unit-root.txt | | Test Results | https://builds.apache.org/job/PreCommit-RATIS-Build/155/testReport/ | | modules | C: ratis-server U: ratis-server | | Console output | https://builds.apache.org/job/PreCommit-RATIS-Build/155/console | | Powered by | Apache Yetus 0.5.0 http://yetus.apache.org | This message was automatically generated. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-209.004.patch > > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415033#comment-16415033 ] Tsz Wo Nicholas Sze commented on RATIS-209: --- +1 the 004 patch looks good. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-209.004.patch > > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415023#comment-16415023 ] Mukul Kumar Singh commented on RATIS-209: - Thanks look looking into the patch [~szetszwo]. I have removed the changed in RaftServerlmpl. These changes were from a previous debug patch which were left by mistake. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-209.004.patch > > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414909#comment-16414909 ] Tsz Wo Nicholas Sze commented on RATIS-209: --- One more question: - Why moving state.updateConfiguration and state.updateStatemachine to the end in RaftServerImpl.appendEntriesAsync? It seems the code has nothing to do with this bug. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-209.003.patch > > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414256#comment-16414256 ] Hadoop QA commented on RATIS-209: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 4m 2s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 6s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 18s{color} | {color:orange} root: The patch generated 1 new + 389 unchanged - 0 fixed = 390 total (was 389) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 20s{color} | {color:red} root in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 9s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 26m 5s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | ratis.server.simulation.TestRaftReconfigurationWithSimulatedRpc | | | ratis.server.simulation.TestLeaderElectionWithSimulatedRpc | | | ratis.server.impl.TestRaftServerJmx | | | ratis.server.simulation.TestReinitializationWithSimulatedRpc | | | ratis.server.TestRaftLogMetrics | | | ratis.server.simulation.TestRaftWithSimulatedRpc | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-03-26 | | JIRA Issue | RATIS-209 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12916233/RATIS-209.003.patch | | Optional Tests | asflicense javac javadoc unit findbugs checkstyle compile | | uname | Linux 128d00337f0a 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh | | git revision | master / 288ea4f | | Default Java | 1.8.0_151 | | checkstyle | https://builds.apache.org/job/PreCommit-RATIS-Build/154/artifact/out/diff-checkstyle-root.txt | | unit | https://builds.apache.org/job/PreCommit-RATIS-Build/154/artifact/out/patch-unit-root.txt | | Test Results | https://builds.apache.org/job/PreCommit-RATIS-Build/154/testReport/ | | modules | C: ratis-server U: ratis-server | | Console output | https://builds.apache.org/job/PreCommit-RATIS-Build/154/console | | Powered by | Apache Yetus 0.5.0 http://yetus.apache.org | This message was automatically generated. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type:
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414173#comment-16414173 ] Mukul Kumar Singh commented on RATIS-209: - Thanks for reviewing the patch [~szetszwo], I have changed the error message to error was suggested. I tried of a test case, however it is very difficult to reproduce this race condition in a test. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-209.003.patch > > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413624#comment-16413624 ] Tsz Wo Nicholas Sze commented on RATIS-209: --- Good catch on the bug! {code} +} catch (Throwable throwable) { + LOG.info("exception while appending entry with index:{}", + entry.getIndex(), throwable); + throw throwable; } {code} - I guess the code above is very useful in debugging. How about removing it? -* If we want to keep it, we should change it to LOG.error and should not use "{}" since the throwable won't be printed. - Could you see if you can add a test? > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-209.002.patch > > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code}
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413094#comment-16413094 ] Hadoop QA commented on RATIS-209: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 2s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 11m 25s{color} | {color:red} root in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 10s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 19m 30s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | ratis.server.simulation.TestRaftReconfigurationWithSimulatedRpc | | | ratis.server.simulation.TestLeaderElectionWithSimulatedRpc | | | ratis.server.impl.TestRaftServerJmx | | | ratis.server.simulation.TestReinitializationWithSimulatedRpc | | | ratis.server.simulation.TestRaftWithSimulatedRpc | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-03-25 | | JIRA Issue | RATIS-209 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12916101/RATIS-209.002.patch | | Optional Tests | asflicense javac javadoc unit findbugs checkstyle compile | | uname | Linux 5c8836293795 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh | | git revision | master / 288ea4f | | Default Java | 1.8.0_151 | | unit | https://builds.apache.org/job/PreCommit-RATIS-Build/152/artifact/out/patch-unit-root.txt | | Test Results | https://builds.apache.org/job/PreCommit-RATIS-Build/152/testReport/ | | modules | C: ratis-server U: ratis-server | | Console output | https://builds.apache.org/job/PreCommit-RATIS-Build/152/console | | Powered by | Apache Yetus 0.5.0 http://yetus.apache.org | This message was automatically generated. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-209.002.patch > > > This issue is happening
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411840#comment-16411840 ] Mukul Kumar Singh commented on RATIS-209: - The reason for this bug is following, Assume a 3 node cluster as in the case of Ozone. 1) Leader accepts a request and forwards it to the followers. 2) one of the follower, is not able to add the request completely. Please note that the requests are sent through appendEntriesAsync, each individual entry is now inserted through SegmentedRaftLog#appendEntry. 3) SegmentedRaftLog#appendEntry first inserts entry to the cache and then to the statemachine. However because of any error, if the write to stateMachine data fails then the there will be a spurious entry in the cache. 4) Follower now receives a commit info from the leader about the entry it missed in step(3). As the cache entry is present, the statemachine updater process the request even when the statemachine write was not successful. Adding logs from a failed instance, Leader adds the request to its log {code} 2018-03-23 12:51:18,583 INFO storage.RaftLog (SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158 12:51:18,598 INFO ratis.ContainerStateMachine (ContainerStateMachine.java:handleWriteChunk(207)) - adding entry:158 {code} Follower receive the request {code} 2018-03-23 12:51:19,294 INFO storage.RaftLog (SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158 2018-03-23 12:51:19,498 INFO storage.RaftLog (SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158 {code} One follower succeeds in adding entry to state machine while the other fails. {code} 2018-03-23 12:51:19,895 INFO ratis.ContainerStateMachine (ContainerSt ateMachine.java:handleWriteChunk(207)) - adding entry:158 2018-03-23 12:51:20,219 WARN impl.LogAppender (GRpcLogAppender.java:onError(226)) - 127.0.0.1_63413 got error when appending entries to 127.0.0.1_63403, exception: {}. java.io.IOException: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: Java heap space at org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:65) at org.apache.ratis.grpc.RaftGrpcUtil.unwrapThrowable(RaftGrpcUtil.java:48) at org.apache.ratis.grpc.server.GRpcLogAppender$AppendLogResponseHandler.onError(GRpcLogAppender.java:227) at org.apache.ratis.shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395) at org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:481) at org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:398) at org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:513) at org.apache.ratis.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52) at org.apache.ratis.shaded.io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: Java heap space at org.apache.ratis.shaded.io.grpc.Status.asRuntimeException(Status.java:545) ... 9 more {code} Leader now receives failure info. {code} 2018-03-23 12:51:20,220 INFO impl.LogAppender (GRpcLogAppender.java:onError(247)) - 127.0.0.1_63413 failed peer:127.0.0.1_63403 next index:158 2018-03-23 12:51:20,264 INFO storage.RaftLog (RaftLog.java:updateLastCommitted(92)) - 127.0.0.1_63413: Updating lastCommitted to 158 {code} Failed follower not trying to commit the transaction on receiving the commit info from leader. This fails with NullPointerException {code} 2018-03-23 12:51:20,966 INFO ratis.ContainerStateMachine (ContainerStateMachine.java:applyTransaction(263)) - removing entry:158 2018-03-23 12:51:20,972 ERROR impl.StateMachineUpdater (ExitUtils.java:terminate(86)) - Terminating with exit status 2: StateMachineUpdater-127.0.0.1_63403: the StateMachineUpdater hits Throwable java.lang.NullPointerException at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:267) at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1051) at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154) at java.lang.Thread.run(Thread.java:745) *** shutting down gRPC server since JVM is shutting down *** shutting down gRPC server since JVM is shutting down {code} > StateMachine updater may
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407678#comment-16407678 ] Tsz Wo Nicholas Sze commented on RATIS-209: --- > ... The appendEntries are async, therefore the follower can append the > entries in future and return the next index. ... appendEntries returned a combined future; see RaftLogWorker.WriteLog. So that the stateMachineFuture must be completed when combined future is completed. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404362#comment-16404362 ] Mukul Kumar Singh commented on RATIS-209: - Thanks for looking into this [~szetszwo]. I agree, the null pointer exception is in ContainerStateMachine. However this is because of a case where 1) a commit key request is received for a message where the writeStateMachine data request wasn't processed. I am trying to think of a possible case to explain this issue but no luck there. > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388775#comment-16388775 ] Tsz Wo Nicholas Sze commented on RATIS-209: --- Hey [~msingh], this is a NPE in ContainerStateMachine.java:265 in Ozone (not Ratis) {code:java} ava.lang.NullPointerException at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:265) {code} > StateMachine updater may miss writeLog Request after a new Leader is chosen. > > > Key: RATIS-209 > URL: https://issues.apache.org/jira/browse/RATIS-209 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.2.0-alpha >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > > This issue is happening on ozone for write chunk request. > 1) currently write chunk request is processed in two phases, in the first > phase the user data is written to the follower as part of > {{writeStateMachineData}} and then the entry is committed to the follower as > part of {{commit}}. > 2) The issue which is hit right now is the case where a) > {{writeStateMachineData}} didn't happen for a particular chunk however b) the > commit entry is still processed. this leads to a case where a corresponding > stateMachineFuture is not present in the hashmap. > {code} > 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, > electionTimeout:873ms > 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for > changeToCandidate > 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 > for initElection > 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: begin an election in Term 4 > 2018-02-12 00:26:32,901 INFO > org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: > appendEntries completed > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: > 172.26.32.228_9858: Election REJECTED; received 2 response(s) > [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, > 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); > 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, > i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, > 172.26.32.230_9858:172.26.32.230:9858, > 172.26.32.232_9858:172.26.32.232:9858], old=null > 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for > changeToFollower > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: > 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, > electionTimeout:975ms > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for > changeToCandidate > 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for > changeToFollower > 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: > 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 > for appendEntries > {code} > {code} > 2018-02-12 00:31:12,400 ERROR > org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit > status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater > hits Throwable > java.lang.NullPointerException > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254) > at > org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151) > at java.lang.Thread.run(Thread.java:745) > 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down DataNode at > y128.l42scl.hortonworks.com/172.26.32.228 > / > *** shutting down gRPC server since JVM is shutting down > *** server shut down > *** > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.
[ https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385588#comment-16385588 ] Mukul Kumar Singh commented on RATIS-209: - Some more debug info, the following run was started with 3 nodes 172.26.32.228:9858, 172.26.32.230:9858, 172.26.32.231:9858. And the problem happens on 172.26.32.230 {code} 2018-03-04 17:25:35,713 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.230_9858 changes role from CANDIDATE to FOLLOWER at term 4 for changeToFollower 2018-03-04 17:25:35,882 INFO org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.230_9858: appendEntries completed 2018-03-04 17:25:36,462 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.230_9858: change Leader from null to 172.26.32.228_9858 at term 5 for appendEntries 2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908 2018-03-04 17:25:38,290 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10931 2018-03-04 17:25:38,398 ERROR org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit status 2: StateMachineUpdater-172.26.32.230_9858: the StateMachineUpdater hits Throwable java.lang.NullPointerException at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:265) at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1051) at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154) at java.lang.Thread.run(Thread.java:745) 2018-03-04 17:25:38,403 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: {code} on 172.26.32.228:9858, the following entries are seen {code} 2018-03-04 12:25:32,771 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.228_9858-RaftLogWorker index to:10906 2018-03-04 12:25:35,473 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.228_9858: change Leader from 172.26.32.228_9858 to null at term 4 for updateCurrentTerm 2018-03-04 12:25:35,474 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.228_9858 changes role from LEADER to FOLLOWER at term 4 for changeToFollower 2018-03-04 12:25:35,475 INFO org.apache.ratis.server.impl.PendingRequests: 172.26.32.228_9858 sends responses before shutting down PendingRequestsHandler 2018-03-04 12:25:35,842 INFO org.apache.ratis.server.impl.LogAppender: 172.26.32.228_9858 stops appending log entries to follower 172.26.32.231_9858(next=10931, match=10925, attendVote=true, lastRpcSendTi me=422478ms, lastRpcResponseTime=422910ms) 2018-03-04 12:25:35,883 INFO org.apache.ratis.server.impl.LogAppender: 172.26.32.228_9858 stops appending log entries to follower 172.26.32.230_9858(next=10911, match=10732, attendVote=true, lastRpcSendTi me=422580ms, lastRpcResponseTime=422541ms) 2018-03-04 12:25:36,423 INFO org.apache.ratis.server.impl.FollowerState: 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:944, electionTimeout:944ms 2018-03-04 12:25:36,424 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 4 for changeToCandidate 2018-03-04 12:25:36,439 INFO org.apache.ratis.server.impl.LeaderElection: 172.26.32.228_9858: begin an election in Term 5 2018-03-04 12:25:36,455 INFO org.apache.ratis.server.impl.LeaderElection: 172.26.32.228_9858: Election PASSED; received 1 response(s) [172.26.32.228_9858<-172.26.32.231_9858#0:OK-t5] and 0 exception(s); 1 {code} and on 172.26.32.231 {code} 2018-03-04 17:25:33,141 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10906 2018-03-04 17:25:35,499 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.231_9858: change Leader from 172.26.32.228_9858 to null at term 4 for updateCurrentTerm 2018-03-04 17:25:35,841 INFO org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.231_9858: appendEntries completed 2018-03-04 17:25:36,461 INFO org.apache.ratis.server.impl.RaftServerImpl: 172.26.32.231_9858: change Leader from null to 172.26.32.228_9858 at term 5 for appendEntries 2018-03-04 17:25:36,686 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10931 {code} The only visible difference from the log file is the index which is being rolled over. Only on 172.26.32.230 the rolled over index is 10908. However on other nodes is 10906. {code} 2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908 {code} > StateMachine updater may miss writeLog Request after a new Leader is chosen. >