[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415037#comment-16415037
 ] 

Hadoop QA commented on RATIS-209:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
12s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 15s{color} | {color:orange} root: The patch generated 1 new + 79 unchanged - 
0 fixed = 80 total (was 79) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  9m 17s{color} 
| {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 8s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 16m 23s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | ratis.hadooprpc.TestLeaderElectionWithHadoopRpc |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-03-27 
|
| JIRA Issue | RATIS-209 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12916334/RATIS-209.004.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  checkstyle  
compile  |
| uname | Linux 1af052732619 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 288ea4f |
| Default Java | 1.8.0_151 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-RATIS-Build/155/artifact/out/diff-checkstyle-root.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-RATIS-Build/155/artifact/out/patch-unit-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/155/testReport/ |
| modules | C: ratis-server U: ratis-server |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/155/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Attachments: RATIS-209.004.patch
>
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is 

[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-26 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415033#comment-16415033
 ] 

Tsz Wo Nicholas Sze commented on RATIS-209:
---

+1 the 004 patch looks good.

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Attachments: RATIS-209.004.patch
>
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-26 Thread Mukul Kumar Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415023#comment-16415023
 ] 

Mukul Kumar Singh commented on RATIS-209:
-

Thanks look looking into the patch [~szetszwo]. I have removed the changed in 
RaftServerlmpl. These changes were from a previous debug patch which were left 
by mistake.

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Attachments: RATIS-209.004.patch
>
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-26 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414909#comment-16414909
 ] 

Tsz Wo Nicholas Sze commented on RATIS-209:
---

One more question:
- Why moving state.updateConfiguration and state.updateStatemachine to the end 
in RaftServerImpl.appendEntriesAsync?  It seems the code has nothing to do with 
this bug.

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Attachments: RATIS-209.003.patch
>
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414256#comment-16414256
 ] 

Hadoop QA commented on RATIS-209:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  4m  
2s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 6s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 18s{color} | {color:orange} root: The patch generated 1 new + 389 unchanged 
- 0 fixed = 390 total (was 389) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 20s{color} 
| {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 9s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 26m  5s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
ratis.server.simulation.TestRaftReconfigurationWithSimulatedRpc |
|   | ratis.server.simulation.TestLeaderElectionWithSimulatedRpc |
|   | ratis.server.impl.TestRaftServerJmx |
|   | ratis.server.simulation.TestReinitializationWithSimulatedRpc |
|   | ratis.server.TestRaftLogMetrics |
|   | ratis.server.simulation.TestRaftWithSimulatedRpc |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-03-26 
|
| JIRA Issue | RATIS-209 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12916233/RATIS-209.003.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  checkstyle  
compile  |
| uname | Linux 128d00337f0a 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 
19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 288ea4f |
| Default Java | 1.8.0_151 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-RATIS-Build/154/artifact/out/diff-checkstyle-root.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-RATIS-Build/154/artifact/out/patch-unit-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/154/testReport/ |
| modules | C: ratis-server U: ratis-server |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/154/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: 

[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-26 Thread Mukul Kumar Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414173#comment-16414173
 ] 

Mukul Kumar Singh commented on RATIS-209:
-

Thanks for reviewing the patch [~szetszwo], I have changed the error message to 
error was suggested. I tried of a test case, however it is very difficult to 
reproduce this race condition in a test.

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Attachments: RATIS-209.003.patch
>
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-26 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413624#comment-16413624
 ] 

Tsz Wo Nicholas Sze commented on RATIS-209:
---

Good catch on the bug!
{code}
+} catch (Throwable throwable) {
+  LOG.info("exception while appending entry with index:{}",
+  entry.getIndex(), throwable);
+  throw throwable;
 }
{code}
- I guess the code above is very useful in debugging.  How about removing it?
-* If we want to keep it, we should change it to LOG.error and should not use 
"{}" since the throwable won't be printed.
- Could you see if you can add a test?


> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Attachments: RATIS-209.002.patch
>
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}




[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413094#comment-16413094
 ] 

Hadoop QA commented on RATIS-209:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
25s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 11m 25s{color} 
| {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 19m 30s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
ratis.server.simulation.TestRaftReconfigurationWithSimulatedRpc |
|   | ratis.server.simulation.TestLeaderElectionWithSimulatedRpc |
|   | ratis.server.impl.TestRaftServerJmx |
|   | ratis.server.simulation.TestReinitializationWithSimulatedRpc |
|   | ratis.server.simulation.TestRaftWithSimulatedRpc |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-03-25 
|
| JIRA Issue | RATIS-209 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12916101/RATIS-209.002.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  checkstyle  
compile  |
| uname | Linux 5c8836293795 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 
19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 288ea4f |
| Default Java | 1.8.0_151 |
| unit | 
https://builds.apache.org/job/PreCommit-RATIS-Build/152/artifact/out/patch-unit-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/152/testReport/ |
| modules | C: ratis-server U: ratis-server |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/152/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Attachments: RATIS-209.002.patch
>
>
> This issue is happening 

[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-23 Thread Mukul Kumar Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411840#comment-16411840
 ] 

Mukul Kumar Singh commented on RATIS-209:
-

The reason for this bug is following, Assume a 3 node cluster as in the case of 
Ozone.

1) Leader accepts a request and forwards it to the followers.
2) one of the follower, is not able to add the request completely. Please note 
that the requests are sent through appendEntriesAsync, each individual entry is 
now inserted through SegmentedRaftLog#appendEntry. 
3) SegmentedRaftLog#appendEntry first inserts entry to the cache and then to 
the statemachine. However because of any error, if the write to stateMachine 
data fails then the there will be a spurious entry in the cache.
4) Follower now receives a commit info from the leader about the entry it 
missed in step(3). As the cache entry is present, the statemachine updater 
process the request even when the statemachine write was not successful.

Adding logs from a failed instance,
Leader adds the request to its log
{code}
2018-03-23 12:51:18,583 INFO  storage.RaftLog 
(SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158 
12:51:18,598 INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:handleWriteChunk(207)) - adding entry:158 
{code}

Follower receive the request
{code}
2018-03-23 12:51:19,294 INFO  storage.RaftLog 
(SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158
2018-03-23 12:51:19,498 INFO  storage.RaftLog 
(SegmentedRaftLog.java:appendEntry(275)) - added log entry for index 158
{code}

One follower succeeds in adding entry to state machine while the other fails.
{code}
2018-03-23 12:51:19,895 INFO  ratis.ContainerStateMachine (ContainerSt
ateMachine.java:handleWriteChunk(207)) - adding entry:158
2018-03-23 12:51:20,219 WARN  impl.LogAppender 
(GRpcLogAppender.java:onError(226)) - 127.0.0.1_63413 got error when appending 
entries to 127.0.0.1_63403, exception: {}.
java.io.IOException: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
INTERNAL: Java heap space
at 
org.apache.ratis.grpc.RaftGrpcUtil.unwrapException(RaftGrpcUtil.java:65)
at 
org.apache.ratis.grpc.RaftGrpcUtil.unwrapThrowable(RaftGrpcUtil.java:48)
at 
org.apache.ratis.grpc.server.GRpcLogAppender$AppendLogResponseHandler.onError(GRpcLogAppender.java:227)
at 
org.apache.ratis.shaded.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:395)
at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:481)
at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:398)
at 
org.apache.ratis.shaded.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:513)
at 
org.apache.ratis.shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:52)
at 
org.apache.ratis.shaded.io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: INTERNAL: 
Java heap space
at 
org.apache.ratis.shaded.io.grpc.Status.asRuntimeException(Status.java:545)
... 9 more
{code}

Leader now receives failure info.
{code}
2018-03-23 12:51:20,220 INFO  impl.LogAppender 
(GRpcLogAppender.java:onError(247)) - 127.0.0.1_63413 failed 
peer:127.0.0.1_63403 next index:158
2018-03-23 12:51:20,264 INFO  storage.RaftLog 
(RaftLog.java:updateLastCommitted(92)) - 127.0.0.1_63413: Updating 
lastCommitted to 158
{code}

Failed follower not trying to commit the transaction on receiving the commit 
info from leader. This fails with NullPointerException
{code}
2018-03-23 12:51:20,966 INFO  ratis.ContainerStateMachine 
(ContainerStateMachine.java:applyTransaction(263)) - removing entry:158
2018-03-23 12:51:20,972 ERROR impl.StateMachineUpdater 
(ExitUtils.java:terminate(86)) - Terminating with exit status 2: 
StateMachineUpdater-127.0.0.1_63403: the StateMachineUpdater hits Throwable

java.lang.NullPointerException
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:267)
at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1051)
at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154)
at java.lang.Thread.run(Thread.java:745)
*** shutting down gRPC server since JVM is shutting down
*** shutting down gRPC server since JVM is shutting down
{code}

> StateMachine updater may 

[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-21 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407678#comment-16407678
 ] 

Tsz Wo Nicholas Sze commented on RATIS-209:
---

> ...  The appendEntries are async, therefore the follower can append the 
> entries in future and return the next index. ...

appendEntries returned a combined future; see RaftLogWorker.WriteLog.  So that 
the stateMachineFuture must be completed when combined future is completed.

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-18 Thread Mukul Kumar Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404362#comment-16404362
 ] 

Mukul Kumar Singh commented on RATIS-209:
-

Thanks for looking into this [~szetszwo]. I agree, the null pointer exception 
is in ContainerStateMachine. However this is because of a case where 1) a 
commit key request is received for a message where the writeStateMachine data 
request wasn't processed.

I am trying to think of a possible case to explain this issue but no luck there.

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-06 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388775#comment-16388775
 ] 

Tsz Wo Nicholas Sze commented on RATIS-209:
---

Hey [~msingh], this is a NPE in ContainerStateMachine.java:265 in Ozone (not 
Ratis)
{code:java}
ava.lang.NullPointerException
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:265)
 {code}

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
> 
>
> Key: RATIS-209
> URL: https://issues.apache.org/jira/browse/RATIS-209
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.2.0-alpha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>
> This issue is happening on ozone for write chunk request.
> 1) currently write chunk request is processed in two phases, in the first 
> phase the user data is written to the follower as part of 
> {{writeStateMachineData}} and then the entry is committed to the follower as 
> part of {{commit}}.
> 2) The issue which is hit right now is the case where a) 
> {{writeStateMachineData}} didn't happen for a particular chunk however b) the 
> commit entry is still processed. this leads to a case where a corresponding 
> stateMachineFuture is not present in the hashmap.
> {code}
> 2018-02-12 00:26:30,097 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:1773, 
> electionTimeout:873ms
> 2018-02-12 00:26:30,098 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 3 for 
> changeToCandidate
> 2018-02-12 00:26:30,100 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from 172.26.32.232_9858 to null at term 3 
> for initElection
> 2018-02-12 00:26:32,869 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: begin an election in Term 4
> 2018-02-12 00:26:32,901 INFO 
> org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.228_9858: 
> appendEntries completed
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.LeaderElection: 
> 172.26.32.228_9858: Election REJECTED; received 2 response(s) 
> [172.26.32.228_9858<-172.26.32.230_9858#0:FAIL-t4, 
> 172.26.32.228_9858<-172.26.32.232_9858#0:FAIL-t4] and 0 exception(s); 
> 172.26.32.228_9858:t4, leader=null, voted=172.26.32.228_9858, raftlog=[(t:3, 
> i:10711)], conf=[172.26.32.228_9858:172.26.32.228:9858, 
> 172.26.32.230_9858:172.26.32.230:9858, 
> 172.26.32.232_9858:172.26.32.232:9858], old=null
> 2018-02-12 00:26:33,217 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
> changeToFollower
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.FollowerState: 
> 172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:5624, 
> electionTimeout:975ms
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 5 for 
> changeToCandidate
> 2018-02-12 00:26:39,518 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858 changes role from CANDIDATE to FOLLOWER at term 5 for 
> changeToFollower
> 2018-02-12 00:26:39,520 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 172.26.32.228_9858: change Leader from null to 172.26.32.232_9858 at term 5 
> for appendEntries
> {code}
> {code}
> 2018-02-12 00:31:12,400 ERROR 
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit 
> status 2: StateMachineUpdater-172.26.32.228_9858: the StateMachineUpdater 
> hits Throwable
> java.lang.NullPointerException
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:254)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1001)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:151)
> at java.lang.Thread.run(Thread.java:745)
> 2018-02-12 00:31:12,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down DataNode at 
> y128.l42scl.hortonworks.com/172.26.32.228
> /
> *** shutting down gRPC server since JVM is shutting down
> *** server shut down
> ***
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-209) StateMachine updater may miss writeLog Request after a new Leader is chosen.

2018-03-04 Thread Mukul Kumar Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/RATIS-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385588#comment-16385588
 ] 

Mukul Kumar Singh commented on RATIS-209:
-

Some more debug info, the following run was started with 3 nodes 
172.26.32.228:9858, 172.26.32.230:9858, 172.26.32.231:9858.

And the problem happens on 172.26.32.230
{code}
2018-03-04 17:25:35,713 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.230_9858 changes role from CANDIDATE to FOLLOWER at term 4 for 
changeToFollower
2018-03-04 17:25:35,882 INFO 
org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.230_9858: 
appendEntries completed
2018-03-04 17:25:36,462 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.230_9858: change Leader from null to 172.26.32.228_9858 at term 5 for 
appendEntries
2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908
2018-03-04 17:25:38,290 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10931
2018-03-04 17:25:38,398 ERROR org.apache.ratis.server.impl.StateMachineUpdater: 
Terminating with exit status 2: StateMachineUpdater-172.26.32.230_9858: the 
StateMachineUpdater hits Throwable
java.lang.NullPointerException
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:265)
at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1051)
at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154)
at java.lang.Thread.run(Thread.java:745)
2018-03-04 17:25:38,403 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
SHUTDOWN_MSG: 
{code}

on 172.26.32.228:9858, the following entries are seen
{code}
2018-03-04 12:25:32,771 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.228_9858-RaftLogWorker index to:10906
2018-03-04 12:25:35,473 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.228_9858: change Leader from 172.26.32.228_9858 to null at term 4 for 
updateCurrentTerm
2018-03-04 12:25:35,474 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.228_9858 changes role from LEADER to FOLLOWER at term 4 for 
changeToFollower
2018-03-04 12:25:35,475 INFO org.apache.ratis.server.impl.PendingRequests: 
172.26.32.228_9858 sends responses before shutting down PendingRequestsHandler
2018-03-04 12:25:35,842 INFO org.apache.ratis.server.impl.LogAppender: 
172.26.32.228_9858 stops appending log entries to follower 
172.26.32.231_9858(next=10931, match=10925, attendVote=true, lastRpcSendTi
me=422478ms, lastRpcResponseTime=422910ms)
2018-03-04 12:25:35,883 INFO org.apache.ratis.server.impl.LogAppender: 
172.26.32.228_9858 stops appending log entries to follower 
172.26.32.230_9858(next=10911, match=10732, attendVote=true, lastRpcSendTi
me=422580ms, lastRpcResponseTime=422541ms)
2018-03-04 12:25:36,423 INFO org.apache.ratis.server.impl.FollowerState: 
172.26.32.228_9858 changes to CANDIDATE, lastRpcTime:944, electionTimeout:944ms
2018-03-04 12:25:36,424 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.228_9858 changes role from FOLLOWER to CANDIDATE at term 4 for 
changeToCandidate
2018-03-04 12:25:36,439 INFO org.apache.ratis.server.impl.LeaderElection: 
172.26.32.228_9858: begin an election in Term 5
2018-03-04 12:25:36,455 INFO org.apache.ratis.server.impl.LeaderElection: 
172.26.32.228_9858: Election PASSED; received 1 response(s) 
[172.26.32.228_9858<-172.26.32.231_9858#0:OK-t5] and 0 exception(s); 1
{code}

and on 172.26.32.231
{code}
2018-03-04 17:25:33,141 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10906
2018-03-04 17:25:35,499 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.231_9858: change Leader from 172.26.32.228_9858 to null at term 4 for 
updateCurrentTerm
2018-03-04 17:25:35,841 INFO 
org.apache.ratis.grpc.server.RaftServerProtocolService: 172.26.32.231_9858: 
appendEntries completed
2018-03-04 17:25:36,461 INFO org.apache.ratis.server.impl.RaftServerImpl: 
172.26.32.231_9858: change Leader from null to 172.26.32.228_9858 at term 5 for 
appendEntries
2018-03-04 17:25:36,686 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.231_9858-RaftLogWorker index to:10931
{code}


The only visible difference from the log file is the index which is being 
rolled over.
Only on 172.26.32.230 the rolled over index is 10908. However on other nodes is 
10906.

{code}
2018-03-04 17:25:36,903 INFO org.apache.ratis.server.storage.RaftLogWorker: 
Rolling segment:172.26.32.230_9858-RaftLogWorker index to:10908
{code}

> StateMachine updater may miss writeLog Request after a new Leader is chosen.
>