[jira] [Commented] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576699#comment-16576699
 ] 

Tsz Wo Nicholas Sze commented on RATIS-270:
---

> 1) RaftServerImpl:1069, let's rename updateCache to isReplyDelayed; I feel 
> that this will help with the comments as well, ...

For a follower, there is no reply, so I will keep the name "updateCache". 
Sure, let's add more comments.

Will address #2 and #3 in the next patch.

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180803.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case where the client retries for 
> a response which is stuck in the delayed-replies queue. This new retry is 
> answered from the retry cache even though the request has not been completed 
> on all the nodes.
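The guard the description calls for can be sketched as follows. This is an illustrative simplification, not the actual Ratis retry-cache API; the names `Entry`, `delayed`, and `lookup` are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a cached reply for a Replication_ALL request must not be served
// while the original reply is still sitting in the delayed-replies queue,
// i.e. before all nodes have completed the request.
class RetryCacheSketch {
  static class Entry {
    final String reply;
    final boolean delayed;  // true while waiting for all followers

    Entry(String reply, boolean delayed) {
      this.reply = reply;
      this.delayed = delayed;
    }
  }

  private final Map<Long, Entry> cache = new HashMap<>();

  void put(long callId, String reply, boolean delayed) {
    cache.put(callId, new Entry(reply, delayed));
  }

  /** Returns the cached reply, or null if absent or still delayed. */
  String lookup(long callId) {
    final Entry e = cache.get(callId);
    return (e == null || e.delayed) ? null : e.reply;
  }
}
```

With this shape, a retry that arrives while the reply is delayed falls through to normal request processing instead of being answered prematurely from the cache.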



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (RATIS-290) Raft server should notify the state machine if no leader is assigned for a long time

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-290:

Attachment: (was: RATIS-290.002.patch)

> Raft server should notify the state machine if no leader is assigned for a 
> long time
> 
>
> Key: RATIS-290
> URL: https://issues.apache.org/jira/browse/RATIS-290
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>  Labels: ozone
> Fix For: 0.3.0
>
>
> In Ratis, a Raft server can be in one of 3 states: Candidate, Leader, and 
> Follower. In a cluster, the steady system state is one node being the leader 
> and the others being followers. This jira proposes to add a new API to 
> identify a node that has been left aside because of a network partition and 
> is without a leader.
> Once it is detected that a node has been in the Candidate state for a 
> sufficiently long time, a callback to the state machine will be triggered to 
> handle this partitioned node.
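The proposed detection could be sketched as below. The interface and class names here are hypothetical illustrations, not the actual RATIS-290 API:

```java
// Hypothetical shape of the proposed notification; the real API may differ.
interface StateMachineNotifier {
  /** Called when this server has stayed in the Candidate state without a
   *  leader for longer than the configured threshold. */
  void notifyExtendedNoLeader(long noLeaderMs);
}

class NoLeaderDetector {
  private final long thresholdMs;
  private final StateMachineNotifier notifier;
  private long candidateSinceMs = -1;  // -1 means a leader is known

  NoLeaderDetector(long thresholdMs, StateMachineNotifier notifier) {
    this.thresholdMs = thresholdMs;
    this.notifier = notifier;
  }

  void onBecomeCandidate(long nowMs) { candidateSinceMs = nowMs; }

  void onLeaderElected() { candidateSinceMs = -1; }

  /** Periodic check; fires the callback once the threshold is exceeded. */
  void check(long nowMs) {
    if (candidateSinceMs >= 0 && nowMs - candidateSinceMs > thresholdMs) {
      notifier.notifyExtendedNoLeader(nowMs - candidateSinceMs);
    }
  }
}
```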





[jira] [Commented] (RATIS-290) Raft server should notify the state machine if no leader is assigned for a long time

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576645#comment-16576645
 ] 

Tsz Wo Nicholas Sze commented on RATIS-290:
---

- We should not add RaftServerImpl.group since the membership may change.  The 
peers should be retrieved from the conf as below.
{code:java}
-final RaftGroup group = new RaftGroup(groupId, getRaftConf().getPeers());
 {code}
- lastNoLeaderTime should be declared as
{code}
volatile Timestamp lastNoLeaderTime;
{code}
-- Then, use a local variable in getLastLeaderElapsedTimeMs() (let's add "Ms" 
to the method name).
{code}
  long getLastLeaderElapsedTimeMs() {
final Timestamp t = lastNoLeaderTime;
return t == null ? 0 : t.elapsedTimeMs();
  }
{code}
The getLastLeaderElapsedTime() implementation in the patch has a problem: 
lastNoLeaderTime.get() may return different values across calls due to a race 
condition.
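The race described above, and the copy-to-local fix, can be illustrated as follows. The `Timestamp` here is a simplified stand-in for the Ratis class, and the field access pattern is the point of the example, not the exact server code:

```java
// Reading a volatile field twice may observe two different values (another
// thread can reassign it in between), so copy it to a local variable once
// and use only the local copy for both the null check and the call.
class NoLeaderTracker {
  static class Timestamp {
    private final long creationMs;

    Timestamp(long creationMs) { this.creationMs = creationMs; }

    long elapsedTimeMs(long nowMs) { return nowMs - creationMs; }
  }

  volatile Timestamp lastNoLeaderTime;  // null while a leader is known

  long getLastLeaderElapsedTimeMs(long nowMs) {
    final Timestamp t = lastNoLeaderTime;  // single volatile read
    return t == null ? 0 : t.elapsedTimeMs(nowMs);
  }
}
```

Without the local copy, a null check on the field followed by a second read could see null on the second read and throw a NullPointerException.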


> Raft server should notify the state machine if no leader is assigned for a 
> long time
> 
>
> Key: RATIS-290
> URL: https://issues.apache.org/jira/browse/RATIS-290
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>  Labels: ozone
> Fix For: 0.3.0
>
> Attachments: RATIS-290.002.patch
>
>
> In Ratis, a Raft server can be in one of 3 states: Candidate, Leader, and 
> Follower. In a cluster, the steady system state is one node being the leader 
> and the others being followers. This jira proposes to add a new API to 
> identify a node that has been left aside because of a network partition and 
> is without a leader.
> Once it is detected that a node has been in the Candidate state for a 
> sufficiently long time, a callback to the state machine will be triggered to 
> handle this partitioned node.





[jira] [Commented] (RATIS-298) Update auto-common and log4j versions in ratis

2018-08-10 Thread Mukul Kumar Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576735#comment-16576735
 ] 

Mukul Kumar Singh commented on RATIS-298:
-

Thanks for the review, [~szetszwo]. I have tested this with Ozone using local 
builds and it works fine.

> Update auto-common and log4j versions in ratis
> --
>
> Key: RATIS-298
> URL: https://issues.apache.org/jira/browse/RATIS-298
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
> Attachments: RATIS-298.001.patch
>
>
> Update auto-common to 0.8 and log4j to 2.11.0 versions in ratis, as the 
> current versions cause compilation issues in Ozone:
> {code}
> [WARNING] 
> Dependency convergence error for com.google.auto:auto-common:0.10 paths to 
> dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto:auto-common:0.10
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto.service:auto-service:1.0-rc4
> +-com.google.auto:auto-common:0.8
>  
> [WARNING] 
> Dependency convergence error for org.apache.logging.log4j:log4j-api:2.6.2 
> paths to dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.6.2
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.11.0
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-core:2.11.0
> +-org.apache.logging.log4j:log4j-api:2.11.0
> {code}
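One conventional way to resolve such convergence errors on the consumer side is to pin a single version of each conflicting artifact in `dependencyManagement`. This pom fragment is only a sketch based on the versions named in the description, not the actual RATIS-298 or Ozone change:

```xml
<!-- Hypothetical pom.xml fragment: force one version of each conflicting
     artifact for the whole dependency tree, satisfying the enforcer's
     dependencyConvergence rule. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.auto</groupId>
      <artifactId>auto-common</artifactId>
      <version>0.8</version>
    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-api</artifactId>
      <version>2.11.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```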





[jira] [Updated] (RATIS-290) Raft server should notify the state machine if no leader is assigned for a long time

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-290:

Attachment: RATIS-290.002.patch

> Raft server should notify the state machine if no leader is assigned for a 
> long time
> 
>
> Key: RATIS-290
> URL: https://issues.apache.org/jira/browse/RATIS-290
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>  Labels: ozone
> Fix For: 0.3.0
>
>
> In Ratis, a Raft server can be in one of 3 states: Candidate, Leader, and 
> Follower. In a cluster, the steady system state is one node being the leader 
> and the others being followers. This jira proposes to add a new API to 
> identify a node that has been left aside because of a network partition and 
> is without a leader.
> Once it is detected that a node has been in the Candidate state for a 
> sufficiently long time, a callback to the state machine will be triggered to 
> handle this partitioned node.





[jira] [Updated] (RATIS-290) Raft server should notify the state machine if no leader is assigned for a long time

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-290:

Attachment: (was: RATIS-290.002.patch)

> Raft server should notify the state machine if no leader is assigned for a 
> long time
> 
>
> Key: RATIS-290
> URL: https://issues.apache.org/jira/browse/RATIS-290
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>  Labels: ozone
> Fix For: 0.3.0
>
>
> In Ratis, a Raft server can be in one of 3 states: Candidate, Leader, and 
> Follower. In a cluster, the steady system state is one node being the leader 
> and the others being followers. This jira proposes to add a new API to 
> identify a node that has been left aside because of a network partition and 
> is without a leader.
> Once it is detected that a node has been in the Candidate state for a 
> sufficiently long time, a callback to the state machine will be triggered to 
> handle this partitioned node.





[jira] [Commented] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576845#comment-16576845
 ] 

Tsz Wo Nicholas Sze commented on RATIS-260:
---

{quote}
In this test, one of the nodes was shut down permanently. This can result in 
a situation where a candidate node is never able to move out of the Leader 
Election phase.
{quote}
I have just checked the current code again.  I cannot see how this could 
happen.  I suspect that the candidate node cannot talk to the other nodes in 
this failure case, so it is not able to move out of Leader Election.
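The intended behavior named in the issue title, counting a failed vote request as a rejection instead of aborting the election, can be sketched like this. It is a simplification with illustrative names, not the actual LeaderElection.waitForResults code:

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: an unreachable peer (e.g. gRPC UNAVAILABLE) is treated as a "no"
// vote, and the election keeps polling the remaining peers instead of
// giving up on the first failure.
class ElectionSketch {
  /** Returns true if the candidate gathers a majority of granted votes. */
  static boolean waitForMajority(List<Future<Boolean>> votes, int clusterSize) {
    int granted = 1;  // the candidate always votes for itself
    for (Future<Boolean> f : votes) {
      try {
        if (f.get(1, TimeUnit.SECONDS)) {
          granted++;
        }
      } catch (ExecutionException | TimeoutException e) {
        // Peer failed or unreachable: count as no vote, keep going.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;  // stop waiting if interrupted
      }
    }
    return granted > clusterSize / 2;
  }
}
```

Under this scheme, one permanently dead node in a three-node group still allows a majority of 2 (self plus one reachable peer).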



> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch
>
>
> This bug was simulated with Ozone using Ratis for the data pipeline.
> In this test, one of the nodes was shut down permanently. This can result 
> in a situation where a candidate node is never able to move out of the 
> Leader Election phase.
> {code}
> 2018-06-15 07:44:58,246 INFO org.apache.ratis.server.impl.LeaderElection: 
> 0f7b9cd2-4dad-46d7-acbc-57d424492d00_9858 got exception when requesting 
> votes: {}
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at 
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
> at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
> at 
> org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:131)
> at 
> org.apache.ratis.shaded.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:281)
> at 
> org.apache.ratis.grpc.server.RaftServerProtocolClient.requestVote(RaftServerProtocolClient.java:61)
> at 
> org.apache.ratis.grpc.RaftGRpcService.requestVote(RaftGRpcService.java:147)
> at 
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.ratis.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: y128.l42scl.hortonworks.com/172.26.32.228:9858
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.ratis.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> ... 1 more
> Caused by: java.net.ConnectException: Connection refused
> ... 11 more
> {code}
> 

[jira] [Commented] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576703#comment-16576703
 ] 

Tsz Wo Nicholas Sze commented on RATIS-270:
---

r270_20180810.patch: addresses [~msingh]'s comments.

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case where the client retries for 
> a response which is stuck in the delayed-replies queue. This new retry is 
> answered from the retry cache even though the request has not been completed 
> on all the nodes.





[jira] [Updated] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-270:
--
Attachment: (was: r270_20180803.patch)

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case where the client retries for 
> a response which is stuck in the delayed-replies queue. This new retry is 
> answered from the retry cache even though the request has not been completed 
> on all the nodes.





[jira] [Updated] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-270:
--
Attachment: r270_20180810.patch

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case where the client retries for 
> a response which is stuck in the delayed-replies queue. This new retry is 
> answered from the retry cache even though the request has not been completed 
> on all the nodes.





[jira] [Updated] (RATIS-298) Update auto-common and log4j versions in ratis

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-298:

Attachment: RATIS-298.001.patch

> Update auto-common and log4j versions in ratis
> --
>
> Key: RATIS-298
> URL: https://issues.apache.org/jira/browse/RATIS-298
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
> Attachments: RATIS-298.001.patch
>
>
> Update auto-common to 0.8 and log4j to 2.11.0 versions in ratis.





[jira] [Commented] (RATIS-290) Raft server should notify the state machine if no leader is assigned for a long time

2018-08-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576783#comment-16576783
 ] 

Hadoop QA commented on RATIS-290:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  4m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
9s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
4s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 21s{color} | {color:orange} root: The patch generated 25 new + 791 unchanged 
- 3 fixed = 816 total (was 794) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 16m 50s{color} 
| {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 30m 47s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | ratis.TestRaftServerSlownessDetection |
|   | ratis.server.simulation.TestRaftWithSimulatedRpc |
|   | ratis.TestRaftServerLeaderElectionTimeout |
|   | ratis.server.simulation.TestLeaderElectionWithSimulatedRpc |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-08-10 
|
| JIRA Issue | RATIS-290 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935180/RATIS-290.003.patch |
| Optional Tests |  asflicense  cc  unit  javac  javadoc  findbugs  checkstyle  
compile  |
| uname | Linux e4612129386a 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 
08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 6a2c3d5 |
| Default Java | 1.8.0_181 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-RATIS-Build/292/artifact/out/diff-checkstyle-root.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-RATIS-Build/292/artifact/out/patch-unit-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/292/testReport/ |
| modules | C: ratis-proto-shaded ratis-server U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/292/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> Raft server should notify the state machine if no leader is assigned for a 
> long time
> 

[jira] [Commented] (RATIS-298) Update auto-common and log4j versions in ratis

2018-08-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576755#comment-16576755
 ] 

Hadoop QA commented on RATIS-298:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
13s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  6m 48s{color} 
| {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 6s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 12m 47s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | ratis.server.TestRaftLogMetrics |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-08-10 
|
| JIRA Issue | RATIS-298 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935175/RATIS-298.001.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  xml  compile  |
| uname | Linux d9f41987c435 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 6a2c3d5 |
| Default Java | 1.8.0_171 |
| unit | 
https://builds.apache.org/job/PreCommit-RATIS-Build/290/artifact/out/patch-unit-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/290/testReport/ |
| modules | C: ratis-proto-shaded U: ratis-proto-shaded |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/290/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |





> Update auto-common and log4j versions in ratis
> --
>
> Key: RATIS-298
> URL: https://issues.apache.org/jira/browse/RATIS-298
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
> Attachments: RATIS-298.001.patch
>
>
> Update auto-common to 0.8 and log4j to 2.11.0 versions in ratis, as the 
> current versions cause compilation issues in Ozone:
> {code}
> [WARNING] 
> Dependency convergence error for com.google.auto:auto-common:0.10 paths to 
> dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto:auto-common:0.10
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   

[jira] [Commented] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Shashikant Banerjee (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576770#comment-16576770
 ] 

Shashikant Banerjee commented on RATIS-260:
---

Thanks for the review, [~szetszwo]. The issue is not consistently 
reproducible with Ozone.

As discussed with [~msingh], it was hit once after 50 runs of Freon in a 
cluster. I ran basic Freon in Ozone and it worked well.

> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch
>
>
> This bug was simulated with Ozone using Ratis for the data pipeline.
> In this test, one of the nodes was shut down permanently. This can result 
> in a situation where a candidate node is never able to move out of the 
> Leader Election phase.
> {code}
> 2018-06-15 07:44:58,246 INFO org.apache.ratis.server.impl.LeaderElection: 
> 0f7b9cd2-4dad-46d7-acbc-57d424492d00_9858 got exception when requesting 
> votes: {}
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at 
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
> at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
> at 
> org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:131)
> at 
> org.apache.ratis.shaded.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:281)
> at 
> org.apache.ratis.grpc.server.RaftServerProtocolClient.requestVote(RaftServerProtocolClient.java:61)
> at 
> org.apache.ratis.grpc.RaftGRpcService.requestVote(RaftGRpcService.java:147)
> at 
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.ratis.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: y128.l42scl.hortonworks.com/172.26.32.228:9858
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.ratis.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> ... 1 more
> Caused by: java.net.ConnectException: Connection refused
> ... 11 more
> {code}
> This happens because of the following lines of code during requestVote.
> {code}
> for (final RaftPeer peer : others) {
>   final RequestVoteRequestProto r = 

[jira] [Commented] (RATIS-298) Update auto-common and log4j versions in ratis

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576730#comment-16576730
 ] 

Tsz Wo Nicholas Sze commented on RATIS-298:
---

+1 patch looks good.

Have you tested it with Ozone?

> Update auto-common and log4j versions in ratis
> --
>
> Key: RATIS-298
> URL: https://issues.apache.org/jira/browse/RATIS-298
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
> Attachments: RATIS-298.001.patch
>
>
> Update auto-common to 0.8 and log4j to 2.11.0 in Ratis.
> The current versions cause compilation issues in Ozone, as follows:
> {code}
> [WARNING] 
> Dependency convergence error for com.google.auto:auto-common:0.10 paths to 
> dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto:auto-common:0.10
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto.service:auto-service:1.0-rc4
> +-com.google.auto:auto-common:0.8
>  
> [WARNING] 
> Dependency convergence error for org.apache.logging.log4j:log4j-api:2.6.2 
> paths to dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.6.2
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.11.0
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-core:2.11.0
> +-org.apache.logging.log4j:log4j-api:2.11.0
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (RATIS-298) Update auto-common and log4j versions in ratis

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-298:
--
Component/s: (was: server)
 build

> Update auto-common and log4j versions in ratis
> --
>
> Key: RATIS-298
> URL: https://issues.apache.org/jira/browse/RATIS-298
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
> Attachments: RATIS-298.001.patch
>
>
> Update auto-common to 0.8 and log4j to 2.11.0 in Ratis.
> The current versions cause compilation issues in Ozone, as follows:
> {code}
> [WARNING] 
> Dependency convergence error for com.google.auto:auto-common:0.10 paths to 
> dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto:auto-common:0.10
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto.service:auto-service:1.0-rc4
> +-com.google.auto:auto-common:0.8
>  
> [WARNING] 
> Dependency convergence error for org.apache.logging.log4j:log4j-api:2.6.2 
> paths to dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.6.2
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.11.0
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-core:2.11.0
> +-org.apache.logging.log4j:log4j-api:2.11.0
> {code}





[jira] [Commented] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576729#comment-16576729
 ] 

Hadoop QA commented on RATIS-270:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 7s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 23s{color} | {color:orange} root: The patch generated 50 new + 613 unchanged 
- 23 fixed = 663 total (was 636) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  6m 56s{color} 
| {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 6s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 13m 35s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | ratis.server.simulation.TestRetryCacheWithSimulatedRpc |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-08-10 
|
| JIRA Issue | RATIS-270 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935171/r270_20180810.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  checkstyle  
compile  |
| uname | Linux e43f17295865 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 6a2c3d5 |
| Default Java | 1.8.0_171 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-RATIS-Build/289/artifact/out/diff-checkstyle-root.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-RATIS-Build/289/artifact/out/patch-unit-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/289/testReport/ |
| modules | C: ratis-server U: ratis-server |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/289/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case, where the client retries for 
> a response which 

[jira] [Commented] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576739#comment-16576739
 ] 

Tsz Wo Nicholas Sze commented on RATIS-270:
---

TestRetryCacheWithSimulatedRpc failed since it does not support async.  Will 
upload a new patch.

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case where the client retries a 
> request whose response is stuck in the delayed-replies queue. The retry is 
> then answered from the retry cache even though the request has not yet 
> completed on all the nodes.
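The intended guard can be sketched as follows (a minimal illustration with hypothetical names, not the actual RaftServerImpl code): a retry is served from the cache only after the original request has completed; while the reply is still delayed, the cache lookup returns nothing and the retry must wait for replication to finish.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: serve a retry from the cache only once the
// original request has fully completed (e.g. committed on all replicas).
class RetryCacheSketch {
  // A cache entry whose reply is null while the reply is still delayed.
  static final class Entry {
    volatile String reply; // null until the request completes everywhere
  }

  private final Map<Long, Entry> cache = new ConcurrentHashMap<>();

  /** Record that a request with this call id is in flight. */
  void startRequest(long callId) {
    cache.putIfAbsent(callId, new Entry());
  }

  /** Record the final reply once replication to all nodes is done. */
  void complete(long callId, String reply) {
    cache.get(callId).reply = reply; // assumes startRequest was called
  }

  /** Returns the cached reply, or null if the retry must keep waiting. */
  String handleRetry(long callId) {
    Entry e = cache.get(callId);
    // Do NOT answer from the cache while the reply is delayed; returning
    // null forces the retry to block until the request truly completes.
    return (e == null) ? null : e.reply;
  }
}
```

With this shape, a client retrying a Replication_ALL request gets no cached answer until `complete` runs, which is the behavior the patch aims for.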





[jira] [Updated] (RATIS-290) Raft server should notify the state machine if no leader is assigned for a long time

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-290:

Attachment: RATIS-290.003.patch

> Raft server should notify the state machine if no leader is assigned for a 
> long time
> 
>
> Key: RATIS-290
> URL: https://issues.apache.org/jira/browse/RATIS-290
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
>  Labels: ozone
> Fix For: 0.3.0
>
> Attachments: RATIS-290.003.patch
>
>
> In Ratis, a raft server can be in one of three states: Candidate, Leader, 
> and Follower. In a cluster, the steady state is one node being the leader 
> and the others being followers. This jira proposes to add a new API to 
> identify whether a node has been left aside because of a network partition 
> and is without a leader.
> Once it is detected that a node has been a candidate for a sufficiently long 
> time, a callback to the state machine will be triggered to handle this 
> partitioned node.
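The proposed notification can be sketched as a small watchdog (illustrative names only, not the actual Ratis API): track how long the server has had no known leader and fire a one-shot state-machine callback once a configurable threshold is exceeded.

```java
// Hypothetical sketch of the "no leader for too long" notification.
class NoLeaderWatchdog {
  private final long timeoutMs;
  private final Runnable notifyStateMachine; // e.g. a partition handler
  private long noLeaderSinceMs = -1; // -1 means a leader is known
  private boolean notified = false;

  NoLeaderWatchdog(long timeoutMs, Runnable notifyStateMachine) {
    this.timeoutMs = timeoutMs;
    this.notifyStateMachine = notifyStateMachine;
  }

  /** Called when a heartbeat from a valid leader is seen. */
  void onLeaderKnown() {
    noLeaderSinceMs = -1;
    notified = false;
  }

  /** Called periodically while the server remains without a leader. */
  void onNoLeader(long nowMs) {
    if (noLeaderSinceMs < 0) {
      noLeaderSinceMs = nowMs; // start of the leaderless period
    }
    if (!notified && nowMs - noLeaderSinceMs >= timeoutMs) {
      notified = true; // fire the callback only once per outage
      notifyStateMachine.run();
    }
  }
}
```

The one-shot `notified` flag keeps a partitioned candidate from flooding the state machine with repeated callbacks; a fresh leader heartbeat re-arms the watchdog.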





[jira] [Comment Edited] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576699#comment-16576699
 ] 

Tsz Wo Nicholas Sze edited comment on RATIS-270 at 8/10/18 6:33 PM:


> 1) RaftServerImpl:1069, lets rename updateCache to isReplyDelayed, I feel 
> that this will help with the comments as well, ...

For a follower, there is no reply, so I will keep the name "updateCache". 
Sure, let's add more comments.

Will also address #2 and #3 in the next patch.  Thanks Mukul.


was (Author: szetszwo):
> 1) RaftServerImpl:1069, lets rename updateCache to isReplyDelayed, I feel 
> that this will help with the comments as well, ...

For a follower, there is no reply, so I will keep the name "updateCache". 
Sure, let's add more comments.

Will address #2 and #3 in the next patch.

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180803.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case where the client retries a 
> request whose response is stuck in the delayed-replies queue. The retry is 
> then answered from the retry cache even though the request has not yet 
> completed on all the nodes.





[jira] [Commented] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576727#comment-16576727
 ] 

Tsz Wo Nicholas Sze commented on RATIS-260:
---

+1 patch looks good.

[~shashikant], have you tested it with Ozone to see if this fixes the problem?

> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch
>
>
> This bug was reproduced with Ozone using Ratis for the data pipeline.
> In this test, one of the nodes was shut down permanently. This can result 
> in a situation where a candidate node is never able to move out of the 
> leader election phase.
> {code}
> 2018-06-15 07:44:58,246 INFO org.apache.ratis.server.impl.LeaderElection: 
> 0f7b9cd2-4dad-46d7-acbc-57d424492d00_9858 got exception when requesting 
> votes: {}
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at 
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
> at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
> at 
> org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:131)
> at 
> org.apache.ratis.shaded.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:281)
> at 
> org.apache.ratis.grpc.server.RaftServerProtocolClient.requestVote(RaftServerProtocolClient.java:61)
> at 
> org.apache.ratis.grpc.RaftGRpcService.requestVote(RaftGRpcService.java:147)
> at 
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.ratis.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: y128.l42scl.hortonworks.com/172.26.32.228:9858
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.ratis.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> ... 1 more
> Caused by: java.net.ConnectException: Connection refused
> ... 11 more
> {code}
> This happens because of the following lines of code during requestVote.
> {code}
> for (final RaftPeer peer : others) {
>   final RequestVoteRequestProto r = server.createRequestVoteRequest(
>   peer.getId(), electionTerm, lastEntry);
>   service.submit(
>   () -> 
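A hedged sketch of the proposed fix (hypothetical names, not Ratis's actual LeaderElection code): submit a vote request to every peer and collect the results independently, so that a connection failure from one peer fails only that peer's future and the election still counts votes from the remaining peers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of fault-tolerant vote collection.
class LeaderElectionSketch {
  /** Ask every peer for a vote; return the number of votes granted. */
  static int askForVotes(List<String> peers,
                         Function<String, Boolean> requestVote) {
    ExecutorService service = Executors.newFixedThreadPool(peers.size());
    try {
      List<Future<Boolean>> futures = new ArrayList<>();
      for (String peer : peers) {
        // Each request runs in its own task, isolated from the others.
        futures.add(service.submit(() -> requestVote.apply(peer)));
      }
      int granted = 0;
      for (Future<Boolean> f : futures) {
        try {
          if (f.get()) {
            granted++;
          }
        } catch (ExecutionException | InterruptedException e) {
          // An unreachable peer (e.g. "Connection refused") fails only
          // its own future; keep collecting votes from the other peers.
        }
      }
      return granted;
    } finally {
      service.shutdown();
    }
  }
}
```

With three peers of which one is permanently down, the candidate still collects two granted votes and can reach a majority instead of aborting the election on the first exception.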

[jira] [Updated] (RATIS-298) Update auto-common and log4j versions in ratis

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-298:

Description: 
Update auto-common to 0.8 and log4j to 2.11.0 in Ratis.

The current versions cause compilation issues in Ozone, as follows:

{code}
[WARNING] 
Dependency convergence error for com.google.auto:auto-common:0.10 paths to 
dependency are:
+-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
  +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
+-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
  +-com.google.auto:auto-common:0.10
and
+-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
  +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
+-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
  +-com.google.auto.service:auto-service:1.0-rc4
+-com.google.auto:auto-common:0.8
 
[WARNING] 
Dependency convergence error for org.apache.logging.log4j:log4j-api:2.6.2 paths 
to dependency are:
+-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
  +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
+-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
  +-org.apache.logging.log4j:log4j-api:2.6.2
and
+-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
  +-org.apache.logging.log4j:log4j-api:2.11.0
and
+-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
  +-org.apache.logging.log4j:log4j-core:2.11.0
+-org.apache.logging.log4j:log4j-api:2.11.0
{code}

  was:Update auto-common to 0.8 and log4j to 2.11.0 versions in ratis, 


> Update auto-common and log4j versions in ratis
> --
>
> Key: RATIS-298
> URL: https://issues.apache.org/jira/browse/RATIS-298
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
> Attachments: RATIS-298.001.patch
>
>
> Update auto-common to 0.8 and log4j to 2.11.0 in Ratis.
> The current versions cause compilation issues in Ozone, as follows:
> {code}
> [WARNING] 
> Dependency convergence error for com.google.auto:auto-common:0.10 paths to 
> dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto:auto-common:0.10
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto.service:auto-service:1.0-rc4
> +-com.google.auto:auto-common:0.8
>  
> [WARNING] 
> Dependency convergence error for org.apache.logging.log4j:log4j-api:2.6.2 
> paths to dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.6.2
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.11.0
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-core:2.11.0
> +-org.apache.logging.log4j:log4j-api:2.11.0
> {code}





[jira] [Commented] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576778#comment-16576778
 ] 

Hadoop QA commented on RATIS-270:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
6s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
5s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 22s{color} | {color:orange} root: The patch generated 50 new + 613 unchanged 
- 23 fixed = 663 total (was 636) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 23m 
57s{color} | {color:green} root in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 7s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 30m 52s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-08-10 
|
| JIRA Issue | RATIS-270 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935176/r270_20180810b.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  checkstyle  
compile  |
| uname | Linux 57daf79ad248 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 6a2c3d5 |
| Default Java | 1.8.0_171 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-RATIS-Build/291/artifact/out/diff-checkstyle-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/291/testReport/ |
| modules | C: ratis-server ratis-grpc U: . |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/291/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810b.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> 

[jira] [Created] (RATIS-298) Update auto-common and log4j versions in ratis

2018-08-10 Thread Mukul Kumar Singh (JIRA)
Mukul Kumar Singh created RATIS-298:
---

 Summary: Update auto-common and log4j versions in ratis
 Key: RATIS-298
 URL: https://issues.apache.org/jira/browse/RATIS-298
 Project: Ratis
  Issue Type: Bug
  Components: server
Affects Versions: 0.3.0
Reporter: Mukul Kumar Singh
Assignee: Mukul Kumar Singh
 Fix For: 0.3.0


Update auto-common to 0.8 and log4j to 2.11.0 versions in ratis, 





[jira] [Updated] (RATIS-298) Update auto-common and log4j versions

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-298:

Summary: Update auto-common and log4j versions  (was: Update auto-common 
and log4j versions in ratis)

> Update auto-common and log4j versions
> -
>
> Key: RATIS-298
> URL: https://issues.apache.org/jira/browse/RATIS-298
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.3.0
>
> Attachments: RATIS-298.001.patch
>
>
> Update auto-common to 0.8 and log4j to 2.11.0 in Ratis.
> The current versions cause compilation issues in Ozone, as follows:
> {code}
> [WARNING] 
> Dependency convergence error for com.google.auto:auto-common:0.10 paths to 
> dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto:auto-common:0.10
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-com.google.auto.service:auto-service:1.0-rc4
> +-com.google.auto:auto-common:0.8
>  
> [WARNING] 
> Dependency convergence error for org.apache.logging.log4j:log4j-api:2.6.2 
> paths to dependency are:
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.ratis:ratis-server:0.3.0-6a2c3d5-SNAPSHOT
> +-org.apache.ratis:ratis-proto-shaded:0.3.0-6a2c3d5-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.6.2
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-api:2.11.0
> and
> +-org.apache.hadoop:hadoop-hdds-common:0.2.1-SNAPSHOT
>   +-org.apache.logging.log4j:log4j-core:2.11.0
> +-org.apache.logging.log4j:log4j-api:2.11.0
> {code}





[jira] [Updated] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-260:

Attachment: hadoop-hdfs-datanode-y128.log

> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch, hadoop-hdfs-datanode-y130.log
>
>
> This bug was reproduced with Ozone using Ratis for the data pipeline.
> In this test, one of the nodes was shut down permanently. This can result 
> in a situation where a candidate node is never able to move out of the 
> leader election phase.
> {code}
> 2018-06-15 07:44:58,246 INFO org.apache.ratis.server.impl.LeaderElection: 
> 0f7b9cd2-4dad-46d7-acbc-57d424492d00_9858 got exception when requesting 
> votes: {}
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at 
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
> at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
> at 
> org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
> at 
> org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:131)
> at 
> org.apache.ratis.shaded.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:281)
> at 
> org.apache.ratis.grpc.server.RaftServerProtocolClient.requestVote(RaftServerProtocolClient.java:61)
> at 
> org.apache.ratis.grpc.RaftGRpcService.requestVote(RaftGRpcService.java:147)
> at 
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.ratis.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: y128.l42scl.hortonworks.com/172.26.32.228:9858
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.ratis.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at 
> org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> ... 1 more
> Caused by: java.net.ConnectException: Connection refused
> ... 11 more
> {code}
> This happens because of the following lines of code during requestVote.
> {code}
> for (final RaftPeer peer : others) {
>   final RequestVoteRequestProto r = server.createRequestVoteRequest(
>   peer.getId(), electionTerm, lastEntry);
>   service.submit(
>   () -> server.getServerRpc().requestVote(r));
>   submitted++;
> }
> {code}
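A minimal, self-contained sketch of the intended fix (hypothetical names, not Ratis code): wait on each future individually and treat a failed peer as a missed vote, instead of letting its ExecutionException abort the whole election round.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class VoteCollector {
  /** Submit one request per peer; the peer named failingPeer throws. */
  static int collectVotes(String[] peers, String failingPeer) throws InterruptedException {
    ExecutorService service = Executors.newFixedThreadPool(peers.length);
    List<Future<String>> futures = new ArrayList<>();
    for (String peer : peers) {
      futures.add(service.submit(() -> {
        if (peer.equals(failingPeer)) {
          throw new java.io.IOException("Connection refused: " + peer);
        }
        return peer + ": vote granted";
      }));
    }
    int granted = 0;
    for (Future<String> f : futures) {
      try {
        f.get();                 // may throw ExecutionException
        granted++;
      } catch (ExecutionException e) {
        // Log the failure and keep waiting on the remaining peers instead
        // of letting one dead peer abort the election round.
        System.out.println("request failed: " + e.getCause().getMessage());
      }
    }
    service.shutdown();
    return granted;
  }

  public static void main(String[] args) throws InterruptedException {
    // Two of three peers respond, so the round can still reach a majority.
    System.out.println("granted=" + collectVotes(
        new String[] {"peer0", "peer1", "peer2"}, "peer1"));
  }
}
```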



--

[jira] [Updated] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh updated RATIS-260:

Attachment: hadoop-hdfs-datanode-y130.log

> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch, hadoop-hdfs-datanode-y130.log
>



--

[jira] [Updated] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-270:
--
Attachment: (was: r270_20180810.patch)

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810b.patch
>
>
> Retry requests are answered from the retry cache when requests have 
> Replication_ALL semantics. This leads to a case where the client retries a 
> response that is stuck in the delayed-replies queue. This new retry is then 
> answered from the retry cache even though the request has not been completed 
> on all the nodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576727#comment-16576727
 ] 

Tsz Wo Nicholas Sze edited comment on RATIS-260 at 8/10/18 7:10 PM:


+1 patch looks good.

[~shashikant], have you tested it with Ozone to see if this can fix the problem?


was (Author: szetszwo):
+1 patch looks good.

[~shashikant], have tested it with Ozone to see if this can fix the problem?

> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch
>

[jira] [Commented] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576745#comment-16576745
 ] 

Tsz Wo Nicholas Sze commented on RATIS-270:
---

r270_20180810b.patch: moves the new test to TestRetryCacheWithGrpc.

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810b.patch
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-270:
--
Attachment: r270_20180810b.patch

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180810b.patch
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576842#comment-16576842
 ] 

Tsz Wo Nicholas Sze commented on RATIS-260:
---

{quote}
No, it is a bug in LeaderElection.waitForResults(LeaderElection.java:214) 
according to the given stack trace.
{quote}

Sorry [~shashikant].  My above comment was wrong.  The stack trace indeed shows 
that the StatusRuntimeException is wrapped in an ExecutionException, so catching 
StatusRuntimeException directly does not help.
{code}
java.util.concurrent.ExecutionException: 
org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
{code}
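A minimal illustration of the wrapping, using plain java.util.concurrent with a RuntimeException standing in for the shaded StatusRuntimeException: Future.get() rethrows the task's exception wrapped in an ExecutionException, so only getCause() reaches the original failure.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class UnwrapDemo {
  /** Run a failing task and return the message of the unwrapped cause. */
  static String failureMessage() throws InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Runnable task = () -> {
      // Stand-in for the shaded gRPC StatusRuntimeException.
      throw new RuntimeException("UNAVAILABLE: io exception");
    };
    Future<?> f = pool.submit(task);
    try {
      f.get();
      return null;               // not reached: the task always fails
    } catch (ExecutionException e) {
      // A catch (StatusRuntimeException) around get() would never fire;
      // the real failure is only reachable via getCause().
      return e.getCause().getMessage();
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("cause: " + failureMessage());
  }
}
```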

> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch
>

[jira] [Commented] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails

2018-08-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575979#comment-16575979
 ] 

Hadoop QA commented on RATIS-260:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
27s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 15s{color} | {color:orange} root: The patch generated 1 new + 50 unchanged - 
1 fixed = 51 total (was 51) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 23m 
17s{color} | {color:green} root in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 8s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 30m  3s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/ratis:date2018-08-10 
|
| JIRA Issue | RATIS-260 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12934968/RATIS-260.00.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  checkstyle  
compile  |
| uname | Linux 02f4772880c3 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh
 |
| git revision | master / 6a2c3d5 |
| Default Java | 1.8.0_171 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-RATIS-Build/288/artifact/out/diff-checkstyle-root.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-RATIS-Build/288/testReport/ |
| modules | C: ratis-server U: ratis-server |
| Console output | 
https://builds.apache.org/job/PreCommit-RATIS-Build/288/console |
| Powered by | Apache Yetus 0.5.0   http://yetus.apache.org |


This message was automatically generated.



> Ratis Leader election should try for other peers even when ask for votes fails
> --
>
> Key: RATIS-260
> URL: https://issues.apache.org/jira/browse/RATIS-260
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-260.00.patch
>

[jira] [Assigned] (RATIS-291) Raft Server should fail themselves when a raft storage directory fails

2018-08-10 Thread Mukul Kumar Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Kumar Singh reassigned RATIS-291:
---

Assignee: Shashikant Banerjee  (was: Mukul Kumar Singh)

> Raft Server should fail themselves when a raft storage directory fails
> --
>
> Key: RATIS-291
> URL: https://issues.apache.org/jira/browse/RATIS-291
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Fix For: 0.3.0
>
>
> A Raft server uses a storage directory to store the write-ahead log. If this 
> log is lost for any reason, the node should fail itself.
> For a follower whose raft log location has failed, the follower will not 
> be able to append any entries. That node will then lag behind the 
> leader and will eventually be notified via notifySlowness.
> For a leader whose raft log disk has failed, the leader will not append 
> any new entries to its log; with respect to the raft ring, however, the 
> leader will still appear healthy. This jira proposes to add a new api to 
> identify a leader with a failed log location.
> This jira also proposes to add a new api to the state machine, so that 
> state machine implementations can provide methods to verify the raft log 
> location.
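One way the proposed state-machine hook could verify a raft log location is a simple write probe; the method below is a sketch under assumed semantics, not an actual Ratis API.

```java
import java.io.File;
import java.io.IOException;

public class StorageHealthCheck {
  /**
   * Hypothetical health probe: a storage directory is considered healthy
   * only if it exists, is writable, and a small probe file can actually
   * be created and removed in it.
   */
  static boolean isStorageHealthy(File dir) {
    if (!dir.isDirectory() || !dir.canWrite()) {
      return false;
    }
    try {
      File probe = File.createTempFile("health", ".probe", dir);
      return probe.delete();
    } catch (IOException e) {
      return false;              // failed write: treat the directory as failed
    }
  }

  public static void main(String[] args) {
    File dir = new File(System.getProperty("java.io.tmpdir"));
    System.out.println("healthy=" + isStorageHealthy(dir));
  }
}
```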



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RATIS-270) Replication ALL requests should not be replied from retry cache if they are delayed.

2018-08-10 Thread Mukul Kumar Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576145#comment-16576145
 ] 

Mukul Kumar Singh commented on RATIS-270:
-

Thanks for working on this [~szetszwo]. The patch looks good to me; a few minor 
comments below.

1) RaftServerImpl:1069, let's rename updateCache to isReplyDelayed; I feel that 
this will help with the comments as well. Also, let's add a note that for delayed 
replies, the retryCache will be updated as part of RaftClientReply#getReply.
2) RetryCacheTests.java:205, let's add a small note inside the fail statement.
3) Inside RetryCacheTests.java, when the first set of requests fails, can we 
retry with another round of the same requests and make sure that they 
block on the failed node instead of reading from the retry cache of the 
leader?
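The rule behind comment #3 can be sketched with a tiny in-memory model (hypothetical class and field names, not Ratis's actual RetryCache API): a retry may be answered from the cache only once its reply is complete on all nodes, never while the reply is still sitting in the delayed queue.

```java
import java.util.HashMap;
import java.util.Map;

public class RetryCacheSketch {
  static final class Entry {
    final String reply;
    final boolean delayed;       // reply still waiting on slow followers
    Entry(String reply, boolean delayed) {
      this.reply = reply;
      this.delayed = delayed;
    }
  }

  private final Map<Long, Entry> cache = new HashMap<>();

  void record(long callId, String reply, boolean delayed) {
    cache.put(callId, new Entry(reply, delayed));
  }

  /** Returns the cached reply, or null if the retry must block and wait. */
  String lookup(long callId) {
    Entry e = cache.get(callId);
    return (e == null || e.delayed) ? null : e.reply;
  }

  public static void main(String[] args) {
    RetryCacheSketch c = new RetryCacheSketch();
    c.record(1L, "ok", true);           // reply is held in the delayed queue
    System.out.println(c.lookup(1L));   // retry must wait, not be answered
    c.record(1L, "ok", false);          // replication reached all nodes
    System.out.println(c.lookup(1L));   // now the cache may answer
  }
}
```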

> Replication ALL requests should not be replied from retry cache if they are 
> delayed.
> 
>
> Key: RATIS-270
> URL: https://issues.apache.org/jira/browse/RATIS-270
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Attachments: r270_20180803.patch
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)