[jira] [Created] (RATIS-198) Ozone Ratis test is failing with Socket IO exception during Key Creation
Shashikant Banerjee created RATIS-198: - Summary: Ozone Ratis test is failing with Socket IO exception during Key Creation Key: RATIS-198 URL: https://issues.apache.org/jira/browse/RATIS-198 Project: Ratis Issue Type: Bug Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee Attachments: HDFS-12794-HDFS-7240.009.patch_tmp While Executing TestCorona#ratisTest3, with the attached patch hit the below exception. {code:java} 2018-01-23 18:15:11,058 [IPC Server handler 5 on 51292] INFO scm.StorageContainerManager (StorageContainerManager.java:notifyObjectStageChange(687)) - Object type container name 2efd4054-c479-45a4-a1db-3a4ec3526d4d op create new stage complete 100.00% |█| 20/20 Time: 0:00:05 Jan 23, 2018 6:15:11 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl maybeTerminateChannel INFO: [ManagedChannelImpl@7202ef94] Terminated Jan 23, 2018 6:15:11 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl maybeTerminateChannel INFO: [ManagedChannelImpl@5e5452c3] Terminated Jan 23, 2018 6:15:11 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl maybeTerminateChannel INFO: [ManagedChannelImpl@72d74e90] Terminated Jan 23, 2018 6:15:11 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl maybeTerminateChannel INFO: [ManagedChannelImpl@3679cc6c] Terminated Jan 23, 2018 6:15:11 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl maybeTerminateChannel INFO: [ManagedChannelImpl@589f60fd] Terminated Jan 23, 2018 6:15:11 PM org.apache.ratis.shaded.io.grpc.netty.NettyServerHandler onConnectionError WARNING: Connection Error java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at org.apache.ratis.shaded.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) {code}{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-198) Ozone Ratis test is failing with Socket IO exception during Key Creation
[ https://issues.apache.org/jira/browse/RATIS-198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-198: -- Attachment: HDFS-12794-HDFS-7240.009.patch_tmp > Ozone Ratis test is failing with Socket IO exception during Key Creation > > > Key: RATIS-198 > URL: https://issues.apache.org/jira/browse/RATIS-198 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Attachments: HDFS-12794-HDFS-7240.009.patch_tmp > > > While Executing TestCorona#ratisTest3, with the attached patch hit the below > exception. > {code:java} > 2018-01-23 18:15:11,058 [IPC Server handler 5 on 51292] INFO > scm.StorageContainerManager > (StorageContainerManager.java:notifyObjectStageChange(687)) - Object type > container name 2efd4054-c479-45a4-a1db-3a4ec3526d4d op create new stage > complete > 100.00% > |█| > 20/20 Time: 0:00:05 > Jan 23, 2018 6:15:11 PM > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl > maybeTerminateChannel > INFO: [ManagedChannelImpl@7202ef94] Terminated > Jan 23, 2018 6:15:11 PM > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl > maybeTerminateChannel > INFO: [ManagedChannelImpl@5e5452c3] Terminated > Jan 23, 2018 6:15:11 PM > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl > maybeTerminateChannel > INFO: [ManagedChannelImpl@72d74e90] Terminated > Jan 23, 2018 6:15:11 PM > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl > maybeTerminateChannel > INFO: [ManagedChannelImpl@3679cc6c] Terminated > Jan 23, 2018 6:15:11 PM > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelImpl > maybeTerminateChannel > INFO: [ManagedChannelImpl@589f60fd] Terminated > Jan 23, 2018 6:15:11 PM > org.apache.ratis.shaded.io.grpc.netty.NettyServerHandler onConnectionError > WARNING: Connection Error > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > org.apache.ratis.shaded.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) > {code}{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: RATIS-176.001.patch > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Attachments: RATIS-176.001.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
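For illustration, the missing check described in RATIS-176 could look roughly like the sketch below. The class and field names here (LogEntryBuffer, maxBufferSize, totalSize) are assumptions for the sketch, not the actual Ratis code:
{code:java}
// Sketch only: raise an error instead of silently dropping an entry when the
// accumulated size would exceed the configured buffer limit.
// Class and field names are placeholders, not the real Ratis implementation.
final class LogEntryBuffer {
  private final long maxBufferSize;
  private long totalSize;

  LogEntryBuffer(long maxBufferSize) {
    this.maxBufferSize = maxBufferSize;
  }

  /** Add an entry, or fail loudly if it would overflow the buffer. */
  void addEntry(byte[] entry) {
    if (totalSize + entry.length > maxBufferSize) {
      // Previously the entry was silently skipped; throwing makes the
      // rejection visible to the caller.
      throw new IllegalStateException("Entry of " + entry.length
          + " bytes would exceed maxBufferSize=" + maxBufferSize);
    }
    totalSize += entry.length;
    // ... buffer the entry ...
  }
}
{code}
The point of throwing here is that the caller learns immediately that the entry was rejected, instead of the entry being lost without any signal.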
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Priority: Minor (was: Major) > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.001.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: (was: RATIS-176.001.patch) > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.001.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: RATIS-176.001.patch > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.001.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407587#comment-16407587 ] Shashikant Banerjee commented on RATIS-176: --- Resubmitted the patch to re-trigger Jenkins. > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.001.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412489#comment-16412489 ] Shashikant Banerjee commented on RATIS-176: --- Thanks [~szetszwo], for the review comments. I have a few doubts regarding the comments. 1. "The check should be in LogAppender.addEntry(..)": by moving the check into LogAppender.addEntry on the server side, createRequest fails, the sender is stopped, and a re-election is triggered. This goes on in an infinite loop. I think we should bail out instead of retrying indefinitely. When the client sends a message, even the first log entry can be greater than the configured maxBufferSize; in that case the raft client itself can detect and handle it. > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.001.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
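The client-side alternative suggested in the comment above could be as simple as rejecting an oversized message before it is ever submitted. This is only a sketch; the method name and how the client learns the limit are assumptions:
{code:java}
// Sketch only: a client-side guard that rejects a message which could never
// fit in the leader's append buffer, instead of letting the server side fail
// and retry indefinitely. Names and the source of the limit are placeholders.
final class ClientSizeCheck {
  static void validateMessageSize(int payloadLength, long maxBufferSize) {
    if (payloadLength > maxBufferSize) {
      throw new IllegalArgumentException("Message of " + payloadLength
          + " bytes exceeds the configured max buffer size " + maxBufferSize);
    }
  }
}
{code}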
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: RATIS-176.002.patch > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.001.patch, RATIS-176.002.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415403#comment-16415403 ] Shashikant Banerjee commented on RATIS-176: --- Thanks [~szetszwo], for the review. patch v2 addresses your review comments. > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.001.patch, RATIS-176.002.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: (was: RATIS-176.001.patch) > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: (was: RATIS-176.002.patch) > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: RATIS-176.003.patch > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.003.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416111#comment-16416111 ] Shashikant Banerjee commented on RATIS-176: --- Thanks [~szetszwo], for the review. I have removed the earlier patches and uploaded a v3 patch which addresses your review comments. > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.003.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: (was: RATIS-176.003.patch) > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-176: -- Attachment: RATIS-176.004.patch > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.004.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-176) Log Appender should throw an Exception in case append entry size exceeds the maxBufferSize configured
[ https://issues.apache.org/jira/browse/RATIS-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416831#comment-16416831 ] Shashikant Banerjee commented on RATIS-176: --- Thanks [~szetszwo], for the review comments. Patch v4 addresses your review comments. > Log Appender should throw an Exception in case append entry size exceeds the > maxBufferSize configured > -- > > Key: RATIS-176 > URL: https://issues.apache.org/jira/browse/RATIS-176 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Minor > Attachments: RATIS-176.004.patch > > > LogAppender while adding append entry in LogEntryBuffer, checks whether the > total allocated for all entries does not exceed the maxBufferSize allocated. > In case, the size exceeds the limit ,entries are not added to the buffer but > no exception is thrown . This case needs to be handled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails
[ https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-260: -- Attachment: RATIS-260.00.patch > Ratis Leader election should try for other peers even when ask for votes fails > -- > > Key: RATIS-260 > URL: https://issues.apache.org/jira/browse/RATIS-260 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-260.00.patch > > > This bug was simulated using Ozone using Ratis for Data pipeline. > In this test, one of the nodes was shut down permanently. This can result > into a situation where a candidate node is never able to move out of Leader > Election phase. > {code} > 2018-06-15 07:44:58,246 INFO org.apache.ratis.server.impl.LeaderElection: > 0f7b9cd2-4dad-46d7-acbc-57d424492d00_9858 got exception when requesting > votes: {} > java.util.concurrent.ExecutionException: > org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io > exception > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214) > at > org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146) > at > org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102) > Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: > UNAVAILABLE: io exception > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221) > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202) > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:131) > at > org.apache.ratis.shaded.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:281) > at > org.apache.ratis.grpc.server.RaftServerProtocolClient.requestVote(RaftServerProtocolClient.java:61) > at > org.apache.ratis.grpc.RaftGRpcService.requestVote(RaftGRpcService.java:147) > at > org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.ratis.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: y128.l42scl.hortonworks.com/172.26.32.228:9858 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.ratis.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) > at > org.apache.ratis.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > 
org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > ... 1 more > Caused by: java.net.ConnectException: Connection refused > ... 11 more > {code} > This happens because of the following lines of the code during requestVote. > {code} > for (final RaftPeer peer : others) { > final RequestVoteRequestProto r = server.createRequestVoteRequest( > peer.getId(), electionTerm, lastEntry); > service.submit( > () -> server.getServerRpc().requestVote(r)); > submitted++; > } > {code} -- This message was sent by Atlassian JIR
[jira] [Commented] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails
[ https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574726#comment-16574726 ] Shashikant Banerjee commented on RATIS-260: --- With Patch v0, LeaderElection.waitForResults also catches StatusRuntimeException and adds to the exception list. > Ratis Leader election should try for other peers even when ask for votes fails > -- > > Key: RATIS-260 > URL: https://issues.apache.org/jira/browse/RATIS-260 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-260.00.patch > > > This bug was simulated using Ozone using Ratis for Data pipeline. > In this test, one of the nodes was shut down permanently. This can result > into a situation where a candidate node is never able to move out of Leader > Election phase. > {code} > 2018-06-15 07:44:58,246 INFO org.apache.ratis.server.impl.LeaderElection: > 0f7b9cd2-4dad-46d7-acbc-57d424492d00_9858 got exception when requesting > votes: {} > java.util.concurrent.ExecutionException: > org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io > exception > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214) > at > org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146) > at > org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102) > Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: > UNAVAILABLE: io exception > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221) > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202) > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:131) > at > org.apache.ratis.shaded.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:281) > at > org.apache.ratis.grpc.server.RaftServerProtocolClient.requestVote(RaftServerProtocolClient.java:61) > at > org.apache.ratis.grpc.RaftGRpcService.requestVote(RaftGRpcService.java:147) > at > org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.ratis.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: y128.l42scl.hortonworks.com/172.26.32.228:9858 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.ratis.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) > at > org.apache.ratis.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) > at > 
org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > ... 1 more > Caused by: java.net.ConnectException: Connection refused > ... 11 more > {code} > This happens because of the following lines of the code during requestVote. > {code} > for (final RaftPeer peer : others) { > final RequestVoteRequestProto r = server.createRequestVoteRequest( > peer.getId(), electionTerm, lastEntry); > service.submit( >
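A rough sketch of the behaviour described in the comment above: while waiting for vote replies, a per-peer failure (such as a gRPC StatusRuntimeException surfacing as an ExecutionException) is recorded and the remaining futures are still awaited, so one unreachable peer cannot end the election round. The types and structure below are simplified assumptions, not the actual LeaderElection code:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;

// Sketch only: wait for all submitted vote requests, recording per-peer
// failures rather than letting the first dead peer abort the round.
final class VoteCollector {
  static <R> List<R> waitForResults(CompletionService<R> service, int submitted)
      throws InterruptedException {
    final List<R> replies = new ArrayList<>();
    final List<Exception> failures = new ArrayList<>();
    for (int i = 0; i < submitted; i++) {
      try {
        replies.add(service.take().get());
      } catch (ExecutionException e) {
        failures.add(e); // connection refused / UNAVAILABLE: keep counting votes
      }
    }
    // In real server code the failure list would be logged before deciding
    // whether a majority of votes was still reached.
    return replies;
  }
}
{code}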
[jira] [Commented] (RATIS-260) Ratis Leader election should try for other peers even when ask for votes fails
[ https://issues.apache.org/jira/browse/RATIS-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576770#comment-16576770 ] Shashikant Banerjee commented on RATIS-260: --- Thanks [~szetszwo], for the review. The issue is not recreatable consistently with Ozone. As discussed with [~msingh], it was hit after 50 runs of Freon in cluster once. I ran basic Freon in Ozone and it worked well. > Ratis Leader election should try for other peers even when ask for votes fails > -- > > Key: RATIS-260 > URL: https://issues.apache.org/jira/browse/RATIS-260 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-260.00.patch > > > This bug was simulated using Ozone using Ratis for Data pipeline. > In this test, one of the nodes was shut down permanently. This can result > into a situation where a candidate node is never able to move out of Leader > Election phase. > {code} > 2018-06-15 07:44:58,246 INFO org.apache.ratis.server.impl.LeaderElection: > 0f7b9cd2-4dad-46d7-acbc-57d424492d00_9858 got exception when requesting > votes: {} > java.util.concurrent.ExecutionException: > org.apache.ratis.shaded.io.grpc.StatusRuntimeException: UNAVAILABLE: io > exception > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214) > at > org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146) > at > org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102) > Caused by: org.apache.ratis.shaded.io.grpc.StatusRuntimeException: > UNAVAILABLE: io exception > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221) > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202) > at > org.apache.ratis.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:131) > at > org.apache.ratis.shaded.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:281) > at > org.apache.ratis.grpc.server.RaftServerProtocolClient.requestVote(RaftServerProtocolClient.java:61) > at > org.apache.ratis.grpc.RaftGRpcService.requestVote(RaftGRpcService.java:147) > at > org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.ratis.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: > Connection refused: y128.l42scl.hortonworks.com/172.26.32.228:9858 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.ratis.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) > at > org.apache.ratis.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) > at > 
org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at > org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > ... 1 more > Caused by: java.net.ConnectException: Connection refused > ... 11 more > {code} > This happens because of the following lines of the code during requestVote. > {code} > for (final RaftPeer peer : others) { > final RequestVoteRequestProto r = serve
[jira] [Commented] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583845#comment-16583845 ] Shashikant Banerjee commented on RATIS-295: --- Thanks [~msingh] for reporting and initiating work on this and [~szetszwo], for the review comments. Patch v2 addresses your review comments. > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-295.001.patch, RATIS-295.02.patch > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
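The change described for RATIS-295 amounts to gating the flushed-index update on two completions instead of one. A minimal sketch, assuming both writes are exposed as CompletableFutures; the method and parameter names are illustrative, not the actual RaftLogWorker code:
{code:java}
import java.util.concurrent.CompletableFuture;

// Sketch only: advance the flushed index only after BOTH the raft log flush
// and the state machine data write complete, not just the log flush.
final class FlushBarrier {
  static CompletableFuture<Void> flushAndUpdateIndex(
      CompletableFuture<Void> logFlush,
      CompletableFuture<Void> stateMachineDataWrite,
      Runnable updateFlushedIndex) {
    return CompletableFuture.allOf(logFlush, stateMachineDataWrite)
        .thenRun(updateFlushedIndex);
  }
}
{code}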
[jira] [Updated] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-295: -- Attachment: RATIS-295.02.patch > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-295.001.patch, RATIS-295.02.patch > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-295: -- Attachment: (was: RATIS-295.02.patch) > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-295: -- Attachment: RATIS-295.03.patch > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-295.03.patch > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584033#comment-16584033 ] Shashikant Banerjee commented on RATIS-295: --- Removed the earlier patch and added patch v3 which addresses the test failures. > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-295.03.patch > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-295: -- Attachment: RATIS-295.04.patch > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-295.03.patch, RATIS-295.04.patch > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-295: -- Attachment: (was: RATIS-295.03.patch) > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-295.04.patch > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-295) RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine data is also written
[ https://issues.apache.org/jira/browse/RATIS-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584208#comment-16584208 ] Shashikant Banerjee commented on RATIS-295: --- Thanks [~szetszwo], for the review comments. Patch v4 addresses the same. > RaftLogWorker#WriteLog#excute should updateFlushedIndex after state machine > data is also written > > > Key: RATIS-295 > URL: https://issues.apache.org/jira/browse/RATIS-295 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-295.04.patch > > > Currently raft log worker only waits for the log data flush to finish. > However it should also wait for state machine data write to finish as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-301) provide a force option to reinitialize group from a client in a different group
[ https://issues.apache.org/jira/browse/RATIS-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585627#comment-16585627 ] Shashikant Banerjee commented on RATIS-301: --- Thanks [~msingh], for reporting and working on this. Some very minor comments: # RaftClient.java: 96 : can we rename the API to "forceReinitialize" to be consistent with the other changes in the patch. # Can we add a test case to verify the API? > provide a force option to reinitialize group from a client in a different > group > --- > > Key: RATIS-301 > URL: https://issues.apache.org/jira/browse/RATIS-301 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-301.001.patch > > > Currently for a client to re-initialize a raft group, it should be in the > same group as the server's current group. This jira proposes to add a force > option to override this requirement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-291) Raft Server should fail themselves when a raft storage directory fails
[ https://issues.apache.org/jira/browse/RATIS-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-291: -- Attachment: RATIS-291.01.patch > Raft Server should fail themselves when a raft storage directory fails > -- > > Key: RATIS-291 > URL: https://issues.apache.org/jira/browse/RATIS-291 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-291.01.patch > > > A Raft server uses a storage directory to store the write ahead log. If this > log is lost because of a reason, then this node should fail itself. > For a follower, if raft log location has failed, then the follower will not > be able to append any entries. This node will now be lagging behind the > follower and will eventually be notified via notifySlowness. > For a leader where the raft log disk has failed, the leader will not append > any new entries to its log. However with respect to the raft ring, the leader > will still remain healthy. This jira proposes to add a new api to identify a > leader with failed node. > Also this jira also proposes to add a new api to the statemachine, so that > state machine implementation can provide methods to verify the raft log > location. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-291) Raft Server should fail themselves when a raft storage directory fails
[ https://issues.apache.org/jira/browse/RATIS-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585672#comment-16585672 ] Shashikant Banerjee commented on RATIS-291: --- In patch v1, the leader steps down if the raft log worker encounters an error while writing/truncating the log file, or if a StateMachineException is thrown while applying the log. > Raft Server should fail themselves when a raft storage directory fails > -- > > Key: RATIS-291 > URL: https://issues.apache.org/jira/browse/RATIS-291 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-291.01.patch > > > A Raft server uses a storage directory to store the write ahead log. If this > log is lost because of a reason, then this node should fail itself. > For a follower, if raft log location has failed, then the follower will not > be able to append any entries. This node will now be lagging behind the > follower and will eventually be notified via notifySlowness. > For a leader where the raft log disk has failed, the leader will not append > any new entries to its log. However with respect to the raft ring, the leader > will still remain healthy. This jira proposes to add a new api to identify a > leader with failed node. > Also this jira also proposes to add a new api to the statemachine, so that > state machine implementation can provide methods to verify the raft log > location. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
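The step-down behaviour described in the comment above could be wired roughly as follows; all class and method names here are placeholders for illustration, not the actual Ratis server APIs:
{code:java}
import java.io.FileOutputStream;
import java.io.IOException;

// Sketch only: if writing the raft log fails (disk gone, volume read-only...),
// report it so a leader can step down rather than stay nominally healthy
// while being unable to append. FailureListener/onLogFailure are placeholders.
final class LogWorkerSketch {
  interface FailureListener {
    void onLogFailure(IOException cause);
  }

  private final FileOutputStream out;
  private final FailureListener server;

  LogWorkerSketch(FileOutputStream out, FailureListener server) {
    this.out = out;
    this.server = server;
  }

  void writeLogEntry(byte[] entry) {
    try {
      out.write(entry);
      out.flush();
    } catch (IOException e) {
      server.onLogFailure(e); // e.g. the leader steps down / server fails itself
      throw new IllegalStateException("raft log write failed", e);
    }
  }
}
{code}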
[jira] [Comment Edited] (RATIS-291) Raft Server should fail themselves when a raft storage directory fails
[ https://issues.apache.org/jira/browse/RATIS-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585672#comment-16585672 ] Shashikant Banerjee edited comment on RATIS-291 at 8/20/18 9:44 AM: In patch v1, the leader steps down if the raft log worker encounters an error while writing/truncating the log file, or if a StateMachineException is thrown while applying the log. The state machine API to verify the log location can be added as a separate Jira. was (Author: shashikant): In Patch v1 , the leader steps down in case the raft log worker encounters an error while writing/truncating the log file or in case of any stateMachineException thrown while applying the log. > Raft Server should fail themselves when a raft storage directory fails > -- > > Key: RATIS-291 > URL: https://issues.apache.org/jira/browse/RATIS-291 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-291.01.patch > > > A Raft server uses a storage directory to store the write ahead log. If this > log is lost because of a reason, then this node should fail itself. > For a follower, if raft log location has failed, then the follower will not > be able to append any entries. This node will now be lagging behind the > follower and will eventually be notified via notifySlowness. > For a leader where the raft log disk has failed, the leader will not append > any new entries to its log. However with respect to the raft ring, the leader > will still remain healthy. This jira proposes to add a new api to identify a > leader with failed node. > Also this jira also proposes to add a new api to the statemachine, so that > state machine implementation can provide methods to verify the raft log > location. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-301) provide a force option to reinitialize group from a client in a different group
[ https://issues.apache.org/jira/browse/RATIS-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586179#comment-16586179 ] Shashikant Banerjee commented on RATIS-301: --- Thanks [~msingh] for the review. patch v3 looks good to me. +1 > provide a force option to reinitialize group from a client in a different > group > --- > > Key: RATIS-301 > URL: https://issues.apache.org/jira/browse/RATIS-301 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-301.003.patch > > > Currently for a client to re-initialize a raft group, it should be in the > same group as the server's current group. This jira proposes to add a force > option to override this requirement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-291) Raft Server should fail themselves when a raft storage directory fails
[ https://issues.apache.org/jira/browse/RATIS-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-291: -- Attachment: (was: RATIS-291.01.patch) > Raft Server should fail themselves when a raft storage directory fails > -- > > Key: RATIS-291 > URL: https://issues.apache.org/jira/browse/RATIS-291 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-291.02.patch > > > A Raft server uses a storage directory to store the write ahead log. If this > log is lost because of a reason, then this node should fail itself. > For a follower, if raft log location has failed, then the follower will not > be able to append any entries. This node will now be lagging behind the > follower and will eventually be notified via notifySlowness. > For a leader where the raft log disk has failed, the leader will not append > any new entries to its log. However with respect to the raft ring, the leader > will still remain healthy. This jira proposes to add a new api to identify a > leader with failed node. > Also this jira also proposes to add a new api to the statemachine, so that > state machine implementation can provide methods to verify the raft log > location. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-291) Raft Server should fail themselves when a raft storage directory fails
[ https://issues.apache.org/jira/browse/RATIS-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-291: -- Attachment: RATIS-291.02.patch > Raft Server should fail themselves when a raft storage directory fails > -- > > Key: RATIS-291 > URL: https://issues.apache.org/jira/browse/RATIS-291 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-291.02.patch > > > A Raft server uses a storage directory to store the write ahead log. If this > log is lost because of a reason, then this node should fail itself. > For a follower, if raft log location has failed, then the follower will not > be able to append any entries. This node will now be lagging behind the > follower and will eventually be notified via notifySlowness. > For a leader where the raft log disk has failed, the leader will not append > any new entries to its log. However with respect to the raft ring, the leader > will still remain healthy. This jira proposes to add a new api to identify a > leader with failed node. > Also this jira also proposes to add a new api to the statemachine, so that > state machine implementation can provide methods to verify the raft log > location. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-291) Raft Server should fail themselves when a raft storage directory fails
[ https://issues.apache.org/jira/browse/RATIS-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590680#comment-16590680 ] Shashikant Banerjee commented on RATIS-291: --- Thanks [~szetszwo], for the review. I think it is really not required to step down the leader when the server is already getting terminated. Updated patch v2. > Raft Server should fail themselves when a raft storage directory fails > -- > > Key: RATIS-291 > URL: https://issues.apache.org/jira/browse/RATIS-291 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-291.02.patch > > > A Raft server uses a storage directory to store the write ahead log. If this > log is lost because of a reason, then this node should fail itself. > For a follower, if raft log location has failed, then the follower will not > be able to append any entries. This node will now be lagging behind the > follower and will eventually be notified via notifySlowness. > For a leader where the raft log disk has failed, the leader will not append > any new entries to its log. However with respect to the raft ring, the leader > will still remain healthy. This jira proposes to add a new api to identify a > leader with failed node. > Also this jira also proposes to add a new api to the statemachine, so that > state machine implementation can provide methods to verify the raft log > location. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-303) TestRaftStateMachineException is failing with NullPointerException
[ https://issues.apache.org/jira/browse/RATIS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-303: -- Attachment: RATIS-303.00.patch > TestRaftStateMachineException is failing with NullPointerException > -- > > Key: RATIS-303 > URL: https://issues.apache.org/jira/browse/RATIS-303 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-303.00.patch > > > TestRaftStateMachineException is failing with the following exception > {code} > [ERROR] > testRetryOnExceptionDuringReplication[2](org.apache.ratis.statemachine.TestRaftStateMachineException) > Time elapsed: 0.001 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.ratis.statemachine.TestRaftStateMachineException.testRetryOnExceptionDuringReplication(TestRaftStateMachineException.java:139) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (RATIS-310) Add support for Retry Policy in Ratis
Shashikant Banerjee created RATIS-310: - Summary: Add support for Retry Policy in Ratis Key: RATIS-310 URL: https://issues.apache.org/jira/browse/RATIS-310 Project: Ratis Issue Type: Bug Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Description: Currently, ratis retries indefinitely if a client request fails. This Jira aims to add retryPolicy in Ratis which : 1) Adds a policy to retry with a fixed count and with fixed sleep interval 2) Default policy is set to RETRY_FOREVER > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
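The two policies described for RATIS-310 can be sketched as follows; the interface shape is an assumption for illustration and not necessarily the API introduced by the patch:
{code:java}
import java.util.concurrent.TimeUnit;

// Sketch only: a retry-forever policy and a "retry up to N attempts with a
// fixed sleep" policy, as described in the issue. Interface is illustrative.
interface RetryPolicy {
  /** @return true if the caller should retry after the given attempt count. */
  boolean shouldRetry(int attemptCount);

  /** Sleep time before the next attempt, in milliseconds. */
  long getSleepTimeMs();

  RetryPolicy RETRY_FOREVER = new RetryPolicy() {
    @Override public boolean shouldRetry(int attemptCount) { return true; }
    @Override public long getSleepTimeMs() { return 0; }
  };

  static RetryPolicy retryUpToMaxAttemptsWithFixedSleep(
      int maxAttempts, long sleep, TimeUnit unit) {
    final long sleepMs = unit.toMillis(sleep);
    return new RetryPolicy() {
      @Override public boolean shouldRetry(int attemptCount) {
        return attemptCount < maxAttempts;
      }
      @Override public long getSleepTimeMs() { return sleepMs; }
    };
  }
}
{code}
A caller would then loop on shouldRetry(attempt) and sleep getSleepTimeMs() between attempts, with RETRY_FOREVER preserving today's indefinite-retry behaviour as the default.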
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.00.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Attachments: RATIS-310.00.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.01.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.01.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: (was: RATIS-310.00.patch) > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.01.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602671#comment-16602671 ] Shashikant Banerjee commented on RATIS-310: --- Thanks [~msingh], for the review. Patch v1 addresses your review comments. > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.01.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-303) TestRaftStateMachineException is failing with NullPointerException
[ https://issues.apache.org/jira/browse/RATIS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-303: -- Attachment: (was: RATIS-303.00.patch) > TestRaftStateMachineException is failing with NullPointerException > -- > > Key: RATIS-303 > URL: https://issues.apache.org/jira/browse/RATIS-303 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-303.01.patch > > > TestRaftStateMachineException is failing with the following exception > {code} > [ERROR] > testRetryOnExceptionDuringReplication[2](org.apache.ratis.statemachine.TestRaftStateMachineException) > Time elapsed: 0.001 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.ratis.statemachine.TestRaftStateMachineException.testRetryOnExceptionDuringReplication(TestRaftStateMachineException.java:139) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-303) TestRaftStateMachineException is failing with NullPointerException
[ https://issues.apache.org/jira/browse/RATIS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-303: -- Attachment: RATIS-303.01.patch > TestRaftStateMachineException is failing with NullPointerException > -- > > Key: RATIS-303 > URL: https://issues.apache.org/jira/browse/RATIS-303 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-303.01.patch > > > TestRaftStateMachineException is failing with the following exception > {code} > [ERROR] > testRetryOnExceptionDuringReplication[2](org.apache.ratis.statemachine.TestRaftStateMachineException) > Time elapsed: 0.001 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.ratis.statemachine.TestRaftStateMachineException.testRetryOnExceptionDuringReplication(TestRaftStateMachineException.java:139) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-303) TestRaftStateMachineException is failing with NullPointerException
[ https://issues.apache.org/jira/browse/RATIS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602673#comment-16602673 ] Shashikant Banerjee commented on RATIS-303: --- Thanks [~msingh], for the review. patch v1 addresses your review comments. > TestRaftStateMachineException is failing with NullPointerException > -- > > Key: RATIS-303 > URL: https://issues.apache.org/jira/browse/RATIS-303 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-303.01.patch > > > TestRaftStateMachineException is failing with the following exception > {code} > [ERROR] > testRetryOnExceptionDuringReplication[2](org.apache.ratis.statemachine.TestRaftStateMachineException) > Time elapsed: 0.001 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.ratis.statemachine.TestRaftStateMachineException.testRetryOnExceptionDuringReplication(TestRaftStateMachineException.java:139) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-303) TestRaftStateMachineException is failing with NullPointerException
[ https://issues.apache.org/jira/browse/RATIS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-303: -- Attachment: RATIS-303.02.patch > TestRaftStateMachineException is failing with NullPointerException > -- > > Key: RATIS-303 > URL: https://issues.apache.org/jira/browse/RATIS-303 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-303.02.patch > > > TestRaftStateMachineException is failing with the following exception > {code} > [ERROR] > testRetryOnExceptionDuringReplication[2](org.apache.ratis.statemachine.TestRaftStateMachineException) > Time elapsed: 0.001 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.ratis.statemachine.TestRaftStateMachineException.testRetryOnExceptionDuringReplication(TestRaftStateMachineException.java:139) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-303) TestRaftStateMachineException is failing with NullPointerException
[ https://issues.apache.org/jira/browse/RATIS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-303: -- Attachment: (was: RATIS-303.01.patch) > TestRaftStateMachineException is failing with NullPointerException > -- > > Key: RATIS-303 > URL: https://issues.apache.org/jira/browse/RATIS-303 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-303.02.patch > > > TestRaftStateMachineException is failing with the following exception > {code} > [ERROR] > testRetryOnExceptionDuringReplication[2](org.apache.ratis.statemachine.TestRaftStateMachineException) > Time elapsed: 0.001 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.ratis.statemachine.TestRaftStateMachineException.testRetryOnExceptionDuringReplication(TestRaftStateMachineException.java:139) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-303) TestRaftStateMachineException is failing with NullPointerException
[ https://issues.apache.org/jira/browse/RATIS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605535#comment-16605535 ] Shashikant Banerjee commented on RATIS-303: --- Thanks [~szetszwo], for the review. Patch v1 moves the TestRaftStateMachineException tests to ratis-server and adds the subclasses for the rpcs. ...Could you describe how the patch fix NullPointerException? The NullPointerException was caused because, in the TestStateMachine, a fake exception is thrown during preAppendTransaction. This leads to the leader stepping down, so a subsequent client call with the leader set to null fails with a NullPointerException. Since the single cluster instance was shared among all the tests, the leader being set to null intermittently led to the failure of other tests as well. The exception is addressed by waiting for the leader to come up and sending the next request to the proper leader after the stateMachine exception is thrown in testRetryOnExceptionDuringReplication. > TestRaftStateMachineException is failing with NullPointerException > -- > > Key: RATIS-303 > URL: https://issues.apache.org/jira/browse/RATIS-303 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Fix For: 0.3.0 > > Attachments: RATIS-303.02.patch > > > TestRaftStateMachineException is failing with the following exception > {code} > [ERROR] > testRetryOnExceptionDuringReplication[2](org.apache.ratis.statemachine.TestRaftStateMachineException) > Time elapsed: 0.001 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.ratis.statemachine.TestRaftStateMachineException.testRetryOnExceptionDuringReplication(TestRaftStateMachineException.java:139) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
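A rough sketch of the fix described in the comment above for RATIS-303: poll until a leader is elected before sending the next request, so the client never dereferences a null leader. The cluster is abstracted behind a Supplier; the real test utilities used by the patch may differ.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Placeholder sketch: wait for a leader to come up before sending the next
// request after the injected state machine exception forces a step-down.
public final class WaitForLeader {

  /** Poll the supplier until it returns a non-null leader id or the timeout expires. */
  public static String waitForLeader(Supplier<String> currentLeader, long timeoutMillis)
      throws InterruptedException {
    final long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      final String leader = currentLeader.get();
      if (leader != null) {
        return leader;
      }
      TimeUnit.MILLISECONDS.sleep(100);
    }
    throw new IllegalStateException("No leader elected within " + timeoutMillis + " ms");
  }
}
{code}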
[jira] [Commented] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605562#comment-16605562 ] Shashikant Banerjee commented on RATIS-310: --- Thanks [~szetszwo], for the review. patch v2 addresses your review comments. > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.01.patch, RATIS-310.02.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.02.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.01.patch, RATIS-310.02.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (RATIS-313) Raft client ignores the reinitilization exception when the raft server is not ready
Shashikant Banerjee created RATIS-313: - Summary: Raft client ignores the reinitilization exception when the raft server is not ready Key: RATIS-313 URL: https://issues.apache.org/jira/browse/RATIS-313 Project: Ratis Issue Type: Bug Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee This was found in Ozone testing. Three nodes in the pipeline. {code:java} group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858, faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858, ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858] {code} On two servers, the reinitialization request succeeds, {code:java} 2018-09-09 10:49:40,938 INFO org.apache.ratis.server.impl.RaftServerProxy: faa888b7-92bb-4e35-a38c-711bd1c28948: reinitializeAsync ReinitializeRequest(client-682DF1D0F737->faa888b7-92bb-4e35-a38c-711bd1c28948) in group-7347726F7570, cid=4, seq=0 RW, null, group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858, faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858, ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858 2018-09-09 10:49:40,209 INFO org.apache.ratis.server.impl.RaftServerProxy: bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: reinitializeAsync ReinitializeRequest(client-DFE3ACF394F9->bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9) in group-7347726F7570, cid=3, seq=0 RW, null, group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858, faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858, ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858] {code} But around the same time, the third server is not ready {code:java} 2018-09-09 10:49:41,414 WARN org.apache.ratis.grpc.server.RaftServerProtocolService: ff544de8-96ea-4097-8cdc-460ac1c60db7: Failed requestVote bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9->ff544de8-96ea-4097-8cdc-460ac1c60db7#0: org.apache.ratis.protocol.ServerNotReadyException: Server ff544de8-96ea-4097-8cdc-460ac1c60db7 is not [RUNNING]: current state is STARTING {code} Though the reinitialization request never got processed on this server, the exception is ignored in RaftClientImpl. This needs to be addressed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
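For RATIS-313, a hypothetical sketch of the behaviour the issue asks for: rather than silently ignoring a "server not ready" failure, the client surfaces it and retries the reinitialize call until the server leaves the STARTING state. The exception class below is only a stand-in for org.apache.ratis.protocol.ServerNotReadyException, and the retry wrapper is not the RaftClientImpl code.
{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: retry a reinitialize call while the target server
// reports it is not yet RUNNING, instead of dropping the exception.
public final class ReinitializeWithRetry {

  /** Stand-in for org.apache.ratis.protocol.ServerNotReadyException. */
  public static class ServerNotReadyException extends Exception {
    public ServerNotReadyException(String msg) { super(msg); }
  }

  public static <T> T callWithRetry(Callable<T> reinitialize, int maxAttempts) throws Exception {
    ServerNotReadyException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return reinitialize.call();
      } catch (ServerNotReadyException e) {
        last = e;                              // do not swallow the failure
        TimeUnit.MILLISECONDS.sleep(200);      // wait for the server to start
      }
    }
    if (last == null) {
      throw new IllegalStateException("maxAttempts must be positive");
    }
    throw last;                                // propagate instead of ignoring
  }
}
{code}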
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.03.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.03.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: (was: RATIS-310.02.patch) > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.03.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609844#comment-16609844 ] Shashikant Banerjee commented on RATIS-310: --- Thanks [~szetszwo] for the review. Patch v3 addresses your review comments. > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.03.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: (was: RATIS-310.03.patch) > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.03.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.03.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.03.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: (was: RATIS-310.03.patch) > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.04.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.04.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.04.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609932#comment-16609932 ] Shashikant Banerjee commented on RATIS-310: --- Thanks [~szetszwo], for the review. patch v4 addresses the review comments. > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.04.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: (was: RATIS-310.04.patch) > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610554#comment-16610554 ] Shashikant Banerjee commented on RATIS-310: --- Patch v5 adds the setter function to set the retry policy in RaftCiient. > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.05.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.05.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.05.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610554#comment-16610554 ] Shashikant Banerjee edited comment on RATIS-310 at 9/11/18 1:09 PM: Patch v5 adds the setter function to set the retry policy in RaftClient. was (Author: shashikant): Patch v5 adds the setter function to set the retry policy in RaftCiient. > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.05.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (RATIS-313) Raft client ignores the reinitilization exception when the raft server is not ready
[ https://issues.apache.org/jira/browse/RATIS-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-313. --- Resolution: Not A Problem > Raft client ignores the reinitilization exception when the raft server is not > ready > --- > > Key: RATIS-313 > URL: https://issues.apache.org/jira/browse/RATIS-313 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > > This was found in Ozone testing. > Three nodes in the pipeline. > {code:java} > group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858, > faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858, > ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858] > {code} > On two servers, the reinitialization request succeeds, > {code:java} > 2018-09-09 10:49:40,938 INFO org.apache.ratis.server.impl.RaftServerProxy: > faa888b7-92bb-4e35-a38c-711bd1c28948: reinitializeAsync > ReinitializeRequest(client-682DF1D0F737->faa888b7-92bb-4e35-a38c-711bd1c28948) > in group-7347726F7570, cid=4, seq=0 RW, null, > group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858, > faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858, > ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858 > 2018-09-09 10:49:40,209 INFO org.apache.ratis.server.impl.RaftServerProxy: > bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9: reinitializeAsync > ReinitializeRequest(client-DFE3ACF394F9->bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9) > in group-7347726F7570, cid=3, seq=0 RW, null, > group-2041ABBEE452:[bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9:172.27.12.96:9858, > faa888b7-92bb-4e35-a38c-711bd1c28948:172.27.80.23:9858, > ff544de8-96ea-4097-8cdc-460ac1c60db7:172.27.23.161:9858] > {code} > But around the same time, the third server is not ready > {code:java} > 2018-09-09 10:49:41,414 WARN > org.apache.ratis.grpc.server.RaftServerProtocolService: > ff544de8-96ea-4097-8cdc-460ac1c60db7: Failed requestVote > bfe9c5f2-da9b-4a8f-9013-7540cbbed1c9->ff544de8-96ea-4097-8cdc-460ac1c60db7#0: > org.apache.ratis.protocol.ServerNotReadyException: Server > ff544de8-96ea-4097-8cdc-460ac1c60db7 is not [RUNNING]: current state is > STARTING > {code} > Though the reinitialization request never got processed on this server, the > exception is ignored in RaftClientImpl. This needs to be addressed -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: (was: RATIS-310.05.patch) > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Attachment: RATIS-310.06.patch > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.06.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610673#comment-16610673 ] Shashikant Banerjee commented on RATIS-310: --- Patch v6 fixes the test failures. > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: ozone > Attachments: RATIS-310.06.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-310) Add support for Retry Policy in Ratis
[ https://issues.apache.org/jira/browse/RATIS-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-310: -- Labels: acadia ozone (was: ozone) > Add support for Retry Policy in Ratis > - > > Key: RATIS-310 > URL: https://issues.apache.org/jira/browse/RATIS-310 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Labels: acadia, ozone > Attachments: RATIS-310.06.patch > > > Currently, ratis retries indefinitely if a client request fails. This Jira > aims to add retryPolicy in Ratis which : > 1) Adds a policy to retry with a fixed count and with fixed sleep interval > 2) Default policy is set to RETRY_FOREVER -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (RATIS-318) Ratis is leaking managed channel
[ https://issues.apache.org/jira/browse/RATIS-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee reassigned RATIS-318: - Assignee: Shashikant Banerjee > Ratis is leaking managed channel > > > Key: RATIS-318 > URL: https://issues.apache.org/jira/browse/RATIS-318 > Project: Ratis > Issue Type: Bug > Components: client >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > > TestDataValidate in Ozone throws the following exception. > {code} > java.lang.RuntimeException: ManagedChannel allocation site > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.(ManagedChannelOrphanWrapper.java:103) > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:53) > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:44) > at > org.apache.ratis.shaded.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:410) > at > org.apache.ratis.grpc.client.RaftClientProtocolClient.(RaftClientProtocolClient.java:80) > at > org.apache.ratis.grpc.client.GrpcClientRpc.lambda$new$0(GrpcClientRpc.java:56) > at > org.apache.ratis.util.PeerProxyMap$PeerAndProxy.lambda$getProxy$0(PeerProxyMap.java:55) > at > org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:182) > at > org.apache.ratis.util.PeerProxyMap$PeerAndProxy.getProxy(PeerProxyMap.java:54) > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:101) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:313) > at > org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetry(RaftClientImpl.java:268) > at > org.apache.ratis.client.impl.RaftClientImpl.send(RaftClientImpl.java:197) > at > org.apache.ratis.client.impl.RaftClientImpl.send(RaftClientImpl.java:178) > at org.apache.ratis.client.RaftClient.send(RaftClient.java:82) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.sendRequest(XceiverClientRatis.java:193) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.sendCommand(XceiverClientRatis.java:210) > at > org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:297) > at > org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.checkKeyLocationInfo(ChunkGroupOutputStream.java:197) > at > org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.addPreallocateBlocks(ChunkGroupOutputStream.java:180) > at > org.apache.hadoop.ozone.client.rpc.RpcClient.createKey(RpcClient.java:472) > at > org.apache.hadoop.ozone.client.OzoneBucket.createKey(OzoneBucket.java:245) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$OfflineProcessor.run(RandomKeyGenerator.java:601) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-325) RetryPolicies should not import com.google.common.annotations.VisibleForTesting.
[ https://issues.apache.org/jira/browse/RATIS-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620246#comment-16620246 ] Shashikant Banerjee commented on RATIS-325: --- Thanks [~szetszwo], for the review. Patch looks good to me. +1. > RetryPolicies should not import > com.google.common.annotations.VisibleForTesting. > > > Key: RATIS-325 > URL: https://issues.apache.org/jira/browse/RATIS-325 > Project: Ratis > Issue Type: Improvement > Components: client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Minor > Attachments: r325_20180918.patch > > > It should import the shaded class instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (RATIS-325) RetryPolicies should not import com.google.common.annotations.VisibleForTesting.
[ https://issues.apache.org/jira/browse/RATIS-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620246#comment-16620246 ] Shashikant Banerjee edited comment on RATIS-325 at 9/19/18 9:23 AM: Thanks [~szetszwo], for the patch. Patch looks good to me. +1. was (Author: shashikant): Thanks [~szetszwo], for the review. Patch looks good to me. +1. > RetryPolicies should not import > com.google.common.annotations.VisibleForTesting. > > > Key: RATIS-325 > URL: https://issues.apache.org/jira/browse/RATIS-325 > Project: Ratis > Issue Type: Improvement > Components: client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Minor > Attachments: r325_20180918.patch > > > It should import the shaded class instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-326) Introduce RemoveStateMachineData API in StateMachine interface in Ratis
[ https://issues.apache.org/jira/browse/RATIS-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-326: -- Summary: Introduce RemoveStateMachineData API in StateMachine interface in Ratis (was: Introduce RemoveStateMachine Data in StateMachine interface in Ratis) > Introduce RemoveStateMachineData API in StateMachine interface in Ratis > --- > > Key: RATIS-326 > URL: https://issues.apache.org/jira/browse/RATIS-326 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > > When a follower truncates its log entry in case there is a mismatch between > the received log entry and its own stored entry, we should also remove the > stateMachine data written as a part of appending the stored log entry on the > follower. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (RATIS-326) Introduce RemoveStateMachine Data in StateMachine interface in Ratis
Shashikant Banerjee created RATIS-326: - Summary: Introduce RemoveStateMachine Data in StateMachine interface in Ratis Key: RATIS-326 URL: https://issues.apache.org/jira/browse/RATIS-326 Project: Ratis Issue Type: Bug Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee When a follower truncates its log entry in case there is a mismatch between the received log entry and its own stored entry, we should also remove the stateMachine data written as a part of appending the stored log entry on the follower. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
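For RATIS-326, a hypothetical sketch of the proposed hook: alongside the data written while appending a log entry, the state machine gets a counterpart call to drop that data when the follower truncates the entry. The names below are placeholders, not the actual StateMachine interface.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: pair the write of state machine data with a remove
// that is invoked when the raft log truncates the corresponding entries.
public class StateMachineDataStore {

  private final ConcurrentMap<Long, byte[]> dataByLogIndex = new ConcurrentHashMap<>();

  /** Called while appending the log entry (write of state machine data). */
  public CompletableFuture<Void> writeStateMachineData(long logIndex, byte[] data) {
    dataByLogIndex.put(logIndex, data);
    return CompletableFuture.completedFuture(null);
  }

  /** Proposed counterpart: called when the follower truncates entries >= fromLogIndex. */
  public CompletableFuture<Void> removeStateMachineData(long fromLogIndex) {
    dataByLogIndex.keySet().removeIf(index -> index >= fromLogIndex);
    return CompletableFuture.completedFuture(null);
  }
}
{code}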
[jira] [Assigned] (RATIS-331) Ratis client should provide a method to wait for commit from all the replica
[ https://issues.apache.org/jira/browse/RATIS-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee reassigned RATIS-331: - Assignee: Shashikant Banerjee (was: Mukul Kumar Singh) > Ratis client should provide a method to wait for commit from all the replica > > > Key: RATIS-331 > URL: https://issues.apache.org/jira/browse/RATIS-331 > Project: Ratis > Issue Type: Bug > Components: client >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > > Ratis client should provide a method to wait for commit from all the peers. > Also it will be great is supplier method can be provided to take an action on > this event. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-318) Ratis is leaking managed channel
[ https://issues.apache.org/jira/browse/RATIS-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628882#comment-16628882 ] Shashikant Banerjee commented on RATIS-318: --- [~szetszwo], I think the issue is with closing of the xceiverClients in Ozone. Same issue exist with XceiverClientGrpc as well . Resolving it here. {code:java} Sep 26, 2018 8:11:01 PM org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference cleanQueue SEVERE: *~*~*~ Channel ManagedChannelImpl{logId=136, target=192.168.1.2:50712} was not shutdown properly!!! ~*~*~* Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true. java.lang.RuntimeException: ManagedChannel allocation site at org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.(ManagedChannelOrphanWrapper.java:103) at org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:53) at org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:44) at org.apache.ratis.shaded.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:410) at org.apache.hadoop.hdds.scm.XceiverClientGrpc.connect(XceiverClientGrpc.java:92) at org.apache.hadoop.hdds.scm.XceiverClientManager$2.call(XceiverClientManager.java:159) at org.apache.hadoop.hdds.scm.XceiverClientManager$2.call(XceiverClientManager.java:144) at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) at com.google.common.cache.LocalCache.get(LocalCache.java:3965) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4764) at org.apache.hadoop.hdds.scm.XceiverClientManager.getClient(XceiverClientManager.java:143) at org.apache.hadoop.hdds.scm.XceiverClientManager.acquireClient(XceiverClientManager.java:122) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.checkKeyLocationInfo(ChunkGroupOutputStream.java:192) at org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.addPreallocateBlocks(ChunkGroupOutputStream.java:180) at org.apache.hadoop.ozone.client.rpc.RpcClient.createKey(RpcClient.java:472) at org.apache.hadoop.ozone.client.OzoneBucket.createKey(OzoneBucket.java:262) at org.apache.hadoop.ozone.freon.RandomKeyGenerator$OfflineProcessor.run(RandomKeyGenerator.java:601) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) at java.util.concurrent.FutureTask.run(FutureTask.java) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} > Ratis is leaking managed channel > > > Key: RATIS-318 > URL: https://issues.apache.org/jira/browse/RATIS-318 > Project: Ratis > Issue Type: Bug > Components: client >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > > TestDataValidate in Ozone throws the following exception. 
> {code} > java.lang.RuntimeException: ManagedChannel allocation site > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.(ManagedChannelOrphanWrapper.java:103) > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:53) > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:44) > at > org.apache.ratis.shaded.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:410) > at > org.apache.ratis.grpc.client.RaftClientProtocolClient.(RaftClientProtocolClient.java:80) > at > org.apache.ratis.grpc.client.GrpcClientRpc.lambda$new$0(GrpcClientRpc.java:56) > at > org.apache.ratis.util.PeerProxyMap$PeerAndProxy.lambda$getProxy$0(PeerProxyMap.java:55) > at > org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:182) > at > org.apache.ratis.util.PeerProxyMap$PeerAndProxy.getProxy(PeerProxyMap.java:54) > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:101) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:78) >
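The warning quoted above already states the remedy for RATIS-318: call shutdown()/shutdownNow() on each ManagedChannel and wait until awaitTermination() returns true when the client is closed. A minimal sketch using the plain io.grpc API (Ratis uses the shaded equivalent):
{code:java}
import io.grpc.ManagedChannel;
import java.util.concurrent.TimeUnit;

// Minimal sketch of closing a gRPC channel properly, as the warning suggests.
public final class ChannelCloser {

  public static void close(ManagedChannel channel) throws InterruptedException {
    channel.shutdown();                                  // start graceful shutdown
    if (!channel.awaitTermination(5, TimeUnit.SECONDS)) {
      channel.shutdownNow();                             // force-close stragglers
      channel.awaitTermination(5, TimeUnit.SECONDS);
    }
  }
}
{code}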
[jira] [Resolved] (RATIS-318) Ratis is leaking managed channel
[ https://issues.apache.org/jira/browse/RATIS-318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-318. --- Resolution: Fixed > Ratis is leaking managed channel > > > Key: RATIS-318 > URL: https://issues.apache.org/jira/browse/RATIS-318 > Project: Ratis > Issue Type: Bug > Components: client >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > > TestDataValidate in Ozone throws the following exception. > {code} > java.lang.RuntimeException: ManagedChannel allocation site > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.(ManagedChannelOrphanWrapper.java:103) > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:53) > at > org.apache.ratis.shaded.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:44) > at > org.apache.ratis.shaded.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:410) > at > org.apache.ratis.grpc.client.RaftClientProtocolClient.(RaftClientProtocolClient.java:80) > at > org.apache.ratis.grpc.client.GrpcClientRpc.lambda$new$0(GrpcClientRpc.java:56) > at > org.apache.ratis.util.PeerProxyMap$PeerAndProxy.lambda$getProxy$0(PeerProxyMap.java:55) > at > org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:182) > at > org.apache.ratis.util.PeerProxyMap$PeerAndProxy.getProxy(PeerProxyMap.java:54) > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:101) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:313) > at > org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetry(RaftClientImpl.java:268) > at > org.apache.ratis.client.impl.RaftClientImpl.send(RaftClientImpl.java:197) > at > org.apache.ratis.client.impl.RaftClientImpl.send(RaftClientImpl.java:178) > at org.apache.ratis.client.RaftClient.send(RaftClient.java:82) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.sendRequest(XceiverClientRatis.java:193) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.sendCommand(XceiverClientRatis.java:210) > at > org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:297) > at > org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.checkKeyLocationInfo(ChunkGroupOutputStream.java:197) > at > org.apache.hadoop.ozone.client.io.ChunkGroupOutputStream.addPreallocateBlocks(ChunkGroupOutputStream.java:180) > at > org.apache.hadoop.ozone.client.rpc.RpcClient.createKey(RpcClient.java:472) > at > org.apache.hadoop.ozone.client.OzoneBucket.createKey(OzoneBucket.java:245) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$OfflineProcessor.run(RandomKeyGenerator.java:601) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-234) Add an feature to watch if a request is replicated/committed to a particular ReplicationLevel
[ https://issues.apache.org/jira/browse/RATIS-234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637196#comment-16637196 ] Shashikant Banerjee commented on RATIS-234: --- Thanks [~szetszwo], for the patch. The patch does not apply to trunk. Can you please rebase? > Add an feature to watch if a request is replicated/committed to a particular > ReplicationLevel > - > > Key: RATIS-234 > URL: https://issues.apache.org/jira/browse/RATIS-234 > Project: Ratis > Issue Type: New Feature > Components: client, server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Labels: ozone > Attachments: r234_20181002.patch > > > When a client request is specified with ALL replication, it is possible that > it is committed (i.e. replicated to a majority of servers) but not yet > replicated to all servers. This feature is to let the client to watch it > until it is replicated to all server. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-337) In RaftServerImpl, leaderState/heartbeatMonitor may be accessed without proper null check
[ https://issues.apache.org/jira/browse/RATIS-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638267#comment-16638267 ] Shashikant Banerjee commented on RATIS-337: --- Thanks [~szetszwo] for the patch. The patch does not apply anymore. Can you please rebase? > In RaftServerImpl, leaderState/heartbeatMonitor may be accessed without > proper null check > - > > Key: RATIS-337 > URL: https://issues.apache.org/jira/browse/RATIS-337 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Attachments: r337_20181002.patch > > > leaderState/heartbeatMonitor is declared as volatile. Some code like below > won't work since leaderState may be set to null in between. > {code:java} > //RaftServerImpl.checkLeaderState(..) > } else if (leaderState == null || !leaderState.isReady()) { > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (RATIS-337) In RaftServerImpl, leaderState/heartbeatMonitor may be accessed without proper null check
[ https://issues.apache.org/jira/browse/RATIS-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-337: -- Comment: was deleted (was: Thanks [~szetszwo] for the patch. The patch does not apply anymore. Can you please rebase?) > In RaftServerImpl, leaderState/heartbeatMonitor may be accessed without > proper null check > - > > Key: RATIS-337 > URL: https://issues.apache.org/jira/browse/RATIS-337 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Attachments: r337_20181002.patch > > > leaderState/heartbeatMonitor is declared as volatile. Some code like below > won't work since leaderState may be set to null in between. > {code:java} > //RaftServerImpl.checkLeaderState(..) > } else if (leaderState == null || !leaderState.isReady()) { > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-336) LeaderState.isBootStrappingPeer may have NPE
[ https://issues.apache.org/jira/browse/RATIS-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638290#comment-16638290 ] Shashikant Banerjee commented on RATIS-336: --- The patch looks good to me . I am +1 on this. > LeaderState.isBootStrappingPeer may have NPE > > > Key: RATIS-336 > URL: https://issues.apache.org/jira/browse/RATIS-336 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Attachments: r336_20181001.patch > > > {code} > //LeaderState > boolean isBootStrappingPeer(RaftPeerId peerId) { > return inStagingState() && getStagingState().contains(peerId); > } > boolean inStagingState() { > return stagingState != null; > } > > ConfigurationStagingState getStagingState() { > return stagingState; > } > {code} > Since stagingState is volatile, it could be set to null between > inStagingState() and contains(..). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
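Since the quoted methods reread the volatile stagingState field, the usual fix for RATIS-336 is to copy the field into a local variable once per call, so the value checked for null is the same value that is dereferenced. A small sketch of that pattern with stand-in types:
{code:java}
// Sketch of the local-copy pattern for a volatile field; types are stand-ins
// for LeaderState and ConfigurationStagingState.
public class StagingHolder {

  public static class StagingState {
    public boolean contains(String peerId) {
      return false; // placeholder
    }
  }

  private volatile StagingState stagingState;

  boolean isBootStrappingPeer(String peerId) {
    final StagingState staging = stagingState;   // read the volatile exactly once
    return staging != null && staging.contains(peerId);
  }
}
{code}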
[jira] [Commented] (RATIS-337) In RaftServerImpl, leaderState/heartbeatMonitor may be accessed without proper null check
[ https://issues.apache.org/jira/browse/RATIS-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640144#comment-16640144 ] Shashikant Banerjee commented on RATIS-337: --- Thanks [~szetszwo] for the patch. The patch is not applying on trunk anymore. Can you check? > In RaftServerImpl, leaderState/heartbeatMonitor may be accessed without > proper null check > - > > Key: RATIS-337 > URL: https://issues.apache.org/jira/browse/RATIS-337 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Attachments: r337_20181002.patch > > > leaderState/heartbeatMonitor is declared as volatile. Some code like below > won't work since leaderState may be set to null in between. > {code:java} > //RaftServerImpl.checkLeaderState(..) > } else if (leaderState == null || !leaderState.isReady()) { > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-341) Raft log index on the follower should be applied to state machine only after writing the log
[ https://issues.apache.org/jira/browse/RATIS-341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640631#comment-16640631 ] Shashikant Banerjee commented on RATIS-341: --- Thanks [~msingh] for the patch and offline discussion. I am +1 on this. > Raft log index on the follower should be applied to state machine only after > writing the log > > > Key: RATIS-341 > URL: https://issues.apache.org/jira/browse/RATIS-341 > Project: Ratis > Issue Type: Bug > Components: server >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-341.002.patch > > > In follower, RaftServerImpl#appendEntriesAsync, entries should only be > applied to state machine > only after writing the log to the state machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (RATIS-331) Ratis client should provide a method to wait for commit from all the replica
[ https://issues.apache.org/jira/browse/RATIS-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-331. --- Resolution: Fixed > Ratis client should provide a method to wait for commit from all the replica > > > Key: RATIS-331 > URL: https://issues.apache.org/jira/browse/RATIS-331 > Project: Ratis > Issue Type: Bug > Components: client >Affects Versions: 0.3.0 >Reporter: Mukul Kumar Singh >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > > Ratis client should provide a method to wait for commit from all the peers. > Also it will be great is supplier method can be provided to take an action on > this event. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-382) writeStateMachineData times out
[ https://issues.apache.org/jira/browse/RATIS-382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670433#comment-16670433 ] Shashikant Banerjee commented on RATIS-382: --- >From logs on node >hadoop-root-datanode-ctr-e138-1518143905142-53-01-08.hwx.site : {code:java} 2018-10-31 07:31:06,654 ERROR org.apache.ratis.server.storage.RaftLogWorker: Terminating with exit status 1: 54026017-a738-45f5-92f9-c50a0fc24a9f-RaftLogWorker failed. org.apache.ratis.protocol.TimeoutIOException: Timeout: WriteLog:57: (t:3, i:57), STATEMACHINELOGENTRY, client-81616CC8EE42, cid=163-writeStateMachineData at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:87) at org.apache.ratis.server.storage.RaftLogWorker$WriteLog.execute(RaftLogWorker.java:310) at org.apache.ratis.server.storage.RaftLogWorker.run(RaftLogWorker.java:182) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:79) {code} Timeout Exception happened around 07:31. >From Ozone.log: {code:java} 2018-10-31 07:30:50,691 [pool-3-thread-48] DEBUG (ChunkManagerImpl.java:85) - writing chunk:7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_15 chunk stage:WRITE_DATA chunk file:/tmp/hadoop-root/dfs/data/hdds/4099890c-4d08-4e76-9850-b990bca90d6d/current/containerDir0/16/chunks/7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_15 tmp chunk file 2018-10-31 07:30:51,768 [pool-3-thread-49] DEBUG (ChunkManagerImpl.java:85) - writing chunk:7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_16 chunk stage:WRITE_DATA chunk file:/tmp/hadoop-root/dfs/data/hdds/4099890c-4d08-4e76-9850-b990bca90d6d/current/containerDir0/16/chunks/7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_16 tmp chunk file 2018-10-31 07:30:53,757 [pool-10-thread-1] DEBUG (ChunkManagerImpl.java:85) - writing chunk:7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_14 chunk stage:COMMIT_DATA chunk file:/tmp/hadoop-root/dfs/data/hdds/4099890c-4d08-4e76-9850-b990bca90d6d/current/containerDir0/16/chunks/7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_14 tmp chunk file 2018-10-31 07:31:06,673 [shutdown-hook-0] INFO (LogAdapter.java:51) - SHUTDOWN_MSG: // raftServer Stopped {code} These are the 2 write chunks during writeStateMachineData in flight. The commit for these has not happened yet. Looks like it indeed took more than 10 seconds for chunkFile *chunk file:/tmp/hadoop-root/dfs/data/hdds/4099890c-4d08-4e76-9850-b990bca90d6d/current/containerDir0/16/chunks/7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_15* to get written completely. May be increasing the timeout would help here. 
> writeStateMachineData times out > --- > > Key: RATIS-382 > URL: https://issues.apache.org/jira/browse/RATIS-382 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.3.0 >Reporter: Nilotpal Nandi >Priority: Blocker > Attachments: all-node-ozone-logs-1540979056.tar.gz > > > datanode stopped due to following error : > datanode.log > {noformat} > 2018-10-31 09:12:04,517 INFO org.apache.ratis.server.impl.RaftServerImpl: > 9fab9937-fbcd-4196-8014-cb165045724b: set configuration 169: > [9fab9937-fbcd-4196-8014-cb165045724b:172.27.15.131:9858, > ce0084c2-97cd-4c97-9378-e5175daad18b:172.27.15.139:9858, > f0291cb4-7a48-456a-847f-9f91a12aa850:172.27.38.9:9858], old=null at 169 > 2018-10-31 09:12:22,187 ERROR org.apache.ratis.server.storage.RaftLogWorker: > Terminating with exit status 1: > 9fab9937-fbcd-4196-8014-cb165045724b-RaftLogWorker failed. > org.apache.ratis.protocol.TimeoutIOException: Timeout: WriteLog:182: (t:10, > i:182), STATEMACHINELOGENTRY, client-611073BBFA46, > cid=127-writeStateMachineData > at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:87) > at > org.apache.ratis.server.storage.RaftLogWorker$WriteLog.execute(RaftLogWorker.java:310) > at org.apache.ratis.server.storage.RaftLogWorker.run(RaftLogWorker.java:182) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:79) > ... 3 more{noformat} -- This message was sent by Atlassian JIRA (v7
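If slow chunk writes are routinely exceeding the writeStateMachineData wait, the mitigation suggested in the comment above is to raise that timeout when building the RaftServer. Below is a minimal sketch; the property name raft.server.log.statemachine.data.sync.timeout and the 10-second default are assumptions from memory, so verify them against RaftServerConfigKeys for the Ratis version in use.
{code:java}
import org.apache.ratis.conf.RaftProperties;

// Sketch: build RaftProperties with a longer state-machine-data timeout.
public final class TimeoutTuningSketch {
  public static RaftProperties propertiesWithLongerTimeout() {
    final RaftProperties properties = new RaftProperties();
    // Assumed key name and default (10s); raise to 30s so that slow chunk
    // writes do not cause RaftLogWorker to terminate with TimeoutIOException.
    properties.set("raft.server.log.statemachine.data.sync.timeout", "30s");
    return properties;
  }

  private TimeoutTuningSketch() { }
}
{code}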
[jira] [Commented] (RATIS-382) writeStateMachineData times out
[ https://issues.apache.org/jira/browse/RATIS-382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670568#comment-16670568 ] Shashikant Banerjee commented on RATIS-382: --- Looking further at the nodes, the tmp chunk files do actually exist and are completely written: {code:java} -rw-r--r-- 1 root root 16M Oct 31 07:30 /tmp/hadoop-root/dfs/data/hdds/4099890c-4d08-4e76-9850-b990bca90d6d/current/containerDir0/16/chunks/7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_15.tmp -rw-r--r-- 1 root root 16M Oct 31 07:30 /tmp/hadoop-root/dfs/data/hdds/4099890c-4d08-4e76-9850-b990bca90d6d/current/containerDir0/16/chunks/7a6ab5f5d7891d266ab743b6054e678e_stream_1acd3f82-556f-4a37-8efd-029eb626d72c_chunk_16.tmp{code} > writeStateMachineData times out > --- > > Key: RATIS-382 > URL: https://issues.apache.org/jira/browse/RATIS-382 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.3.0 >Reporter: Nilotpal Nandi >Priority: Blocker > Attachments: all-node-ozone-logs-1540979056.tar.gz > > > datanode stopped due to following error : > datanode.log > {noformat} > 2018-10-31 09:12:04,517 INFO org.apache.ratis.server.impl.RaftServerImpl: > 9fab9937-fbcd-4196-8014-cb165045724b: set configuration 169: > [9fab9937-fbcd-4196-8014-cb165045724b:172.27.15.131:9858, > ce0084c2-97cd-4c97-9378-e5175daad18b:172.27.15.139:9858, > f0291cb4-7a48-456a-847f-9f91a12aa850:172.27.38.9:9858], old=null at 169 > 2018-10-31 09:12:22,187 ERROR org.apache.ratis.server.storage.RaftLogWorker: > Terminating with exit status 1: > 9fab9937-fbcd-4196-8014-cb165045724b-RaftLogWorker failed. > org.apache.ratis.protocol.TimeoutIOException: Timeout: WriteLog:182: (t:10, > i:182), STATEMACHINELOGENTRY, client-611073BBFA46, > cid=127-writeStateMachineData > at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:87) > at > org.apache.ratis.server.storage.RaftLogWorker$WriteLog.execute(RaftLogWorker.java:310) > at org.apache.ratis.server.storage.RaftLogWorker.run(RaftLogWorker.java:182) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException > at > java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) > at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) > at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:79) > ... 3 more{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (RATIS-386) Raft Client Async API's should honor Retry Policy
Shashikant Banerjee created RATIS-386: - Summary: Raft Client Async API's should honor Retry Policy Key: RATIS-386 URL: https://issues.apache.org/jira/browse/RATIS-386 Project: Ratis Issue Type: Improvement Components: client Affects Versions: 0.3.0 Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee Fix For: 0.3.0 Raft client sync Api has support for retry policies. Similarly, for Async API's including watch Api, support for Retry Policy is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
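For context, the sync client already accepts a retry policy at build time, roughly as sketched below; RATIS-386 asks for the async and watch paths to honor the same setting. The builder methods and RetryPolicies factory used here are assumed to match the 0.3.0 API, so treat this as an illustration rather than the exact surface.
{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.RaftGroup;
import org.apache.ratis.retry.RetryPolicies;
import org.apache.ratis.retry.RetryPolicy;
import org.apache.ratis.util.TimeDuration;

// Sketch: configure a bounded retry policy on the client builder.
public final class RetryPolicyClientSketch {
  public static RaftClient newClient(RaftGroup group, RaftProperties properties) {
    // Retry each request up to 5 times, sleeping 1 second between attempts.
    final RetryPolicy policy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        5, TimeDuration.valueOf(1, TimeUnit.SECONDS));
    return RaftClient.newBuilder()
        .setRaftGroup(group)
        .setProperties(properties)
        .setRetryPolicy(policy)  // today this only governs the sync calls
        .build();
  }

  private RetryPolicyClientSketch() { }
}
{code}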
[jira] [Commented] (RATIS-362) Add a Builder for TransactionContext
[ https://issues.apache.org/jira/browse/RATIS-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672908#comment-16672908 ] Shashikant Banerjee commented on RATIS-362: --- Thanks [~szetszwo] for the patch. The patch looks good to me. Just one minor comment: In TransactionContext, can we also have a setter function called setException (we already have getException exposed)? Then, if startTransaction fails inside the stateMachine, the exception can be set and properly handled here in RaftServerImpl: {code:java} // TODO: this client request will not be added to pending requests until // later which means that any failure in between will leave partial state in // the state machine. We should call cancelTransaction() for failed requests TransactionContext context = stateMachine.startTransaction(request); if (context.getException() != null) { RaftClientReply exceptionReply = new RaftClientReply(request, new StateMachineException(getId(), context.getException()), getCommitInfos()); cacheEntry.failWithReply(exceptionReply); return CompletableFuture.completedFuture(exceptionReply); } {code} > Add a Builder for TransactionContext > > > Key: RATIS-362 > URL: https://issues.apache.org/jira/browse/RATIS-362 > Project: Ratis > Issue Type: Improvement > Components: server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Attachments: r362_20181027.patch > > > Currently, we use TransactionContextImpl constructors to create > TransactionContext objects. However, TransactionContextImpl is supposed to > be internal but not a public API. It is better to add a Builder for > TransactionContext. The Builder is a public API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
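What the comment asks for is essentially one extra method on the TransactionContext API. The relevant slice is sketched below purely as an illustration of the proposal; the actual signature is up to the patch.
{code:java}
// Illustration only: relevant slice of the TransactionContext interface.
public interface TransactionContext {
  // Already exposed: the exception recorded for this transaction, if any.
  Exception getException();

  // Proposed: let the StateMachine record the failure it hit in startTransaction,
  // so RaftServerImpl can fail the request instead of leaving partial state behind.
  TransactionContext setException(Exception exception);
}
{code}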
[jira] [Comment Edited] (RATIS-362) Add a Builder for TransactionContext
[ https://issues.apache.org/jira/browse/RATIS-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672908#comment-16672908 ] Shashikant Banerjee edited comment on RATIS-362 at 11/2/18 10:44 AM: - Thanks [~szetszwo], for the patch. The patch looks good to me. Just one minor comment: In TransactionContext , can we also have a public setter function called setException (we already have getException exposed), so that, we can set the exception inside the stateMachine in case the startTransaction fails, the exception can be set and properly handled here in RaftServerImpl: {code:java} // TODO: this client request will not be added to pending requests until // later which means that any failure in between will leave partial state in // the state machine. We should call cancelTransaction() for failed requests TransactionContext context = stateMachine.startTransaction(request); if (context.getException() != null) { RaftClientReply exceptionReply = new RaftClientReply(request, new StateMachineException(getId(), context.getException()), getCommitInfos()); cacheEntry.failWithReply(exceptionReply); return CompletableFuture.completedFuture(exceptionReply); } {code} was (Author: shashikant): Thanks [~szetszwo], for the patch. The patch looks good to me. Just one minor comment: In TransactionContext , can we also have a setter function called setException (we already have getException exposed), so that, we can set the exception inside the stateMachine in case the startTransaction fails, the exception can be set and properly handled here in RaftServerImpl: {code:java} // TODO: this client request will not be added to pending requests until // later which means that any failure in between will leave partial state in // the state machine. We should call cancelTransaction() for failed requests TransactionContext context = stateMachine.startTransaction(request); if (context.getException() != null) { RaftClientReply exceptionReply = new RaftClientReply(request, new StateMachineException(getId(), context.getException()), getCommitInfos()); cacheEntry.failWithReply(exceptionReply); return CompletableFuture.completedFuture(exceptionReply); } {code} > Add a Builder for TransactionContext > > > Key: RATIS-362 > URL: https://issues.apache.org/jira/browse/RATIS-362 > Project: Ratis > Issue Type: Improvement > Components: server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Attachments: r362_20181027.patch > > > Currently, we use TransactionContextImpl constructors to create > TransactionContext objects. Howerver, TransactionContextImpl is supposed to > be internal but not a public API. It is better to add a Builder for > TransactionContext. The Builder is a public API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-362) Add a Builder for TransactionContext
[ https://issues.apache.org/jira/browse/RATIS-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674883#comment-16674883 ] Shashikant Banerjee commented on RATIS-362: --- Thanks [~szetszwo] for updating the patch. The patch looks good to me. I am +1 on this. > Add a Builder for TransactionContext > > > Key: RATIS-362 > URL: https://issues.apache.org/jira/browse/RATIS-362 > Project: Ratis > Issue Type: Improvement > Components: server >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze >Priority: Major > Attachments: r362_20181105.patch > > > Currently, we use TransactionContextImpl constructors to create > TransactionContext objects. However, TransactionContextImpl is supposed to > be internal but not a public API. It is better to add a Builder for > TransactionContext. The Builder is a public API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (RATIS-394) Remove the assertion while setting the exception in TransactionContextImpl
Shashikant Banerjee created RATIS-394: - Summary: Remove the assertion while setting the exception in TransactionContextImpl Key: RATIS-394 URL: https://issues.apache.org/jira/browse/RATIS-394 Project: Ratis Issue Type: Improvement Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee Fix For: 0.3.0 In the below code in TransactionContextImpl, {code:java} @Override public TransactionContext setException(Exception ioe) { assert exception != null; this.exception = ioe; return this; } {code} while setting the exception it asserts that the exception already maintained in the object is not null. When the exception is set for the first time, the field will always be null and hence the assertion fails. We should relax the check here. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
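One way the check could be relaxed is sketched below: validate the argument rather than asserting on the field, which is necessarily null on the first call. This is only an illustration and may differ from the attached RATIS-394.000.patch.
{code:java}
// Sketch of a relaxed setException (uses java.util.Objects).
@Override
public TransactionContext setException(Exception ioe) {
  // Validate the argument instead of the field; the field is expected to be
  // null the first time an exception is recorded on this context.
  this.exception = java.util.Objects.requireNonNull(ioe, "ioe == null");
  return this;
}
{code}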
[jira] [Updated] (RATIS-394) Remove the assertion while setting the exception in TransactionContextImpl
[ https://issues.apache.org/jira/browse/RATIS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-394: -- Description: In the below code in TransactionContextImpl, {code:java} @Override public TransactionContext setException(Exception ioe) { assert exception != null; this.exception = ioe; return this; } {code} While setting the exception it asserts the exception maintained in the object is not null or not. While setting the exception first time, it will be null always and hence asserts. We should relax the check here. was: In the below code in TransactionContaextImpl, {code:java} @Override public TransactionContext setException(Exception ioe) { assert exception != null; this.exception = ioe; return this; } {code} While setting the exception it asserts the exception maintained in the object is not null or not. While setting the exception first time, it will be null always and hence asserts. We should relax the check here. > Remove the assertion while setting the exception in TransactionContextImpl > -- > > Key: RATIS-394 > URL: https://issues.apache.org/jira/browse/RATIS-394 > Project: Ratis > Issue Type: Improvement >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > > In the below code in TransactionContextImpl, > {code:java} > @Override > public TransactionContext setException(Exception ioe) { > assert exception != null; > this.exception = ioe; > return this; > } > {code} > While setting the exception it asserts the exception maintained in the object > is not null or not. While setting the exception first time, it will be null > always and hence asserts. We should relax the check here. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-394) Remove the assertion while setting the exception in TransactionContextImpl
[ https://issues.apache.org/jira/browse/RATIS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-394: -- Attachment: RATIS-394.000.patch > Remove the assertion while setting the exception in TransactionContextImpl > -- > > Key: RATIS-394 > URL: https://issues.apache.org/jira/browse/RATIS-394 > Project: Ratis > Issue Type: Improvement >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-394.000.patch > > > In the below code in TransactionContextImpl, > {code:java} > @Override > public TransactionContext setException(Exception ioe) { > assert exception != null; > this.exception = ioe; > return this; > } > {code} > While setting the exception it asserts the exception maintained in the object > is not null or not. While setting the exception first time, it will be null > always and hence asserts. We should relax the check here. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-394) Remove the assertion while setting the exception in TransactionContextImpl
[ https://issues.apache.org/jira/browse/RATIS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-394: -- Description: In the below code in TransactionContextImpl, {code:java} @Override public TransactionContext setException(Exception ioe) { assert exception != null; this.exception = ioe; return this; } {code} While setting the exception it asserts based on the exception maintained in the object is not null or not. While setting the exception first time, it will be null always and hence asserts. We should relax the check here. was: In the below code in TransactionContextImpl, {code:java} @Override public TransactionContext setException(Exception ioe) { assert exception != null; this.exception = ioe; return this; } {code} While setting the exception it asserts the exception maintained in the object is not null or not. While setting the exception first time, it will be null always and hence asserts. We should relax the check here. > Remove the assertion while setting the exception in TransactionContextImpl > -- > > Key: RATIS-394 > URL: https://issues.apache.org/jira/browse/RATIS-394 > Project: Ratis > Issue Type: Improvement >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-394.000.patch > > > In the below code in TransactionContextImpl, > {code:java} > @Override > public TransactionContext setException(Exception ioe) { > assert exception != null; > this.exception = ioe; > return this; > } > {code} > While setting the exception it asserts based on the exception maintained in > the object is not null or not. While setting the exception first time, it > will be null always and hence asserts. We should relax the check here. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-386) Raft Client Async API's should honor Retry Policy
[ https://issues.apache.org/jira/browse/RATIS-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-386: -- Attachment: RATIS-386.000.patch > Raft Client Async API's should honor Retry Policy > -- > > Key: RATIS-386 > URL: https://issues.apache.org/jira/browse/RATIS-386 > Project: Ratis > Issue Type: Improvement > Components: client >Affects Versions: 0.3.0 >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-386.000.patch > > > Raft client sync Api has support for retry policies. Similarly, for Async > API's including watch Api, support for Retry Policy is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-386) Raft Client Async API's should honor Retry Policy
[ https://issues.apache.org/jira/browse/RATIS-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686481#comment-16686481 ] Shashikant Banerjee commented on RATIS-386: --- Thanks [~szetszwo] for the comments. Moving the retryPolicy check to here: {code:java} private CompletableFuture<RaftClientReply> sendRequestWithRetryAsync( RaftClientRequest request, int attemptCount) { LOG.debug("{}: send* {}", clientId, request); return clientRpc.sendRequestAsync(request).thenApply(reply -> { LOG.info("{}: receive* {}", clientId, reply); reply = handleNotLeaderException(request, reply); if (reply == null) { if (!retryPolicy.shouldRetry(attemptCount)) { LOG.info(" fail with max attempts failed"); reply = new RaftClientReply(request, new RaftException("Failed " + request + " for " + attemptCount + " attempts with " + retryPolicy), null); } } if (reply != null) { getSlidingWindow(request).receiveReply( request.getSeqNum(), reply, this::sendRequestWithRetryAsync); } return reply; }).exceptionally(e -> { if (LOG.isTraceEnabled()) { LOG.trace(clientId + ": Failed " + request, e); } else { LOG.debug("{}: Failed {} with {}", clientId, request, e); } e = JavaUtils.unwrapCompletionException(e); if (e instanceof GroupMismatchException) { throw new CompletionException(e); } else if (e instanceof IOException) { handleIOException(request, (IOException)e, null); } else { throw new CompletionException(e); } return null; }); }{code} In case clientRpc.sendRequestAsync(request) times out, it will execute the code in the exceptionally path. In such a case, #sendRequestWithRetryAsync will keep on retrying #sendRequestAsync, as the retry validation will only be executed if clientRpc.sendRequestAsync(request) completes normally. Also, in case the retry validation check fails, we just return a null RaftClientReply for the sync API here without throwing any exception: {code:java} private RaftClientReply sendRequestWithRetry( Supplier<RaftClientRequest> supplier) throws InterruptedIOException, StateMachineException, GroupMismatchException { for(int attemptCount = 0;; attemptCount++) { final RaftClientRequest request = supplier.get(); final RaftClientReply reply = sendRequest(request); if (reply != null) { return reply; } if (!retryPolicy.shouldRetry(attemptCount)) { return null; } try { retryPolicy.getSleepTime().sleep(); } catch (InterruptedException e) { throw new InterruptedIOException("retry policy=" + retryPolicy); } } } {code} I think we probably should have the same result for the sync/async APIs here. Let me know if I am missing something here. > Raft Client Async API's should honor Retry Policy > -- > > Key: RATIS-386 > URL: https://issues.apache.org/jira/browse/RATIS-386 > Project: Ratis > Issue Type: Improvement > Components: client >Affects Versions: 0.3.0 >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-386.000.patch > > > Raft client sync Api has support for retry policies. Similarly, for Async > API's including watch Api, support for Retry Policy is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
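The gap described above is that an exceptional completion (for example an RPC timeout) bypasses the shouldRetry check, so the request can be resubmitted forever. A generic, non-Ratis sketch of the intended behaviour follows: every attempt, whether it produces a reply or an exception, is charged against the retry policy. The class and interface names here are made up for illustration.
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

// Generic sketch (not Ratis code): the exceptional path also consults the policy,
// so a request whose RPC keeps timing out eventually fails instead of retrying forever.
public final class RetryBudgetSketch {
  interface Policy {
    boolean shouldRetry(int attemptCount);
  }

  static <T> CompletableFuture<T> sendWithRetry(
      Supplier<CompletableFuture<T>> rpc, Policy policy, int attemptCount) {
    return rpc.get().handle((reply, error) -> {
      if (error == null) {
        // Normal completion: hand the reply back.
        return CompletableFuture.completedFuture(reply);
      }
      if (!policy.shouldRetry(attemptCount)) {
        // Budget exhausted: surface the last failure instead of retrying again.
        final CompletableFuture<T> failed = new CompletableFuture<>();
        failed.completeExceptionally(new CompletionException(error));
        return failed;
      }
      // Charge this failed attempt against the policy and try again.
      return sendWithRetry(rpc, policy, attemptCount + 1);
    }).thenCompose(f -> f);
  }

  private RetryBudgetSketch() { }
}
{code}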
[jira] [Comment Edited] (RATIS-386) Raft Client Async API's should honor Retry Policy
[ https://issues.apache.org/jira/browse/RATIS-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686481#comment-16686481 ] Shashikant Banerjee edited comment on RATIS-386 at 11/15/18 1:07 AM: - Thanks [~szetszwo] for the comments. Moving the retryPolicy Check to here : {code:java} private CompletableFuture sendRequestAsync( RaftClientRequest request, intattemptCount) { LOG.debug("{}: send* {}", clientId, request); return clientRpc.sendRequestAsync(request).thenApply(reply -> { LOG.info("{}: receive* {}", clientId, reply); reply = handleNotLeaderException(request, reply); if (reply == null) { if (!retryPolicy.shouldRetry(attemptCount)) { LOG.info(" fail with max attempts failed"); reply = new RaftClientReply(request, new RaftException("Failed " + request + " for " + attemptCount + " attempts with " + retryPolicy), null); } } if (reply != null) { getSlidingWindow(request).receiveReply( request.getSeqNum(), reply, this::sendRequestWithRetryAsync); } return reply; }).exceptionally(e -> { if (LOG.isTraceEnabled()) { LOG.trace(clientId + ": Failed " + request, e); } else { LOG.debug("{}: Failed {} with {}", clientId, request, e); } e = JavaUtils.unwrapCompletionException(e); if (e instanceof GroupMismatchException) { throw new CompletionException(e); } else if (e instanceof IOException) { handleIOException(request, (IOException)e, null); } else { throw new CompletionException(e); } return null; }); }{code} In case, clientRpc.sendRequestAsync(request) timeout, it will execute the code in exceptionally Path. In such case, #sendRequestWithRetryAsync will keep on retrying calling #sendRequestAsync as the retry validation will only be executed if clientRpc.sendRequestAsync(request) completes normally. Also, in case the retryValidation check fails, we just return null for RaftClientReply for the sync API here without throwing any exception: {code:java} private RaftClientReply sendRequestWithRetry( Supplier supplier) throws InterruptedIOException, StateMachineException, GroupMismatchException { for(int attemptCount = 0;; attemptCount++) { final RaftClientRequest request = supplier.get(); final RaftClientReply reply = sendRequest(request); if (reply != null) { return reply; } if (!retryPolicy.shouldRetry(attemptCount)) { return null; } try { retryPolicy.getSleepTime().sleep(); } catch (InterruptedException e) { throw new InterruptedIOException("retry policy=" + retryPolicy); } } } {code} I think ,we probably should have same result for sync/async api's here. Let me know if i am missing something here. was (Author: shashikant): Thanks [~szetszwo] for the comments. 
Moving the retryPolicy Check to here : {code:java} private CompletableFuture sendRequestWithRetryAsync( RaftClientRequest request, intattemptCount) { LOG.debug("{}: send* {}", clientId, request); return clientRpc.sendRequestAsync(request).thenApply(reply -> { LOG.info("{}: receive* {}", clientId, reply); reply = handleNotLeaderException(request, reply); if (reply == null) { if (!retryPolicy.shouldRetry(attemptCount)) { LOG.info(" fail with max attempts failed"); reply = new RaftClientReply(request, new RaftException("Failed " + request + " for " + attemptCount + " attempts with " + retryPolicy), null); } } if (reply != null) { getSlidingWindow(request).receiveReply( request.getSeqNum(), reply, this::sendRequestWithRetryAsync); } return reply; }).exceptionally(e -> { if (LOG.isTraceEnabled()) { LOG.trace(clientId + ": Failed " + request, e); } else { LOG.debug("{}: Failed {} with {}", clientId, request, e); } e = JavaUtils.unwrapCompletionException(e); if (e instanceof GroupMismatchException) { throw new CompletionException(e); } else if (e instanceof IOException) { handleIOException(request, (IOException)e, null); } else { throw new CompletionException(e); } return null; }); }{code} In case, clientRpc.sendRequestAsync(request) timeout, it will execute the code in exceptionally Path. In such case, #sendRequestWithRetryAsync will keep on retrying calling #sendRequestAsync as the retry validation will only be executed if clientRpc.sendRequestAsync(request) completes normally. Also, in case the retryValidation check fails, we just return null for RaftClientReply for the sync API here without throwing any exception: {code:java} private RaftClientReply sendRequestWithRetry( Supplier supplier) throws InterruptedIOException, StateMachineException, GroupMismatchException { for(int attemptCount = 0;; attemptCount++) { final Raft
[jira] [Updated] (RATIS-386) Raft Client Async API's should honor Retry Policy
[ https://issues.apache.org/jira/browse/RATIS-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-386: -- Attachment: RATIS-386.001.patch > Raft Client Async API's should honor Retry Policy > -- > > Key: RATIS-386 > URL: https://issues.apache.org/jira/browse/RATIS-386 > Project: Ratis > Issue Type: Improvement > Components: client >Affects Versions: 0.3.0 >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-386.000.patch, RATIS-386.001.patch > > > Raft client sync Api has support for retry policies. Similarly, for Async > API's including watch Api, support for Retry Policy is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-386) Raft Client Async API's should honor Retry Policy
[ https://issues.apache.org/jira/browse/RATIS-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-386: -- Attachment: (was: RATIS-386.001.patch) > Raft Client Async API's should honor Retry Policy > -- > > Key: RATIS-386 > URL: https://issues.apache.org/jira/browse/RATIS-386 > Project: Ratis > Issue Type: Improvement > Components: client >Affects Versions: 0.3.0 >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-386.001.patch > > > Raft client sync Api has support for retry policies. Similarly, for Async > API's including watch Api, support for Retry Policy is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (RATIS-386) Raft Client Async API's should honor Retry Policy
[ https://issues.apache.org/jira/browse/RATIS-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-386: -- Attachment: RATIS-386.001.patch > Raft Client Async API's should honor Retry Policy > -- > > Key: RATIS-386 > URL: https://issues.apache.org/jira/browse/RATIS-386 > Project: Ratis > Issue Type: Improvement > Components: client >Affects Versions: 0.3.0 >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.3.0 > > Attachments: RATIS-386.001.patch > > > Raft client sync Api has support for retry policies. Similarly, for Async > API's including watch Api, support for Retry Policy is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)