[jira] [Commented] (RATIS-688) Avoid buffer copies while submitting client requests in Ratis

2019-09-23 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936231#comment-16936231
 ] 

Tsz Wo Nicholas Sze commented on RATIS-688:
---

It seems there is no simple way to avoid data copying.  As stated in 
https://developers.google.com/protocol-buffers/docs/techniques#large-data ,
bq. Protocol Buffers are not designed to handle large messages. As a general 
rule of thumb, if you are dealing in messages larger than a megabyte each, it 
may be time to consider an alternate strategy. 

One possible way is to break ContainerCommandRequestProto into two parts, a 
header and a body.  Serialize the header and the body to individual 
ByteStrings.  Then, build the final ByteString using ByteString.concat(..).  
This works since, 
bq. In general, the concatenate involves no copying.
according to 
https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString.html#concat-com.google.protobuf.ByteString-
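
To illustrate why concatenation involves no copying, here is a minimal rope-style sketch in plain Java. This is a simplified stand-in, not the protobuf-java implementation (the real ByteString uses a balanced rope); the point is that concat only records references to the pieces, and bytes are copied once, only when a contiguous view is finally needed:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal rope-style byte string (illustrative sketch only): concat stores
// references to the pieces instead of copying their bytes.
final class Rope {
  private final List<byte[]> segments = new ArrayList<>();

  static Rope of(byte[] bytes) {
    Rope r = new Rope();
    r.segments.add(bytes);  // keep a reference, no byte copy
    return r;
  }

  Rope concat(Rope other) {
    Rope r = new Rope();
    r.segments.addAll(this.segments);  // O(#segments), no byte copy
    r.segments.addAll(other.segments);
    return r;
  }

  int size() {
    int n = 0;
    for (byte[] s : segments) {
      n += s.length;
    }
    return n;
  }

  // Copying happens only here, when a contiguous view is finally required,
  // e.g. when the serialized message is handed to the transport.
  byte[] toByteArray() {
    byte[] out = new byte[size()];
    int pos = 0;
    for (byte[] s : segments) {
      System.arraycopy(s, 0, out, pos, s.length);
      pos += s.length;
    }
    return out;
  }
}

public class ConcatDemo {
  public static void main(String[] args) {
    byte[] header = {1, 2};
    byte[] body = {3, 4, 5};
    Rope message = Rope.of(header).concat(Rope.of(body));
    System.out.println(message.size());  // prints 5; no bytes copied yet
  }
}
```

With this layout, serializing the header and the body separately and concatenating defers the single unavoidable copy to the transport boundary.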

> Avoid buffer copies while submitting client requests in Ratis
> -
>
> Key: RATIS-688
> URL: https://issues.apache.org/jira/browse/RATIS-688
> Project: Ratis
>  Issue Type: Bug
>  Components: client, server
>Reporter: Shashikant Banerjee
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Fix For: 0.4.0
>
>
> Currently, while sending write requests to Ratis from Ozone, a protobuf 
> object containing the data is encoded, and the resulting protobuf is then 
> converted to a ByteString, which internally copies the buffer embedded 
> inside the protobuf again so that it can be submitted to the Ratis client. 
> Similarly, while building the appendRequestProto for the appendEntries 
> request, the data may be copied yet again. The idea here is to provide a 
> client API that passes the raw data (stateMachine data) to the Ratis client 
> separately, without the copying overhead. 
>  
> {code:java}
> private CompletableFuture<RaftClientReply> sendRequestAsync(
>     ContainerCommandRequestProto request) {
>   try (Scope scope = GlobalTracer.get()
>       .buildSpan("XceiverClientRatis." + request.getCmdType().name())
>       .startActive(true)) {
>     ContainerCommandRequestProto finalPayload =
>         ContainerCommandRequestProto.newBuilder(request)
>             .setTraceID(TracingUtil.exportCurrentSpan())
>             .build();
>     boolean isReadOnlyRequest = HddsUtils.isReadOnly(finalPayload);
>     // finalPayload already has the ByteString data embedded;
>     // toByteString() involves yet another copy.
>     ByteString byteString = finalPayload.toByteString();
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("sendCommandAsync {} {}", isReadOnlyRequest,
>           sanitizeForDebug(finalPayload));
>     }
>     return isReadOnlyRequest ?
>         getClient().sendReadOnlyAsync(() -> byteString) :
>         getClient().sendAsync(() -> byteString);
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-677) Log entry marked corrupt due to ChecksumException

2019-09-23 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936185#comment-16936185
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

It seems not useful to track the corrupted entries for the moment.  Let's 
track them if we find it useful later on.

> Log entry marked corrupt due to ChecksumException
> -
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
> Attachments: r677_20190913.patch, r677_20190919.patch, 
> r677_20190919b.patch, r677_20190920.patch
>
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.  
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM; two datanodes failed to appear at the 
> SCM side. 
>  
> Checked these two failed nodes; the datanode process is still running. In 
> the logfile, I found a lot of the following errors. 
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Updated] (RATIS-677) Log entry marked corrupt due to ChecksumException

2019-09-23 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-677:
--
Summary: Log entry marked corrupt due to ChecksumException  (was: Logentry 
marked corrupt due to ChecksumException)






[jira] [Commented] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-20 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934681#comment-16934681
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

r677_20190920.patch: adds one more test and some minor changes.






[jira] [Updated] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-20 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-677:
--
Attachment: r677_20190920.patch






[jira] [Commented] (RATIS-686) Enforce standard getter and setter names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933919#comment-16933919
 ] 

Tsz Wo Nicholas Sze commented on RATIS-686:
---

r686_20190919c.patch: fixes a checkstyle warning.

> Enforce standard getter and setter names in config keys
> ---
>
> Key: RATIS-686
> URL: https://issues.apache.org/jira/browse/RATIS-686
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r686_20190919.patch, r686_20190919b.patch, 
> r686_20190919c.patch
>
>
> In the conf keys, some getter/setter methods are missing, and some of the 
> method names do not follow the standard naming convention.
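
The kind of check this implies can be sketched with reflection: for every set-prefixed method in a conf-keys class, require a matching get-prefixed method. The class and key names below are hypothetical illustrations, not actual Ratis conf keys, and the sketch ignores the is-prefix convention for boolean properties:

```java
import java.lang.reflect.Method;
import java.util.HashSet;
import java.util.Set;

// Sketch of a naming-convention check for conf-keys classes: every
// setFoo must have a matching getFoo (JavaBeans-style names).
public class ConfKeyNaming {

  // Hypothetical conf-keys class that follows the convention.
  static class ExampleConfKeys {
    private static int bufferSize = 1024;
    static int getBufferSize() { return bufferSize; }
    static void setBufferSize(int v) { bufferSize = v; }
  }

  // Hypothetical conf-keys class that violates it: setter with no getter.
  static class BadConfKeys {
    private static int timeoutMs;
    static void setTimeoutMs(int v) { timeoutMs = v; }
  }

  // Returns the names of setters that have no matching getter.
  static Set<String> settersWithoutGetters(Class<?> keys) {
    Set<String> names = new HashSet<>();
    for (Method m : keys.getDeclaredMethods()) {
      names.add(m.getName());
    }
    Set<String> missing = new HashSet<>();
    for (String n : names) {
      if (n.startsWith("set") && !names.contains("get" + n.substring(3))) {
        missing.add(n);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    System.out.println(settersWithoutGetters(ExampleConfKeys.class));  // []
    System.out.println(settersWithoutGetters(BadConfKeys.class));
  }
}
```

Run as a unit test, such a check catches a missing or nonstandard accessor at build time rather than in review.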





[jira] [Updated] (RATIS-686) Enforce standard getter and setter names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-686:
--
Attachment: r686_20190919c.patch






[jira] [Updated] (RATIS-686) Enforce standard getter and setter names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-686:
--
Attachment: r686_20190919b.patch






[jira] [Commented] (RATIS-686) Enforce standard getter and setter names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933811#comment-16933811
 ] 

Tsz Wo Nicholas Sze commented on RATIS-686:
---

r686_20190919b.patch: fixes a checkstyle warning.






[jira] [Updated] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-677:
--
Attachment: r677_20190919b.patch






[jira] [Commented] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933806#comment-16933806
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

r677_20190919b.patch: adds a test.






[jira] [Commented] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933768#comment-16933768
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

Discussed with [~jnpandey]; we agreed that, when a log file is corrupted, the 
reader should just stop reading the file and return the earlier entries.

r677_20190919.patch: sync'ed with master.  Will add a test.
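
The agreed behavior can be sketched as follows. The framing here is a hypothetical simplification, not Ratis's actual on-disk format: each entry is stored as [length][payload][crc32], and when the stored checksum does not match the computed one, the reader stops and returns the entries read so far instead of failing the whole load:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Sketch of stop-on-corruption log reading (illustrative framing only):
// entries are [length][payload][crc32]; a checksum mismatch or a truncated
// tail ends the read, keeping the earlier entries.
public class TruncatingLogReader {

  static long crc(byte[] payload) {
    CRC32 c = new CRC32();
    c.update(payload, 0, payload.length);
    return c.getValue();
  }

  static byte[] writeEntries(List<byte[]> payloads) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    for (byte[] p : payloads) {
      out.writeInt(p.length);
      out.write(p);
      out.writeLong(crc(p));
    }
    return bos.toByteArray();
  }

  static List<byte[]> readEntries(byte[] file) {
    List<byte[]> entries = new ArrayList<>();
    ByteBuffer in = ByteBuffer.wrap(file);
    while (in.remaining() >= Integer.BYTES) {
      int len = in.getInt();
      if (len < 0 || in.remaining() < len + Long.BYTES) {
        break;  // truncated tail: stop, keep earlier entries
      }
      byte[] payload = new byte[len];
      in.get(payload);
      long stored = in.getLong();
      if (stored != crc(payload)) {
        break;  // corrupt entry: stop, keep earlier entries
      }
      entries.add(payload);
    }
    return entries;
  }

  public static void main(String[] args) throws IOException {
    byte[] file = writeEntries(Arrays.asList(
        new byte[]{1}, new byte[]{2}, new byte[]{3}));
    file[file.length - 1] ^= 0xFF;  // corrupt the last entry's checksum
    System.out.println(readEntries(file).size());  // prints 2
  }
}
```

This matches the scenario in the description: an entry written with checksum 0 (or otherwise corrupted) no longer blocks the server from loading the valid prefix of the segment.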




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-677:
--
Attachment: r677_20190919.patch

> Logentry marked corrupt due to ChecksumException
> 
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
> Attachments: r677_20190913.patch, r677_20190919.patch
>
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear at the SCM side.
> 
> Checked these two failed nodes; the datanode process is still running. In the
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Updated] (RATIS-686) Enforce standard getter and setter names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-686:
--
Attachment: r686_20190919.patch

> Enforce standard getter and setter names in config keys
> ---
>
> Key: RATIS-686
> URL: https://issues.apache.org/jira/browse/RATIS-686
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r686_20190919.patch
>
>
> In the conf keys, some getter/setter methods are missing, and some of the
> method names do not follow the standard naming convention.





[jira] [Created] (RATIS-686) Enforce standard getter and setter names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)
Tsz Wo Nicholas Sze created RATIS-686:
-

 Summary: Enforce standard getter and setter names in config keys
 Key: RATIS-686
 URL: https://issues.apache.org/jira/browse/RATIS-686
 Project: Ratis
  Issue Type: Improvement
  Components: conf
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Attachments: r686_20190919.patch

In the conf keys, some getter/setter methods are missing, and some of the
method names do not follow the standard naming convention.
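The convention the issue asks for can be sketched as follows; the class, key, and field names below are illustrative assumptions, not code from the patch.

```java
// Illustrative sketch of standard conf-key getter/setter naming: the methods
// are named get<Property>() / set<Property>(..) after the property they wrap.
// All names here are made-up examples, not Ratis code.
public class ExampleConfigKeys {
  public static final String RETRY_INTERVAL_KEY = "example.retry.interval.ms";
  public static final int RETRY_INTERVAL_DEFAULT = 300;

  private static int retryInterval = RETRY_INTERVAL_DEFAULT;

  // Standard getter name, matching the field name.
  public static int getRetryInterval() {
    return retryInterval;
  }

  // Standard setter name, matching the field name.
  public static void setRetryInterval(int interval) {
    retryInterval = interval;
  }

  public static void main(String[] args) {
    setRetryInterval(500);
    System.out.println(getRetryInterval()); // prints 500
  }
}
```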





[jira] [Commented] (RATIS-681) Fix non-standard field names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933667#comment-16933667
 ] 

Tsz Wo Nicholas Sze commented on RATIS-681:
---

r681_20190919.patch: reverts a whitespace change.

Since the change is minor, will commit the patch without waiting for Jenkins.

> Fix non-standard field names in config keys
> ---
>
> Key: RATIS-681
> URL: https://issues.apache.org/jira/browse/RATIS-681
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r681_20190917.patch, r681_20190919.patch
>
>
> The following fields in conf were not following the naming convention:
> - GrpcConfigKeys.TLS.TLS_ROOT_PREFIX
> - RaftServerConfigKeys.SLEEP_DEVIATION_THRESHOLD
> - RaftServerConfigKeys.Snapshot.RETENTION_POLICY_KEY





[jira] [Updated] (RATIS-681) Fix non-standard field names in config keys

2019-09-19 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-681:
--
Attachment: r681_20190919.patch

> Fix non-standard field names in config keys
> ---
>
> Key: RATIS-681
> URL: https://issues.apache.org/jira/browse/RATIS-681
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r681_20190917.patch, r681_20190919.patch
>
>
> The following fields in conf were not following the naming convention:
> - GrpcConfigKeys.TLS.TLS_ROOT_PREFIX
> - RaftServerConfigKeys.SLEEP_DEVIATION_THRESHOLD
> - RaftServerConfigKeys.Snapshot.RETENTION_POLICY_KEY





[jira] [Commented] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-18 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932731#comment-16932731
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

> ... We should make sure that the corresponding entries and log segment is 
> considered as corrupted in later operations.

When the corrupted log entries are not loaded into memory, they will be
treated as non-existent.
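For context, the ChecksumException discussed in this thread comes from a per-entry checksum comparison during entry decoding. A minimal CRC32-based sketch of that check follows; the actual Ratis codec and on-disk framing differ, this only illustrates the compare-and-throw step.

```java
import java.util.zip.CRC32;

// Sketch of a per-entry checksum check (assumption: illustrative only, not
// the actual Ratis SegmentedRaftLogReader.decodeEntry implementation).
public class EntryChecksum {
  static void verify(byte[] entryBytes, long readChecksum) {
    final CRC32 crc = new CRC32();
    crc.update(entryBytes, 0, entryBytes.length);
    final long calculated = crc.getValue();
    if (calculated != readChecksum) {
      // Mirrors the log message: "Calculated checksum is X but read checksum Y".
      throw new IllegalStateException("LogEntry is corrupt. Calculated checksum is "
          + calculated + " but read checksum " + readChecksum);
    }
  }

  public static void main(String[] args) {
    final byte[] data = {1, 2, 3};
    final CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    verify(data, crc.getValue()); // checksums match, no exception
    // verify(data, 0) would throw; a read checksum of 0, as in the report,
    // is typical of a preallocated, zero-filled region of a segment file.
  }
}
```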

> Logentry marked corrupt due to ChecksumException
> 
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
> Attachments: r677_20190913.patch
>
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear at the SCM side.
> 
> Checked these two failed nodes; the datanode process is still running. In the
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Updated] (RATIS-681) Fix non-standard field names in config keys

2019-09-17 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-681:
--
Attachment: r681_20190917.patch

> Fix non-standard field names in config keys
> ---
>
> Key: RATIS-681
> URL: https://issues.apache.org/jira/browse/RATIS-681
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r681_20190917.patch
>
>
> The following fields in conf were not following the naming convention:
> - GrpcConfigKeys.TLS.TLS_ROOT_PREFIX
> - RaftServerConfigKeys.SLEEP_DEVIATION_THRESHOLD
> - RaftServerConfigKeys.Snapshot.RETENTION_POLICY_KEY





[jira] [Created] (RATIS-681) Fix non-standard field names in config keys

2019-09-17 Thread Tsz Wo Nicholas Sze (Jira)
Tsz Wo Nicholas Sze created RATIS-681:
-

 Summary: Fix non-standard field names in config keys
 Key: RATIS-681
 URL: https://issues.apache.org/jira/browse/RATIS-681
 Project: Ratis
  Issue Type: Improvement
  Components: conf
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze


The following fields in conf were not following the naming convention:
- GrpcConfigKeys.TLS.TLS_ROOT_PREFIX
- RaftServerConfigKeys.SLEEP_DEVIATION_THRESHOLD
- RaftServerConfigKeys.Snapshot.RETENTION_POLICY_KEY






[jira] [Commented] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-17 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931721#comment-16931721
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

> ... If we ignore the exception while reading a segment file wouldn't that 
> make the log segments inconsistent? ...

[~ljain], the idea is to stop reading the log once an exception has been
thrown, instead of stopping the server. The log segment in the file is
corrupted; we won't be able to correct it. The log segment in memory will be
truncated up to the point of the exception.
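The loading behavior described here can be sketched as below; this is an illustrative assumption, not the actual Ratis code, and the `Entry` type is a stand-in for the real log entry class.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: keep the valid prefix of a segment in memory and stop reading at
// the first corrupt entry, instead of failing server startup.
public class ValidPrefixLoader {
  static final class Entry {
    final long index;
    final boolean corrupt;
    Entry(long index, boolean corrupt) {
      this.index = index;
      this.corrupt = corrupt;
    }
  }

  static List<Entry> loadValidPrefix(List<Entry> onDisk) {
    final List<Entry> inMemory = new ArrayList<>();
    for (Entry e : onDisk) {
      if (e.corrupt) {
        break; // truncate here; later entries are treated as non-existent
      }
      inMemory.add(e);
    }
    return inMemory;
  }

  public static void main(String[] args) {
    final List<Entry> onDisk = List.of(
        new Entry(1, false), new Entry(2, false),
        new Entry(3, true),  new Entry(4, false));
    // Only the two entries before the corruption survive in memory.
    System.out.println(loadValidPrefix(onDisk).size()); // prints 2
  }
}
```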

> Logentry marked corrupt due to ChecksumException
> 
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
> Attachments: r677_20190913.patch
>
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear at the SCM side.
> 
> Checked these two failed nodes; the datanode process is still running. In the
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Commented] (RATIS-655) Change LeaderState and FollowerState and to use RaftGroupMemberId

2019-09-16 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930844#comment-16930844
 ] 

Tsz Wo Nicholas Sze commented on RATIS-655:
---

> It is possible but it is going to be another big patch since getId() is used 
> a lot in the tests.

I mean that we should do it separately, if we are going to do it at all.

[~msingh], are you +1 on the current patch?

> Change LeaderState and FollowerState and  to use RaftGroupMemberId
> --
>
> Key: RATIS-655
> URL: https://issues.apache.org/jira/browse/RATIS-655
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r655_20190807.patch, r655_20190807b.patch, 
> r655_20190830.patch
>
>
> This is the last JIRA split from the huge patch in RATIS-605.





[jira] [Commented] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-13 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929563#comment-16929563
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

r677_20190913.patch: adds a configuration property so that the way log
corruption is handled can be changed.

Will add some tests once we have agreed on the approach.
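Such a configuration switch might look like the sketch below; the enum values, property name, and class are assumptions for illustration, not the actual names in the patch.

```java
// Sketch of a corruption-handling policy a conf property could select between.
// All names here are hypothetical, not taken from r677_20190913.patch.
public class LogCorruptionConf {
  enum CorruptionPolicy {
    EXCEPTION,       // previous behavior: throw and fail server startup
    WARN_AND_RETURN  // log a warning and keep the valid prefix of the segment
  }

  // Hypothetical property name for the sketch.
  static final String POLICY_KEY = "example.raft.log.corruption.policy";

  static CorruptionPolicy parsePolicy(String value) {
    return CorruptionPolicy.valueOf(value.trim().toUpperCase());
  }

  public static void main(String[] args) {
    System.out.println(parsePolicy("warn_and_return")); // prints WARN_AND_RETURN
  }
}
```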

> Logentry marked corrupt due to ChecksumException
> 
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
> Attachments: r677_20190913.patch
>
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear at the SCM side.
> 
> Checked these two failed nodes; the datanode process is still running. In the
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Updated] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-13 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-677:
--
Attachment: r677_20190913.patch

> Logentry marked corrupt due to ChecksumException
> 
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
> Attachments: r677_20190913.patch
>
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear at the SCM side.
> 
> Checked these two failed nodes; the datanode process is still running. In the
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Updated] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-13 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-677:
--
Component/s: server

> Logentry marked corrupt due to ChecksumException
> 
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear at the SCM side.
> 
> Checked these two failed nodes; the datanode process is still running. In the
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Commented] (RATIS-677) Logentry marked corrupt due to ChecksumException

2019-09-13 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929407#comment-16929407
 ] 

Tsz Wo Nicholas Sze commented on RATIS-677:
---

What is the expectation?  Should the server ignore the corrupted log and 
continue startup?

> Logentry marked corrupt due to ChecksumException
> 
>
> Key: RATIS-677
> URL: https://issues.apache.org/jira/browse/RATIS-677
> Project: Ratis
>  Issue Type: Bug
>Reporter: Sammi Chen
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
>
> Steps:
> 1.  Ran Teragen and generated a few GB of data in a 4-datanode cluster.
> 2.  Stopped the datanodes through ./stop-ozone.sh.
> 3.  Changed the ozone binaries.
> 4.  Started the cluster through ./start-ozone.sh.
> 5.  Two datanodes registered to SCM. Two datanodes failed to appear at the SCM side.
> 
> Checked these two failed nodes; the datanode process is still running. In the
> logfile, I found a lot of the following errors.
> 2019-09-12 21:06:45,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Attempting to start container services.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Background container scanner has been disabled.
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] INFO   - 
> Starting XceiverServerRatis ba17ad5e-714e-4d82-85d8-ff2e0737fcf9 at port 9858
> 2019-09-12 21:06:47,255 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at 10.120.110.183:9861 for past 2100 
> seconds.
> org.apache.ratis.protocol.ChecksumException: LogEntry is corrupt. Calculated 
> checksum is -134141393 but read checksum 0
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:299)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:185)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:121)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:94)
> at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:117)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:310)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:234)
> at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:204)
> at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
> at 
> org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:190)
> at 
> org.apache.ratis.server.impl.ServerState.(ServerState.java:120)
> at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:110)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)





[jira] [Commented] (RATIS-543) Ratis GRPC client produces excessive logging while writing data.

2019-09-11 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927990#comment-16927990
 ] 

Tsz Wo Nicholas Sze commented on RATIS-543:
---

[~ljain], thanks for reviewing and committing the patch.
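Until a build containing the patch is deployed, the per-reply INFO flood from this client logger can be turned down through log4j configuration. A hedged sketch: the fully-qualified logger name is assumed from the class shown in the quoted log lines, so verify it against your Ratis version before relying on it.

```properties
# Quiet the per-RaftClientReply INFO lines; WARN still surfaces real problems.
# Logger name assumed to be org.apache.ratis.grpc.client.GrpcClientProtocolClient.
log4j.logger.org.apache.ratis.grpc.client.GrpcClientProtocolClient=WARN
```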

> Ratis GRPC client produces excessive logging while writing data.
> 
>
> Key: RATIS-543
> URL: https://issues.apache.org/jira/browse/RATIS-543
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Aravindan Vijayan
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
>  Labels: ozone
> Fix For: 0.4.0
>
> Attachments: r543_20190827.patch
>
>
> {code}
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1352, SUCCESS, logIndex=15195,
>  commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15201, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15189, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15186]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1355, SUCCESS, logIndex=15196,
>  commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15201, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15189, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15186]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1357, SUCCESS, logIndex=15197,
>  commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15201, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15189, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15186]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-C46A037579AA->5a076d87-abf9-4ade-ae37-adab741d99a6: receive 
> RaftClientReply:client-C46A037579AA->5a076d87-abf9-4ade-ae37-adab741d99a6@group-AE803AF42C5D,
>  cid=1370, SUCCESS, logIndex=0, com
> mits[5a076d87-abf9-4ade-ae37-adab741d99a6:c16423, 
> 6e21905d-9796-4248-834e-ed97ea6763ef:c16422, 
> 34e8d6e5-456f-4e2a-99a5-4f21fd9c4a7e:c16423]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-EBF618C3F968->a5729949-67f1-496e-a0d3-1bfc0e139836: receive 
> RaftClientReply:client-EBF618C3F968->a5729949-67f1-496e-a0d3-1bfc0e139836@group-4E41299EA191,
>  cid=1376, SUCCESS, logIndex=0, com
> mits[a5729949-67f1-496e-a0d3-1bfc0e139836:c4764, 
> 111d4c23-756f-4c8a-a48d-aa2a327a5179:c4764, 
> 287eccfb-8461-419a-8732-529d042380b3:c4764]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-4D5E3CDC8889->0bb45975-b0d2-499e-85cc-22ea22c57ecb: receive 
> RaftClientReply:client-4D5E3CDC8889->0bb45975-b0d2-499e-85cc-22ea22c57ecb@group-D1BB7F32F754,
>  cid=1382, FAILED org.apache.ratis.
> protocol.NotLeaderException: Server 0bb45975-b0d2-499e-85cc-22ea22c57ecb is 
> not the leader (f1a756c3-6b42-4ece-8093-dbcdac5f8d5b:10.17.200.18:9858). 
> Request must be sent to leader., logIndex=0, 
> commits[0bb45975-b0d2-499e-85cc-22ea22c57ecb:c15358, 6c7a
> 780f-5474-49da-b880-3eaf69d9d83d:c15358, 
> f1a756c3-6b42-4ece-8093-dbcdac5f8d5b:c15358]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1359, SUCCESS, logIndex=15208, 
> commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15210, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15201, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15189]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1362, SUCCESS, logIndex=15209, 
> commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15210, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15201, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15189]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1363, SUCCESS, logIndex=15210, 
> commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15210, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15201, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15189]
> 19/05/03 10:23:32 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> 

[jira] [Updated] (RATIS-543) Ratis GRPC client produces excessive logging while writing data.

2019-09-09 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-543:
--
Attachment: r543_20190827.patch

> Ratis GRPC client produces excessive logging while writing data.
> 
>
> Key: RATIS-543
> URL: https://issues.apache.org/jira/browse/RATIS-543
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Aravindan Vijayan
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
>  Labels: ozone
> Attachments: r543_20190827.patch
>
>

[jira] [Commented] (RATIS-543) Ratis GRPC client produces excessive logging while writing data.

2019-09-09 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926131#comment-16926131
 ] 

Tsz Wo Nicholas Sze commented on RATIS-543:
---

[~ljain], you are right.  The patch does not belong to this issue.  Deleted.

> Ratis GRPC client produces excessive logging while writing data.
> 
>
> Key: RATIS-543
> URL: https://issues.apache.org/jira/browse/RATIS-543
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Aravindan Vijayan
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
>  Labels: ozone
> Attachments: r543_20190827.patch
>
>

[jira] [Updated] (RATIS-543) Ratis GRPC client produces excessive logging while writing data.

2019-09-09 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-543:
--
Attachment: (was: r485_20190827.patch)

> Ratis GRPC client produces excessive logging while writing data.
> 
>
> Key: RATIS-543
> URL: https://issues.apache.org/jira/browse/RATIS-543
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Aravindan Vijayan
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
>  Labels: ozone
>

[jira] [Commented] (RATIS-619) Avoid loading cache with pre-snapshot entries for the group

2019-09-06 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924473#comment-16924473
 ] 

Tsz Wo Nicholas Sze commented on RATIS-619:
---

> Yes, how about we divide this into 2 tasks, in the first one, we just avoid 
> loading the segment data ...

I guess you mean reading the segment files but not caching them.  I doubt that 
approach is a real improvement, since caching is cheap but reading the files is 
expensive.  If we have already paid the expensive cost of reading the files, why 
not keep the entries in the cache as well?  Otherwise, when the leader needs to 
send the log to a follower, we must pay the expensive read cost again.  If the 
log entries are never re-cached, we pay that cost every time a follower needs 
the entries.
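The cache-on-read idea can be sketched as follows, under assumed names (SegmentCache and loadFromDisk are illustrations, not the actual Ratis API): the segment file is read at most once, and later accesses, e.g. for follower catch-up, are served from memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: once a segment file has been read, keep its entries cached
 *  so a later follower catch-up does not pay the file-read cost again.
 *  Hypothetical names; not the actual Ratis LogSegment API. */
class SegmentCache {
  private final Map<Long, List<String>> cache = new HashMap<>();

  List<String> getEntries(long startIndex) {
    // Read the file only on the first access; later accesses hit the cache.
    return cache.computeIfAbsent(startIndex, this::loadFromDisk);
  }

  private List<String> loadFromDisk(long startIndex) {
    // Stand-in for the expensive segment-file read.
    List<String> entries = new ArrayList<>();
    entries.add("entry@" + startIndex);
    return entries;
  }
}
```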

> Avoid loading cache with pre-snapshot entries for the group
> ---
>
> Key: RATIS-619
> URL: https://issues.apache.org/jira/browse/RATIS-619
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-619.01.patch, RATIS-619.02.patch
>
>
> Even after taking a snapshot, the raft log loads all the segments in the log
> {code}
> 2019-07-01 23:22:47,481 [pool-18-thread-1] INFO   - Setting the last 
> applied index to (t:2, i:15237039)
> {code}
> {code}
> 2019-07-01 23:22:47,516 INFO org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.log.statemachine.data.caching.enabled = true (custom)
> 2019-07-01 23:22:47,531 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 62941ca3-f244-4298-8497-f4c0bd57430a:group-4D230AB58084 set configuration 0: 
> [1f3d7936-cb4e-4b68-86ed-578070472dea:1
> 0.17.213.36:9858, 62941ca3-f244-4298-8497-f4c0bd57430a:10.17.213.35:9858, 
> f07c1f87-b377-40d9-8c56-4f1440c4fa77:10.17.213.37:9858], old=null at 0
> 2019-07-01 23:22:47,578 INFO org.apache.hadoop.http.HttpServer2: Jetty bound 
> to port 9882
> 2019-07-01 23:22:47,579 INFO org.eclipse.jetty.server.Server: 
> jetty-9.3.24.v20180605, build timestamp: 2018-06-05T10:11:56-07:00, git hash: 
> 84205aa28f11a4f31f2a3b86d1bba2cc8ab69827
> 2019-07-01 23:22:47,601 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7461 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_0-7460
> 2019-07-01 23:22:47,608 INFO org.eclipse.jetty.server.handler.ContextHandler: 
> Started 
> o.e.j.s.ServletContextHandler@6ce90bc5{/logs,file:///var/log/ozone/,AVAILABLE}
> 2019-07-01 23:22:47,608 INFO org.eclipse.jetty.server.handler.ContextHandler: 
> Started 
> o.e.j.s.ServletContextHandler@4b1c0397{/static,jar:file:/var/lib/hadoop-ozone/ozone-0.5.0-SNAPSHOT/share
> /ozone/lib/hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar!/webapps/static,AVAILABLE}
> 2019-07-01 23:22:47,635 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7386 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_7461-14846
> 2019-07-01 23:22:47,663 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7440 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_14847-22286
> 2019-07-01 23:22:47,664 INFO org.eclipse.jetty.server.handler.ContextHandler: 
> Started 
> o.e.j.w.WebAppContext@8a62297{/,file:///tmp/jetty-0.0.0.0-9882-hddsDatanode-_-any-7539213566265642568.di
> r/webapp/,AVAILABLE}{/hddsDatanode}
> 2019-07-01 23:22:47,681 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7353 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_22287-29639
> 2019-07-01 23:22:47,695 INFO org.eclipse.jetty.server.AbstractConnector: 
> Started ServerConnector@5116ac09{HTTP/1.1,[http/1.1]}{0.0.0.0:9882}
> 2019-07-01 23:22:47,695 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7291 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_29640-36930
> 2019-07-01 23:22:47,695 INFO org.eclipse.jetty.server.Server: Started @56648ms
> 2019-07-01 23:22:47,695 INFO org.apache.hadoop.hdds.server.BaseHttpServer: 
> HTTP server of HDDSDATANODE is listening at http://0.0.0.0:9882
> 2019-07-01 23:22:47,709 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7049 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_36931-43979
> 2019-07-01 23:22:47,732 INFO 
> 

[jira] [Commented] (RATIS-619) Avoid loading cache with pre-snapshot entries for the group

2019-09-05 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923667#comment-16923667
 ] 

Tsz Wo Nicholas Sze commented on RATIS-619:
---

> If the leader restarts here, then the follower will not be able to read the 
> logs from the leader because log index 99 and after are not loaded. ...

We may load the log at the time when this happens.  This is going to be a big 
improvement when there are a lot of log entries stored on disk.
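That on-demand loading can be sketched as below, again under hypothetical names (LazySegment is an illustration, not a Ratis class): at startup only a reference to each pre-snapshot segment is kept, and the file is read only if a follower actually asks for those entries.

```java
import java.util.List;
import java.util.function.Supplier;

/** Sketch: defer reading a pre-snapshot segment file until a follower
 *  actually requests entries from it.  Hypothetical name; not part of
 *  the Ratis API. */
class LazySegment {
  private final Supplier<List<String>> reader; // the expensive file read
  private List<String> entries;                // null until first use

  LazySegment(Supplier<List<String>> reader) {
    this.reader = reader;
  }

  boolean isLoaded() {
    return entries != null;
  }

  List<String> getEntries() {
    if (entries == null) {
      entries = reader.get(); // pay the read cost only when needed
    }
    return entries;
  }
}
```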

> Avoid loading cache with pre-snapshot entries for the group
> ---
>
> Key: RATIS-619
> URL: https://issues.apache.org/jira/browse/RATIS-619
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: ozone
> Attachments: RATIS-619.01.patch, RATIS-619.02.patch
>
>
> Even after taking a snapshot, the raft log loads all the segments in the log

[jira] [Commented] (RATIS-655) Change LeaderState and FollowerState and to use RaftGroupMemberId

2019-09-04 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922789#comment-16922789
 ] 

Tsz Wo Nicholas Sze commented on RATIS-655:
---

It is possible, but it is going to be another big patch since getId() is used a 
lot in the tests.

> Change LeaderState and FollowerState and  to use RaftGroupMemberId
> --
>
> Key: RATIS-655
> URL: https://issues.apache.org/jira/browse/RATIS-655
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r655_20190807.patch, r655_20190807b.patch, 
> r655_20190830.patch
>
>
> This is the last JIRA split from the huge patch in RATIS-605.





[jira] [Commented] (RATIS-619) Avoid loading cache with pre-snapshot entries for the group

2019-09-03 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921683#comment-16921683
 ] 

Tsz Wo Nicholas Sze commented on RATIS-619:
---

Why move away from the original idea of skipping the loading of entries before 
the snapshot?  That sounds even better.

Also, why is this a blocker?

> Avoid loading cache with pre-snapshot entries for the group
> ---
>
> Key: RATIS-619
> URL: https://issues.apache.org/jira/browse/RATIS-619
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Blocker
>  Labels: ozone
> Attachments: RATIS-619.01.patch, RATIS-619.02.patch
>
>
> Even after taking a snapshot, the raft log loads all the segments in the log
> HTTP server of HDDSDATANODE is listening at http://0.0.0.0:9882
> 2019-07-01 23:22:47,709 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7049 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_36931-43979
> 2019-07-01 23:22:47,732 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7141 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_43980-51120
> 2019-07-01 23:22:47,747 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7321 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230ab58084/current/log_51121-58441
> 2019-07-01 23:22:47,768 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 

[jira] [Commented] (RATIS-661) Add call in state machine to handle group removal

2019-09-03 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921628#comment-16921628
 ] 

Tsz Wo Nicholas Sze commented on RATIS-661:
---

> I mean an api which can tell that the group is in Closing or Closed state. ...

We should add such information to getGroupInfos.  We may as well change it to 
return a reply, instead of throwing a GroupMismatchException, when the group 
does not exist.
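The API shape being discussed, a getGroupInfos-style call that reports a missing group in its reply rather than by throwing, could look like this. All names here (GroupInfoSketch, GroupInfoReply, the state strings) are illustrative, not the actual Ratis API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a group-info lookup that never throws for a missing group; the
// reply carries an exists flag plus a lifecycle state (e.g. OPEN/CLOSING/CLOSED).
public class GroupInfoSketch {
  static final class GroupInfoReply {
    final String groupId;
    final boolean exists;
    final String state;      // null when the group does not exist
    GroupInfoReply(String groupId, boolean exists, String state) {
      this.groupId = groupId; this.exists = exists; this.state = state;
    }
  }

  private final Map<String, String> groupStates = new HashMap<>();

  void addGroup(String id, String state) { groupStates.put(id, state); }

  /** Never throws: a missing group is reported with exists=false. */
  GroupInfoReply getGroupInfo(String id) {
    String state = groupStates.get(id);
    return state == null ? new GroupInfoReply(id, false, null)
                         : new GroupInfoReply(id, true, state);
  }

  /** groupExists falls out of getGroupInfo for free. */
  boolean groupExists(String id) { return getGroupInfo(id).exists; }
}
```

The design point is that callers polling for Closing/Closed state no longer need exception handling as control flow.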

> Add call in state machine to handle group removal
> -
>
> Key: RATIS-661
> URL: https://issues.apache.org/jira/browse/RATIS-661
> Project: Ratis
>  Issue Type: New Feature
>  Components: API
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 0.4.0
>
> Attachments: RATIS-661.001.patch, RATIS-661.002.patch, 
> RATIS-661.003.patch, RATIS-661.004.patch, RATIS-661.005.patch
>
>
> Currently during RaftServerProxy#groupRemoveAsync there is no way for 
> stateMachine to know that the RaftGroup will be removed. This Jira aims to 
> add a call in the stateMachine to handle group removal.
> It also changes the logic of groupRemoval api to remove the RaftServerImpl 
> from the RaftServerProxy#impls map after the shutdown is complete. This is 
> required to synchronize the removal with the corresponding api of 
> RaftServer#getGroupIds. RaftServer#getGroupIds uses the RaftServerProxy#impls 
> map to get the groupIds.





[jira] [Commented] (RATIS-655) Change LeaderState and FollowerState to use RaftGroupMemberId

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919911#comment-16919911
 ] 

Tsz Wo Nicholas Sze commented on RATIS-655:
---

r655_20190830.patch: updated with trunk.

> Change LeaderState and FollowerState to use RaftGroupMemberId
> --
>
> Key: RATIS-655
> URL: https://issues.apache.org/jira/browse/RATIS-655
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r655_20190807.patch, r655_20190807b.patch, 
> r655_20190830.patch
>
>
> This is the last JIRA split from the huge patch in RATIS-605.





[jira] [Updated] (RATIS-655) Change LeaderState and FollowerState to use RaftGroupMemberId

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-655:
--
Attachment: r655_20190830.patch

> Change LeaderState and FollowerState to use RaftGroupMemberId
> --
>
> Key: RATIS-655
> URL: https://issues.apache.org/jira/browse/RATIS-655
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r655_20190807.patch, r655_20190807b.patch, 
> r655_20190830.patch
>
>
> This is the last JIRA split from the huge patch in RATIS-605.





[jira] [Commented] (RATIS-485) TimeoutScheduler is leaked by gRPC client implementation

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919897#comment-16919897
 ] 

Tsz Wo Nicholas Sze commented on RATIS-485:
---

Thanks [~elserj]!

I have committed this.

> TimeoutScheduler is leaked by gRPC client implementation
> 
>
> Key: RATIS-485
> URL: https://issues.apache.org/jira/browse/RATIS-485
> Project: Ratis
>  Issue Type: Bug
>  Components: examples
>Reporter: Clay B.
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: RATIS-485.003.patch, RATIS-485.004.patch, loadgen.log, 
> r485_20190827.patch, r485_20190828.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1,n1:127.0.0.1:2
> [...]
> 1/787867107@5e5792a0 with java.util.concurrent.CompletionException: 
> java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:417 - client-272A2E13A5DD: suggested new 
> leader: null. Failed 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:437 - client-272A2E13A5DD: change Leader 
> from n1 to n0
> 2019-02-14 07:47:22 DEBUG RaftClient:291 - schedule attempt #10740 with 
> policy RetryForeverNoSleep for 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:323 - client-272A2E13A5DD: send* 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:338 - client-272A2E13A5DD: Failed 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: 
> unable to create new native thread
> Exception in thread "main" java.util.concurrent.CompletionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$sendRequestAsync$14(RaftClientImpl.java:349)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>     at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:334)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetryAsync(RaftClientImpl.java:286)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:243)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:259)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$null$10(RaftClientImpl.java:293)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:85)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:104)
>     at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50)
>     at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> 

[jira] [Updated] (RATIS-659) StateMachineUpdater#stopAndJoin might not take snapshot due to race condition

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-659:
--
Component/s: server

> StateMachineUpdater#stopAndJoin might not take snapshot due to race condition
> -
>
> Key: RATIS-659
> URL: https://issues.apache.org/jira/browse/RATIS-659
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-659.001.patch
>
>
> StateMachineUpdater might not take snapshot during close. This might happen 
> if the StateMachineUpdater#stopAndJoin is called right after the snapshot 
> check in StateMachineUpdater:156-162.





[jira] [Updated] (RATIS-661) Add call in state machine to handle group removal

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-661:
--
Component/s: API
 Issue Type: New Feature  (was: Bug)

> Add call in state machine to handle group removal
> -
>
> Key: RATIS-661
> URL: https://issues.apache.org/jira/browse/RATIS-661
> Project: Ratis
>  Issue Type: New Feature
>  Components: API
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-661.001.patch, RATIS-661.002.patch, 
> RATIS-661.003.patch, RATIS-661.004.patch, RATIS-661.005.patch
>
>
> Currently during RaftServerProxy#groupRemoveAsync there is no way for 
> stateMachine to know that the RaftGroup will be removed. This Jira aims to 
> add a call in the stateMachine to handle group removal.
> It also changes the logic of groupRemoval api to remove the RaftServerImpl 
> from the RaftServerProxy#impls map after the shutdown is complete. This is 
> required to synchronize the removal with the corresponding api of 
> RaftServer#getGroupIds. RaftServer#getGroupIds uses the RaftServerProxy#impls 
> map to get the groupIds.





[jira] [Commented] (RATIS-661) Add call in state machine to handle group removal

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919851#comment-16919851
 ] 

Tsz Wo Nicholas Sze commented on RATIS-661:
---

+1 the 005 patch looks good.

> Add call in state machine to handle group removal
> -
>
> Key: RATIS-661
> URL: https://issues.apache.org/jira/browse/RATIS-661
> Project: Ratis
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-661.001.patch, RATIS-661.002.patch, 
> RATIS-661.003.patch, RATIS-661.004.patch, RATIS-661.005.patch
>
>
> Currently during RaftServerProxy#groupRemoveAsync there is no way for 
> stateMachine to know that the RaftGroup will be removed. This Jira aims to 
> add a call in the stateMachine to handle group removal.
> It also changes the logic of groupRemoval api to remove the RaftServerImpl 
> from the RaftServerProxy#impls map after the shutdown is complete. This is 
> required to synchronize the removal with the corresponding api of 
> RaftServer#getGroupIds. RaftServer#getGroupIds uses the RaftServerProxy#impls 
> map to get the groupIds.





[jira] [Commented] (RATIS-661) Add call in state machine to handle group removal

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919850#comment-16919850
 ] 

Tsz Wo Nicholas Sze commented on RATIS-661:
---

> Should we support an api called groupExists in RaftServer? ...

getGroupInfos covers groupExists, although it currently fails with a 
GroupMismatchException when the group does not exist.

> Add call in state machine to handle group removal
> -
>
> Key: RATIS-661
> URL: https://issues.apache.org/jira/browse/RATIS-661
> Project: Ratis
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-661.001.patch, RATIS-661.002.patch, 
> RATIS-661.003.patch, RATIS-661.004.patch, RATIS-661.005.patch
>
>
> Currently during RaftServerProxy#groupRemoveAsync there is no way for 
> stateMachine to know that the RaftGroup will be removed. This Jira aims to 
> add a call in the stateMachine to handle group removal.
> It also changes the logic of groupRemoval api to remove the RaftServerImpl 
> from the RaftServerProxy#impls map after the shutdown is complete. This is 
> required to synchronize the removal with the corresponding api of 
> RaftServer#getGroupIds. RaftServer#getGroupIds uses the RaftServerProxy#impls 
> map to get the groupIds.





[jira] [Commented] (RATIS-659) StateMachineUpdater#stopAndJoin might not take snapshot due to race condition

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919833#comment-16919833
 ] 

Tsz Wo Nicholas Sze commented on RATIS-659:
---

Good catch on the bug!

+1 patch looks good. 

> StateMachineUpdater#stopAndJoin might not take snapshot due to race condition
> -
>
> Key: RATIS-659
> URL: https://issues.apache.org/jira/browse/RATIS-659
> Project: Ratis
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-659.001.patch
>
>
> StateMachineUpdater might not take snapshot during close. This might happen 
> if the StateMachineUpdater#stopAndJoin is called right after the snapshot 
> check in StateMachineUpdater:156-162.
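The window described above can be illustrated with a small stand-alone sketch; none of these names are the actual Ratis classes. The key idea is that taking a snapshot on the stop path itself closes the race, even if stop is requested right after the periodic snapshot check:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative model of the updater loop: without the final snapshot on the
// stop path, a stop request arriving just after the periodic check could let
// the thread exit without capturing the last applied entries.
public class UpdaterSketch {
  private final AtomicBoolean stopRequested = new AtomicBoolean();
  int snapshotsTaken = 0;
  long appliedIndex = 0;

  void applyOnce() { appliedIndex++; }

  void takeSnapshot() { snapshotsTaken++; }

  /** Runs up to the given number of apply iterations, then handles stop. */
  void runUntilStopped(int iterations) {
    for (int i = 0; i < iterations && !stopRequested.get(); i++) {
      applyOnce();
      // a periodic snapshot check could go here; stop may arrive just after it
    }
    // fix: always take a final snapshot when stopping, so the applied state
    // is captured even when the periodic check was just missed
    takeSnapshot();
  }

  void stopAndJoin() { stopRequested.set(true); }
}
```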





[jira] [Commented] (RATIS-485) TimeoutScheduler is leaked by gRPC client implementation

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919832#comment-16919832
 ] 

Tsz Wo Nicholas Sze commented on RATIS-485:
---

+1 the changes look good.  Just some minor changes to the log message:

RATIS-485.004.patch

> TimeoutScheduler is leaked by gRPC client implementation
> 
>
> Key: RATIS-485
> URL: https://issues.apache.org/jira/browse/RATIS-485
> Project: Ratis
>  Issue Type: Bug
>  Components: examples
>Reporter: Clay B.
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: RATIS-485.003.patch, RATIS-485.004.patch, loadgen.log, 
> r485_20190827.patch, r485_20190828.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1,n1:127.0.0.1:2
> [...]
> 1/787867107@5e5792a0 with java.util.concurrent.CompletionException: 
> java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:417 - client-272A2E13A5DD: suggested new 
> leader: null. Failed 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:437 - client-272A2E13A5DD: change Leader 
> from n1 to n0
> 2019-02-14 07:47:22 DEBUG RaftClient:291 - schedule attempt #10740 with 
> policy RetryForeverNoSleep for 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:323 - client-272A2E13A5DD: send* 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:338 - client-272A2E13A5DD: Failed 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: 
> unable to create new native thread
> Exception in thread "main" java.util.concurrent.CompletionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$sendRequestAsync$14(RaftClientImpl.java:349)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>     at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:334)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetryAsync(RaftClientImpl.java:286)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:243)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:259)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$null$10(RaftClientImpl.java:293)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:85)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:104)
>     at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50)
>     at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> 

[jira] [Updated] (RATIS-485) TimeoutScheduler is leaked by gRPC client implementation

2019-08-30 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-485:
--
Attachment: RATIS-485.004.patch

> TimeoutScheduler is leaked by gRPC client implementation
> 
>
> Key: RATIS-485
> URL: https://issues.apache.org/jira/browse/RATIS-485
> Project: Ratis
>  Issue Type: Bug
>  Components: examples
>Reporter: Clay B.
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: RATIS-485.003.patch, RATIS-485.004.patch, loadgen.log, 
> r485_20190827.patch, r485_20190828.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1,n1:127.0.0.1:2
> [...]
> 1/787867107@5e5792a0 with java.util.concurrent.CompletionException: 
> java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:417 - client-272A2E13A5DD: suggested new 
> leader: null. Failed 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:437 - client-272A2E13A5DD: change Leader 
> from n1 to n0
> 2019-02-14 07:47:22 DEBUG RaftClient:291 - schedule attempt #10740 with 
> policy RetryForeverNoSleep for 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:323 - client-272A2E13A5DD: send* 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:338 - client-272A2E13A5DD: Failed 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: 
> unable to create new native thread
> Exception in thread "main" java.util.concurrent.CompletionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$sendRequestAsync$14(RaftClientImpl.java:349)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>     at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:334)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetryAsync(RaftClientImpl.java:286)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:243)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:259)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$null$10(RaftClientImpl.java:293)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:85)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:104)
>     at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50)
>     at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at 

[jira] [Commented] (RATIS-666) Coalesced heartbeat in multiraft

2019-08-29 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918976#comment-16918976
 ] 

Tsz Wo Nicholas Sze commented on RATIS-666:
---

If the connection is good, heartbeats in both pipeline 1 and pipeline 2 will 
succeed.  When the leader in pipeline 1 calls appendEntries, it will fail 
since the follower is bad.  When the leader in pipeline 2 calls appendEntries, 
it will succeed.

If there are no appendEntries calls in pipeline 1, the leader won't realize 
the follower is bad as long as all the previous calls were acknowledged.
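The distinction above, that a connection-level heartbeat can succeed while one group's appendEntries fails, suggests tracking per-group acknowledgements rather than relying on heartbeats alone. This is a toy sketch with made-up names, not a Ratis API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a follower is "suspect" for a given group when an appendEntries
// call has been outstanding longer than a timeout, regardless of whether
// coalesced connection-level heartbeats are still succeeding.
public class HealthSketch {
  private final Map<String, Long> lastAppendSent = new HashMap<>();
  private final Map<String, Long> lastAppendAcked = new HashMap<>();

  void appendSent(String group, long timeMillis) {
    lastAppendSent.put(group, timeMillis);
  }

  void appendAcked(String group, long timeMillis) {
    lastAppendAcked.put(group, timeMillis);
  }

  /** True iff the latest appendEntries for this group is unacknowledged
   *  and has been outstanding longer than timeoutMillis. */
  boolean groupSuspect(String group, long nowMillis, long timeoutMillis) {
    long sent = lastAppendSent.getOrDefault(group, -1L);
    long acked = lastAppendAcked.getOrDefault(group, -1L);
    return sent >= 0 && acked < sent && nowMillis - sent > timeoutMillis;
  }
}
```

In this model, pipeline 1 (sent but never acked) eventually turns suspect while pipeline 2 (acked) stays healthy, matching the scenario described above.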

> Coalesced heartbeat in multiraft
> 
>
> Key: RATIS-666
> URL: https://issues.apache.org/jira/browse/RATIS-666
> Project: Ratis
>  Issue Type: Improvement
>  Components: raft-group
>Reporter: Li Cheng
>Priority: Major
>
> I'm using this issue to discuss the coalesced heartbeat plan in multi-raft. 
> We are looking at incorporating multi-raft feature in ratis into Hadoop 
> Ozone. So in ozone, every datanode would be in multiple raft groups or say 
> pipelines with multi-raft, which brings:
>  # Is there any plan for coalesced heartbeat on single node? 
>  # Are we going to use gRPC to achieve coalesced heartbeat like what 
> cockroach does? Shall we assume only Java APIs are required?
>  # Whether or not we coalesce heartbeats, every node has a chance to be 
> elected leader in each raft group. So in the extreme case, one node, say 
> node A, could become the leader of all raft groups. With coalesced 
> heartbeats, it would be even easier for node A to become a performance 
> bottleneck. Any idea on how to avoid this extremity? Maybe do a candidate 
> scrub?
>  # How do we plan to test the 'single node, multiple raft groups' scenario? 
> Furthermore, if we make coalesced heartbeats configurable, how do we 
> determine when and whether to use them?
>  
> [~szetszwo] [~Sammi] [~xyao] [~waterlx]





[jira] [Commented] (RATIS-485) Load Generator OOMs if Ratis Unavailable

2019-08-28 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918141#comment-16918141
 ] 

Tsz Wo Nicholas Sze commented on RATIS-485:
---

r485_20190828.patch: cancels the previous shutdown task.

[~elserj], please feel free to combine it with your patch.
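The "cancel the previous shutdown task" idea can be sketched with plain java.util.concurrent types. IdleShutdownSketch and its method names are illustrative, not the Ratis TimeoutScheduler implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch: each time the scheduler is used again, any pending delayed-shutdown
// task is cancelled and rescheduled, so idle shutdown fires only after a full
// period of genuine inactivity instead of leaking or firing early.
public class IdleShutdownSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> pendingShutdown;
  volatile boolean shutDown = false;

  /** Records activity: cancel the previous shutdown task and schedule a new one. */
  synchronized void onUse(long idleMillis) {
    if (pendingShutdown != null) {
      pendingShutdown.cancel(false);   // cancel the previous shutdown task
    }
    pendingShutdown = scheduler.schedule(
        () -> shutDown = true, idleMillis, TimeUnit.MILLISECONDS);
  }

  void close() {
    scheduler.shutdownNow();
  }
}
```

Without the cancel call, each use would stack another shutdown task, which is one way a scheduler can leak under heavy retry traffic.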

> Load Generator OOMs if Ratis Unavailable
> 
>
> Key: RATIS-485
> URL: https://issues.apache.org/jira/browse/RATIS-485
> Project: Ratis
>  Issue Type: Bug
>  Components: examples
>Reporter: Clay B.
>Priority: Trivial
> Attachments: loadgen.log, r485_20190827.patch, r485_20190828.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
[jira] [Updated] (RATIS-485) Load Generator OOMs if Ratis Unavailable

2019-08-28 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-485:
--
Attachment: r485_20190828.patch

> Load Generator OOMs if Ratis Unavailable
> 
>
> Key: RATIS-485
> URL: https://issues.apache.org/jira/browse/RATIS-485
> Project: Ratis
>  Issue Type: Bug
>  Components: examples
>Reporter: Clay B.
>Priority: Trivial
> Attachments: loadgen.log, r485_20190827.patch, r485_20190828.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1,n1:127.0.0.1:2
> [...]
> 1/787867107@5e5792a0 with java.util.concurrent.CompletionException: 
> java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:417 - client-272A2E13A5DD: suggested new 
> leader: null. Failed 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:437 - client-272A2E13A5DD: change Leader 
> from n1 to n0
> 2019-02-14 07:47:22 DEBUG RaftClient:291 - schedule attempt #10740 with 
> policy RetryForeverNoSleep for 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:323 - client-272A2E13A5DD: send* 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:338 - client-272A2E13A5DD: Failed 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: 
> unable to create new native thread
> Exception in thread "main" java.util.concurrent.CompletionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$sendRequestAsync$14(RaftClientImpl.java:349)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>     at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:334)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetryAsync(RaftClientImpl.java:286)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:243)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:259)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$null$10(RaftClientImpl.java:293)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:85)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:104)
>     at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50)
>     at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
> 

[jira] [Commented] (RATIS-485) Load Generator OOMs if Ratis Unavailable

2019-08-28 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918128#comment-16918128
 ] 

Tsz Wo Nicholas Sze commented on RATIS-485:
---

[~elserj], thanks for digging out the bug.  I agree that we should not 
schedule another shutdown task when (1) a shutdown task already exists and 
(2) it is still valid.  When the previous shutdown task becomes invalid, we 
should cancel it.  Will work on a patch.
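To make the intent concrete, here is a minimal sketch of the idea (not the Ratis code; the class and method names below are made up for illustration): keep at most one valid pending shutdown task, skip scheduling while one is pending, and cancel it once it becomes invalid.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative only: at most one pending shutdown task at a time. */
class ShutdownTaskHolder {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> pendingShutdown;  // guarded by synchronized(this)

  /** Schedule a shutdown only if no valid task is already pending. */
  synchronized void scheduleShutdown(Runnable shutdown, long delayMs) {
    if (pendingShutdown != null && !pendingShutdown.isDone()) {
      return;  // (1) a task already exists and (2) it is still valid
    }
    pendingShutdown = scheduler.schedule(shutdown, delayMs, TimeUnit.MILLISECONDS);
  }

  /** Cancel the pending task once it becomes invalid (e.g. new activity arrived). */
  synchronized void cancelShutdown() {
    if (pendingShutdown != null) {
      pendingShutdown.cancel(false);
      pendingShutdown = null;
    }
  }

  synchronized boolean hasPendingShutdown() {
    return pendingShutdown != null && !pendingShutdown.isDone();
  }
}
```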



[jira] [Commented] (RATIS-666) Coalesced heartbeat in multiraft

2019-08-28 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917992#comment-16917992
 ] 

Tsz Wo Nicholas Sze commented on RATIS-666:
---

> ... which makes A very hard to tell which one of the 5 groups is so unhealthy 
> as to drag the communication.  ...

Such a thing should not happen -- a heartbeat tells whether the connection 
between two machines is good.  When one group is unhealthy but the connection 
is still good, the appendEntries to that group will fail but the heartbeats 
should succeed.

> Coalesced heartbeat in multiraft
> 
>
> Key: RATIS-666
> URL: https://issues.apache.org/jira/browse/RATIS-666
> Project: Ratis
>  Issue Type: Improvement
>  Components: raft-group
>Reporter: Li Cheng
>Priority: Major
>
> I'm using this issue to discuss the coalesced heartbeat plan in multi-raft. 
> We are looking at incorporating the multi-raft feature of Ratis into Hadoop 
> Ozone. In Ozone, every datanode would then be in multiple raft groups (i.e. 
> pipelines), which raises the following questions:
>  # Is there any plan for coalesced heartbeat on a single node? 
>  # Are we going to use gRPC to achieve coalesced heartbeat like what 
> CockroachDB does? Shall we assume only Java APIs are required?
>  # Whether or not we have coalesced heartbeat, every node has a chance to be 
> elected leader in each raft group. So in the extreme case, one node, 
> say node A, could be the leader of all raft groups. If we implement coalesced 
> heartbeat, that would more easily push node A to become the performance 
> bottleneck. Any idea on how to avoid this extreme case? Maybe do a candidate 
> scrub?
>  # How do we plan to test the 'single node, multiple raft groups' scenario? 
> Furthermore, if we make coalesced heartbeat configurable, how do we determine 
> when and whether to use it?
>  
> [~szetszwo] [~Sammi] [~xyao] [~waterlx]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (RATIS-661) Add call in state machine to handle group removal

2019-08-27 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917069#comment-16917069
 ] 

Tsz Wo Nicholas Sze commented on RATIS-661:
---

>  Since the impl is removed earlier, RaftServer#getGroupIds would not give the 
>corresponding groupId ...

When the group is being removed, it is correct for RaftServer#getGroupIds not 
to return that id.  The Ozone datanode could use notifyGroupRemove() to learn 
when the server impl is shut down.

If the group is not removed from the map at the beginning, new calls, 
including client requests and another groupRemoveAsync(..) call, can still 
arrive for it.  That would be a race condition.
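The difference can be sketched with a toy map (hypothetical names and types, not the actual RaftServerProxy code): remove(..) is atomic, so exactly one concurrent caller obtains the impl and proceeds with the shutdown; with get(..), two concurrent groupRemoveAsync(..) calls would both see the same impl.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative only: why group removal takes the impl out of the map first. */
class GroupMapSketch {
  private final ConcurrentMap<String, CompletableFuture<String>> impls =
      new ConcurrentHashMap<>();

  void addGroup(String groupId, String impl) {
    impls.put(groupId, CompletableFuture.completedFuture(impl));
  }

  CompletableFuture<String> groupRemoveAsync(String groupId) {
    // remove(..) is atomic: only one concurrent caller gets the non-null
    // future, so a second removal (or a late client request resolved through
    // the map) cannot race on the same impl.  With get(..) both would win.
    final CompletableFuture<String> f = impls.remove(groupId);
    if (f == null) {
      final CompletableFuture<String> failed = new CompletableFuture<>();
      failed.completeExceptionally(new IllegalStateException(groupId + " not found"));
      return failed;
    }
    return f.thenApply(impl -> {
      // notify the state machine, then shut the impl down (elided here)
      return "removed " + impl;
    });
  }
}
```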

 

> Add call in state machine to handle group removal
> -
>
> Key: RATIS-661
> URL: https://issues.apache.org/jira/browse/RATIS-661
> Project: Ratis
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-661.001.patch, RATIS-661.002.patch, 
> RATIS-661.003.patch, RATIS-661.004.patch
>
>
> Currently during RaftServerProxy#groupRemoveAsync there is no way for 
> stateMachine to know that the RaftGroup will be removed. This Jira aims to 
> add a call in the stateMachine to handle group removal.
> It also changes the logic of groupRemoval api to remove the RaftServerImpl 
> from the RaftServerProxy#impls map after the shutdown is complete. This is 
> required to synchronize the removal with the corresponding api of 
> RaftServer#getGroupIds. RaftServer#getGroupIds uses the RaftServerProxy#impls 
> map to get the groupIds.





[jira] [Commented] (RATIS-661) Add call in state machine to handle group removal

2019-08-27 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916965#comment-16916965
 ] 

Tsz Wo Nicholas Sze commented on RATIS-661:
---

[~ljain], thanks for working on this.
- Why change remove(..) to get(..) below?  It could cause a race condition 
when there are multiple groupRemoveAsync(..) calls.
{code}
 }
-final CompletableFuture<RaftServerImpl> f = impls.remove(groupId);
+final CompletableFuture<RaftServerImpl> f = impls.get(groupId);
 if (f == null) {
{code}
- Let's call the new method notifyGroupRemove() in StateMachine.
- Let's not change shutdown(..) since the groupRemoval parameter is always 
false except for groupRemoveAsync(..).  Just make the call there, as below.
{code}
@@ -403,6 +403,7 @@ public class RaftServerProxy implements RaftServer {
 }
 return f.thenApply(impl -> {
   final Collection<CommitInfoProto> commitInfos = impl.getCommitInfos();
+  impl.getStateMachine().notifyGroupRemove();
   impl.shutdown(deleteDirectory);
   return new RaftClientReply(request, commitInfos);
 });
{code}







[jira] [Assigned] (RATIS-543) Ratis GRPC client produces excessive logging while writing data.

2019-08-27 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze reassigned RATIS-543:
-

Assignee: Tsz Wo Nicholas Sze

> Ratis GRPC client produces excessive logging while writing data.
> 
>
> Key: RATIS-543
> URL: https://issues.apache.org/jira/browse/RATIS-543
> Project: Ratis
>  Issue Type: Bug
>  Components: gRPC
>Reporter: Aravindan Vijayan
>Assignee: Tsz Wo Nicholas Sze
>Priority: Blocker
>  Labels: ozone
> Attachments: r485_20190827.patch
>
>
> {code}
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1352, SUCCESS, logIndex=15195,
>  commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15201, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15189, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15186]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1355, SUCCESS, logIndex=15196,
>  commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15201, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15189, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15186]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1357, SUCCESS, logIndex=15197,
>  commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15201, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15189, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15186]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-C46A037579AA->5a076d87-abf9-4ade-ae37-adab741d99a6: receive 
> RaftClientReply:client-C46A037579AA->5a076d87-abf9-4ade-ae37-adab741d99a6@group-AE803AF42C5D,
>  cid=1370, SUCCESS, logIndex=0, com
> mits[5a076d87-abf9-4ade-ae37-adab741d99a6:c16423, 
> 6e21905d-9796-4248-834e-ed97ea6763ef:c16422, 
> 34e8d6e5-456f-4e2a-99a5-4f21fd9c4a7e:c16423]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-EBF618C3F968->a5729949-67f1-496e-a0d3-1bfc0e139836: receive 
> RaftClientReply:client-EBF618C3F968->a5729949-67f1-496e-a0d3-1bfc0e139836@group-4E41299EA191,
>  cid=1376, SUCCESS, logIndex=0, com
> mits[a5729949-67f1-496e-a0d3-1bfc0e139836:c4764, 
> 111d4c23-756f-4c8a-a48d-aa2a327a5179:c4764, 
> 287eccfb-8461-419a-8732-529d042380b3:c4764]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-4D5E3CDC8889->0bb45975-b0d2-499e-85cc-22ea22c57ecb: receive 
> RaftClientReply:client-4D5E3CDC8889->0bb45975-b0d2-499e-85cc-22ea22c57ecb@group-D1BB7F32F754,
>  cid=1382, FAILED org.apache.ratis.
> protocol.NotLeaderException: Server 0bb45975-b0d2-499e-85cc-22ea22c57ecb is 
> not the leader (f1a756c3-6b42-4ece-8093-dbcdac5f8d5b:10.17.200.18:9858). 
> Request must be sent to leader., logIndex=0, 
> commits[0bb45975-b0d2-499e-85cc-22ea22c57ecb:c15358, 6c7a
> 780f-5474-49da-b880-3eaf69d9d83d:c15358, 
> f1a756c3-6b42-4ece-8093-dbcdac5f8d5b:c15358]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1359, SUCCESS, logIndex=15208, 
> commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15210, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15201, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15189]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1362, SUCCESS, logIndex=15209, 
> commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15210, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15201, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15189]
> 19/05/03 10:23:31 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1363, SUCCESS, logIndex=15210, 
> commits[51711703-9f9d-4c79-bfb1-38726f0059da:c15210, 
> 0beac0f1-af74-43ac-ba73-0a92ecb9f0ae:c15201, 
> aaf673a3-95ac-43aa-8614-b1a324142430:c15189]
> 19/05/03 10:23:32 INFO client.GrpcClientProtocolClient: 
> client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da: receive 
> RaftClientReply:client-FD23551CACEE->51711703-9f9d-4c79-bfb1-38726f0059da@group-1EADCA052664,
>  cid=1371, SUCCESS, logIndex=15211, 
> 

[jira] [Updated] (RATIS-543) Ratis GRPC client produces excessive logging while writing data.

2019-08-27 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-543:
--
Component/s: gRPC

r543_20190827.patch: changes the log level of these messages to TRACE.


[jira] [Updated] (RATIS-543) Ratis GRPC client produces excessive logging while writing data.

2019-08-27 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-543:
--
Attachment: r485_20190827.patch


[jira] [Commented] (RATIS-485) Load Generator OOMs if Ratis Unavailable

2019-08-27 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916920#comment-16916920
 ] 

Tsz Wo Nicholas Sze commented on RATIS-485:
---

Is the test creating a lot of RaftClient(s)?  Each client has its own 
TimeoutScheduler, which may cause the OOM.  Let's make the scheduler static to 
see if that fixes the OOM: r485_20190827.patch
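A minimal sketch of the idea behind the patch (hypothetical names, not the actual Ratis `TimeoutScheduler` API): if every client carries its own scheduler thread, N clients mean N threads; a single static scheduler bounds the thread count no matter how many `RaftClient` instances the load generator creates.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one shared, static scheduler serves timeout tasks
// for all client instances, instead of one scheduler thread per client.
class SharedTimeoutScheduler {
  // A single daemon thread handles all scheduled timeouts.
  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "shared-timeout-scheduler");
        t.setDaemon(true);
        return t;
      });

  static ScheduledFuture<?> onTimeout(long delayMs, Runnable task) {
    return SCHEDULER.schedule(task, delayMs, TimeUnit.MILLISECONDS);
  }
}
```

Every retry in the `RetryForeverNoSleep` loop then reuses the same thread rather than spawning a new one, which is what eventually triggers "unable to create new native thread".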

> Load Generator OOMs if Ratis Unavailable
> 
>
> Key: RATIS-485
> URL: https://issues.apache.org/jira/browse/RATIS-485
> Project: Ratis
>  Issue Type: Bug
>  Components: examples
>Reporter: Clay B.
>Priority: Trivial
> Attachments: loadgen.log, r485_20190827.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1,n1:127.0.0.1:2
> [...]
> 1/787867107@5e5792a0 with java.util.concurrent.CompletionException: 
> java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:417 - client-272A2E13A5DD: suggested new 
> leader: null. Failed 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:437 - client-272A2E13A5DD: change Leader 
> from n1 to n0
> 2019-02-14 07:47:22 DEBUG RaftClient:291 - schedule attempt #10740 with 
> policy RetryForeverNoSleep for 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:323 - client-272A2E13A5DD: send* 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:338 - client-272A2E13A5DD: Failed 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: 
> unable to create new native thread
> Exception in thread "main" java.util.concurrent.CompletionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$sendRequestAsync$14(RaftClientImpl.java:349)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>     at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:334)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetryAsync(RaftClientImpl.java:286)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:243)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:259)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$null$10(RaftClientImpl.java:293)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:85)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:104)
>     at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50)
>     at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  

[jira] [Updated] (RATIS-485) Load Generator OOMs if Ratis Unavailable

2019-08-27 Thread Tsz Wo Nicholas Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-485:
--
Attachment: r485_20190827.patch

> Load Generator OOMs if Ratis Unavailable
> 
>
> Key: RATIS-485
> URL: https://issues.apache.org/jira/browse/RATIS-485
> Project: Ratis
>  Issue Type: Bug
>  Components: examples
>Reporter: Clay B.
>Priority: Trivial
> Attachments: loadgen.log, r485_20190827.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1,n1:127.0.0.1:2
> [...]
> 1/787867107@5e5792a0 with java.util.concurrent.CompletionException: 
> java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:417 - client-272A2E13A5DD: suggested new 
> leader: null. Failed 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:437 - client-272A2E13A5DD: change Leader 
> from n1 to n0
> 2019-02-14 07:47:22 DEBUG RaftClient:291 - schedule attempt #10740 with 
> policy RetryForeverNoSleep for 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:323 - client-272A2E13A5DD: send* 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:338 - client-272A2E13A5DD: Failed 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: 
> unable to create new native thread
> Exception in thread "main" java.util.concurrent.CompletionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$sendRequestAsync$14(RaftClientImpl.java:349)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>     at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:334)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetryAsync(RaftClientImpl.java:286)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:243)
>     at 
> org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:259)
>     at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$null$10(RaftClientImpl.java:293)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:85)
>     at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:104)
>     at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50)
>     at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at 

[jira] [Commented] (RATIS-619) Ratis server on restart loads all the entries for the group

2019-08-26 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916158#comment-16916158
 ] 

Tsz Wo Nicholas Sze commented on RATIS-619:
---

The patch does not seem to fix the problem: it still loads all the entries for the 
group.  It just avoids keeping the pre-snapshot entries in the cache.
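A sketch of what an actual fix might look like (hypothetical types, not the Ratis `LogSegment` API): on restart, any segment whose entries are all at or below the latest snapshot index is fully covered by the snapshot and can be skipped instead of being read and then discarded.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: select only the log segments that contain at least
// one entry newer than the snapshot, so covered segments are never read.
class SegmentFilter {
  // A segment covering the inclusive index range [startIndex, endIndex].
  record Segment(long startIndex, long endIndex) {}

  static List<Segment> segmentsToLoad(List<Segment> all, long snapshotIndex) {
    List<Segment> keep = new ArrayList<>();
    for (Segment s : all) {
      if (s.endIndex() > snapshotIndex) {
        keep.add(s);  // has entries past the snapshot; must be loaded
      }
    }
    return keep;
  }
}
```

With the last applied index at 15237039, segments such as log_0-7460 and log_7461-14846 in the log above would never be opened.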

> Ratis server on restart loads all the entries for the group
> ---
>
> Key: RATIS-619
> URL: https://issues.apache.org/jira/browse/RATIS-619
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Blocker
>  Labels: ozone
> Attachments: RATIS-619.01.patch, RATIS-619.02.patch
>
>
> Even after taking a snapshot, the raft log loads all the segment in the log
> {code}
> 2019-07-01 23:22:47,481 [pool-18-thread-1] INFO   - Setting the last 
> applied index to (t:2, i:15237039)
> {code}
> {code}
> 2019-07-01 23:22:47,516 INFO org.apache.ratis.server.RaftServerConfigKeys: 
> raft.server.log.statemachine.data.caching.enabled = true (custom)
> 2019-07-01 23:22:47,531 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 62941ca3-f244-4298-8497-f4c0bd57430a:group-4D230AB58084 set configuration 0: 
> [1f3d7936-cb4e-4b68-86ed-578070472dea:1
> 0.17.213.36:9858, 62941ca3-f244-4298-8497-f4c0bd57430a:10.17.213.35:9858, 
> f07c1f87-b377-40d9-8c56-4f1440c4fa77:10.17.213.37:9858], old=null at 0
> 2019-07-01 23:22:47,578 INFO org.apache.hadoop.http.HttpServer2: Jetty bound 
> to port 9882
> 2019-07-01 23:22:47,579 INFO org.eclipse.jetty.server.Server: 
> jetty-9.3.24.v20180605, build timestamp: 2018-06-05T10:11:56-07:00, git hash: 
> 84205aa28f11a4f31f2a3b86d1bba2cc8ab69827
> 2019-07-01 23:22:47,601 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7461 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_0-7460
> 2019-07-01 23:22:47,608 INFO org.eclipse.jetty.server.handler.ContextHandler: 
> Started 
> o.e.j.s.ServletContextHandler@6ce90bc5{/logs,file:///var/log/ozone/,AVAILABLE}
> 2019-07-01 23:22:47,608 INFO org.eclipse.jetty.server.handler.ContextHandler: 
> Started 
> o.e.j.s.ServletContextHandler@4b1c0397{/static,jar:file:/var/lib/hadoop-ozone/ozone-0.5.0-SNAPSHOT/share
> /ozone/lib/hadoop-hdds-container-service-0.5.0-SNAPSHOT.jar!/webapps/static,AVAILABLE}
> 2019-07-01 23:22:47,635 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7386 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_7461-14846
> 2019-07-01 23:22:47,663 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7440 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_14847-22286
> 2019-07-01 23:22:47,664 INFO org.eclipse.jetty.server.handler.ContextHandler: 
> Started 
> o.e.j.w.WebAppContext@8a62297{/,file:///tmp/jetty-0.0.0.0-9882-hddsDatanode-_-any-7539213566265642568.di
> r/webapp/,AVAILABLE}{/hddsDatanode}
> 2019-07-01 23:22:47,681 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7353 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_22287-29639
> 2019-07-01 23:22:47,695 INFO org.eclipse.jetty.server.AbstractConnector: 
> Started ServerConnector@5116ac09{HTTP/1.1,[http/1.1]}{0.0.0.0:9882}
> 2019-07-01 23:22:47,695 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7291 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_29640-36930
> 2019-07-01 23:22:47,695 INFO org.eclipse.jetty.server.Server: Started @56648ms
> 2019-07-01 23:22:47,695 INFO org.apache.hadoop.hdds.server.BaseHttpServer: 
> HTTP server of HDDSDATANODE is listening at http://0.0.0.0:9882
> 2019-07-01 23:22:47,709 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7049 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_36931-43979
> 2019-07-01 23:22:47,732 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7141 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230
> ab58084/current/log_43980-51120
> 2019-07-01 23:22:47,747 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 7321 
> entries from segment file 
> /data/1/ozone-0701/ratis/log/f7ddda32-45e0-4bec-a3e7-4d230ab58084/current/log_51121-58441
> 2019-07-01 23:22:47,768 INFO 
> org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 

[jira] [Commented] (RATIS-569) StatusRuntimeException on the datanode because clients do not shutdown the observer cleanly.

2019-08-26 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916150#comment-16916150
 ] 

Tsz Wo Nicholas Sze commented on RATIS-569:
---

> ... Working on a unit test.

Are you going to add a test?

> StatusRuntimeException on the datanode because clients do not shutdown the 
> observer cleanly.
> 
>
> Key: RATIS-569
> URL: https://issues.apache.org/jira/browse/RATIS-569
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Blocker
>  Labels: ozone
> Attachments: RATIS-569.01.patch
>
>
> Running TestDataValidate in Ozone leads to StatusRuntimeException on the 
> datanode frequently.
> This causes an unclean shutdown on the stream on the datanode.
> In GrpcClientProtocolClient, shutdownNow should be followed by a 
> awaitTermination to wait for a clean shutdown.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (RATIS-569) StatusRuntimeException on the datanode because clients do not shutdown the observer cleanly.

2019-08-26 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916148#comment-16916148
 ] 

Tsz Wo Nicholas Sze commented on RATIS-569:
---

Thanks [~swagle]. 
- The [gRPC example|https://github.com/grpc/grpc-java/blob/master/examples/src/main/java/io/grpc/examples/routeguide/RouteGuideClient.java#L63] uses a 5s timeout.  Let's do the same?
- Let's re-throw the exception, if there is any.
{code}
  public void close() throws InterruptedIOException {
try {
  channel.shutdownNow().awaitTermination(5, TimeUnit.SECONDS);
} catch (InterruptedException e) {
  throw IOUtils.toInterruptedIOException("Failed to close.", e);
}
  }
{code}

> StatusRuntimeException on the datanode because clients do not shutdown the 
> observer cleanly.
> 
>
> Key: RATIS-569
> URL: https://issues.apache.org/jira/browse/RATIS-569
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Blocker
>  Labels: ozone
> Attachments: RATIS-569.01.patch
>
>
> Running TestDataValidate in Ozone leads to StatusRuntimeException on the 
> datanode frequently.
> This causes an unclean shutdown on the stream on the datanode.
> In GrpcClientProtocolClient, shutdownNow should be followed by a 
> awaitTermination to wait for a clean shutdown.





[jira] [Commented] (RATIS-666) Coalesced heartbeat in multiraft

2019-08-26 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916074#comment-16916074
 ] 

Tsz Wo Nicholas Sze commented on RATIS-666:
---

> 4. How do we plan to test the 'single node, multi raft groups' scenario? 
> Furthermore, if we allow coalesced heartbeat configurable, how to determine 
> when and whether to use it?

For testing, we definitely should add unit tests as usual.  When the feature is 
enabled in Ozone, the Ozone performance tests will also exercise it.

Why make it configurable instead of always enabling it?  There seems to be no 
downside to enabling it.  No?


> Coalesced heartbeat in multiraft
> 
>
> Key: RATIS-666
> URL: https://issues.apache.org/jira/browse/RATIS-666
> Project: Ratis
>  Issue Type: Improvement
>  Components: raft-group
>Reporter: Li Cheng
>Priority: Major
>
> I'm using this issue to discuss the coalesced heartbeat plan in multi-raft. 
> We are looking at incorporating multi-raft feature in ratis into Hadoop 
> Ozone. So in ozone, every datanode would be in multiple raft groups or say 
> pipelines with multi-raft, which brings:
>  # Is there any plan for coalesced heartbeat on single node? 
>  # Are we going to use gRPC to achieve coalesced heartbeat like what 
> cockroach does? Shall we assume only Java APIs are required?
>  # Either we have coalesced heartbeat, every node would have chances to be 
> selected as leader in each raft group. So to the extreme extend, one node, 
> say node A, would be the leader to all raft groups. If we implement coalesced 
> heartbeat, there would more easily push node A to be the bottleneck for 
> future stumbling in performance. Any idea on how to avoid this extremity? 
> Maybe do a candidate scrub?
>  # How do we plan to test the 'single node, multi raft groups' scenario? 
> Furthermore, if we allow coalesced heartbeat configurable, how to determine 
> when and whether to use it?
>  
> [~szetszwo] [~Sammi] [~xyao] [~waterlx]





[jira] [Commented] (RATIS-666) Coalesced heartbeat in multiraft

2019-08-26 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916072#comment-16916072
 ] 

Tsz Wo Nicholas Sze commented on RATIS-666:
---

> 3. Either we have coalesced heartbeat, every node would have chances to be 
> selected as leader in each raft group. So to the extreme extend, one node, 
> say node A, would be the leader to all raft groups. ...

This seems like a group management issue.  The problem would happen 
with/without the heartbeat improvement.

> Coalesced heartbeat in multiraft
> 
>
> Key: RATIS-666
> URL: https://issues.apache.org/jira/browse/RATIS-666
> Project: Ratis
>  Issue Type: Improvement
>  Components: raft-group
>Reporter: Li Cheng
>Priority: Major
>
> I'm using this issue to discuss the coalesced heartbeat plan in multi-raft. 
> We are looking at incorporating multi-raft feature in ratis into Hadoop 
> Ozone. So in ozone, every datanode would be in multiple raft groups or say 
> pipelines with multi-raft, which brings:
>  # Is there any plan for coalesced heartbeat on single node? 
>  # Are we going to use gRPC to achieve coalesced heartbeat like what 
> cockroach does? Shall we assume only Java APIs are required?
>  # Either we have coalesced heartbeat, every node would have chances to be 
> selected as leader in each raft group. So to the extreme extend, one node, 
> say node A, would be the leader to all raft groups. If we implement coalesced 
> heartbeat, there would more easily push node A to be the bottleneck for 
> future stumbling in performance. Any idea on how to avoid this extremity? 
> Maybe do a candidate scrub?
>  # How do we plan to test the 'single node, multi raft groups' scenario? 
> Furthermore, if we allow coalesced heartbeat configurable, how to determine 
> when and whether to use it?
>  
> [~szetszwo] [~Sammi] [~xyao] [~waterlx]





[jira] [Commented] (RATIS-666) Coalesced heartbeat in multiraft

2019-08-26 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916067#comment-16916067
 ] 

Tsz Wo Nicholas Sze commented on RATIS-666:
---

> 2. Are we going to use gRPC to achieve coalesced heartbeat like what 
> cockroach does? Shall we assume only Java APIs are required?

As long as the RPC is asynchronous, we can combine the heartbeats as 
[mentioned|https://issues.apache.org/jira/browse/RATIS-666?focusedCommentId=16916064=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16916064].
  We should assume only Java APIs are required.

> Coalesced heartbeat in multiraft
> 
>
> Key: RATIS-666
> URL: https://issues.apache.org/jira/browse/RATIS-666
> Project: Ratis
>  Issue Type: Improvement
>  Components: raft-group
>Reporter: Li Cheng
>Priority: Major
>
> I'm using this issue to discuss the coalesced heartbeat plan in multi-raft. 
> We are looking at incorporating multi-raft feature in ratis into Hadoop 
> Ozone. So in ozone, every datanode would be in multiple raft groups or say 
> pipelines with multi-raft, which brings:
>  # Is there any plan for coalesced heartbeat on single node? 
>  # Are we going to use gRPC to achieve coalesced heartbeat like what 
> cockroach does? Shall we assume only Java APIs are required?
>  # Either we have coalesced heartbeat, every node would have chances to be 
> selected as leader in each raft group. So to the extreme extend, one node, 
> say node A, would be the leader to all raft groups. If we implement coalesced 
> heartbeat, there would more easily push node A to be the bottleneck for 
> future stumbling in performance. Any idea on how to avoid this extremity? 
> Maybe do a candidate scrub?
>  # How do we plan to test the 'single node, multi raft groups' scenario? 
> Furthermore, if we allow coalesced heartbeat configurable, how to determine 
> when and whether to use it?
>  
> [~szetszwo] [~Sammi] [~xyao] [~waterlx]





[jira] [Commented] (RATIS-666) Coalesced heartbeat in multiraft

2019-08-26 Thread Tsz Wo Nicholas Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916064#comment-16916064
 ] 

Tsz Wo Nicholas Sze commented on RATIS-666:
---

> 1. Is there any plan for coalesced heartbeat on single node? 

An easy improvement is to separate heartbeats from appendEntries calls:
- Move heartbeats from appendEntries in each group to a server-level heartbeat 
manager.
- When a server S0 wants to send heartbeats to another server S1, S0 
registers S1 in its heartbeat manager.
-- When there are multiple groups in S0 sending heartbeats to S1, the heartbeat 
manager sends only one set of heartbeats.
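The steps above can be sketched roughly as follows (hypothetical names, not a Ratis API): each group registers the peer it heartbeats, and the manager collapses all registrations for the same peer into a single message carrying every group id.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a server-level heartbeat manager on server S0:
// one coalesced heartbeat per peer, instead of one per (group, peer) pair.
class HeartbeatManager {
  // peerId -> set of group ids that want a heartbeat sent to that peer
  private final Map<String, Set<String>> registrations = new ConcurrentHashMap<>();

  void register(String peerId, String groupId) {
    registrations.computeIfAbsent(peerId, k -> ConcurrentHashMap.newKeySet())
        .add(groupId);
  }

  // Invoked on each heartbeat tick; returns the number of messages sent.
  int sendHeartbeats() {
    int messagesSent = 0;
    for (Map.Entry<String, Set<String>> e : registrations.entrySet()) {
      // send(e.getKey(), e.getValue()); // one RPC carrying all group ids
      messagesSent++;
    }
    return messagesSent;
  }
}
```

With, say, ten groups on S0 all led toward S1, the manager sends one heartbeat message per tick to S1 rather than ten.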

> Coalesced heartbeat in multiraft
> 
>
> Key: RATIS-666
> URL: https://issues.apache.org/jira/browse/RATIS-666
> Project: Ratis
>  Issue Type: Improvement
>  Components: raft-group
>Reporter: Li Cheng
>Priority: Major
>
> I'm using this issue to discuss the coalesced heartbeat plan in multi-raft. 
> We are looking at incorporating multi-raft feature in ratis into Hadoop 
> Ozone. So in ozone, every datanode would be in multiple raft groups or say 
> pipelines with multi-raft, which brings:
>  # Is there any plan for coalesced heartbeat on single node? 
>  # Are we going to use gRPC to achieve coalesced heartbeat like what 
> cockroach does? Shall we assume only Java APIs are required?
>  # Either we have coalesced heartbeat, every node would have chances to be 
> selected as leader in each raft group. So to the extreme extend, one node, 
> say node A, would be the leader to all raft groups. If we implement coalesced 
> heartbeat, there would more easily push node A to be the bottleneck for 
> future stumbling in performance. Any idea on how to avoid this extremity? 
> Maybe do a candidate scrub?
>  # How do we plan to test the 'single node, multi raft groups' scenario? 
> Furthermore, if we allow coalesced heartbeat configurable, how to determine 
> when and whether to use it?
>  
> [~szetszwo] [~Sammi] [~xyao] [~waterlx]





[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-12 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905570#comment-16905570
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

+1 the v2 patch looks good.

> -1 whitespace  0m 0s  The patch has 5 line(s) with tabs.

Will fix the tabs when committing the patch.

> Fix generation LICENSE and NOTICE for third-party dependencies
> --
>
> Key: RATIS-654
> URL: https://issues.apache.org/jira/browse/RATIS-654
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Attachments: RATIS-654.patch, RATIS-654_v1.patch, RATIS-654_v2.patch
>
>
> Details on licenses, what can be bundled and what can't be as per apache:-
> http://www.apache.org/legal/resolved.html
> Below is the guide on how a dev should be assembling LICENSE and NOTICE:
> http://www.apache.org/dev/licensing-howto.html
> We need to include LICENSE and NOTICE for transitive dependencies as well
> http://www.apache.org/dev/licensing-howto.html#deps-of-deps
> The supplemental model[s1] of maven can help in supplementing missing 
> information of LICENSE and NOTICE in the third-party dependencies in our 
> bundled LICENSE and NOTICE
> [1] 
> https://maven.apache.org/plugins/maven-remote-resources-plugin/supplemental-models.html
> Here, I have copied the resource-bundle created by HBase, so that we don't 
> need to re-write the whole logic of generating LICENSE and NOTICE in the Apache way.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-12 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905490#comment-16905490
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

- ratis-assembly/pom.xml
{code}
+org.codehaus.mojo
+exec-maven-plugin
+1.3.1
{code}
The version can be omitted since it is already specified in the parent pom.
- pom.xml
{code}
@@ -165,7 +170,7 @@
 -->
 0.12
 1.9
-1.3.1
+1.6.0
 3.0.0
 1.0-alpha-8
 1.0
{code}
How about we update the version to the latest as above?

> Fix generation LICENSE and NOTICE for third-party dependencies
> --
>
> Key: RATIS-654
> URL: https://issues.apache.org/jira/browse/RATIS-654
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Attachments: RATIS-654.patch, RATIS-654_v1.patch
>
>
> Details on licenses, what can be bundled and what can't be as per apache:-
> http://www.apache.org/legal/resolved.html
> Below is the guide on how a dev should be assembling LICENSE and NOTICE:
> http://www.apache.org/dev/licensing-howto.html
> We need to include LICENSE and NOTICE for transitive dependencies as well
> http://www.apache.org/dev/licensing-howto.html#deps-of-deps
> The supplemental model[s1] of maven can help in supplementing missing 
> information of LICENSE and NOTICE in the third-party dependencies in our 
> bundled LICENSE and NOTICE
> [1] 
> https://maven.apache.org/plugins/maven-remote-resources-plugin/supplemental-models.html
> Here, I have copied the resource-bundle created by HBase, so that we don't 
> need to re-write the whole logic of generating LICENSE and NOTICE in the Apache way.





[jira] [Commented] (RATIS-655) Change LeaderState and FollowerState and to use RaftGroupMemberId

2019-08-08 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903155#comment-16903155
 ] 

Tsz Wo Nicholas Sze commented on RATIS-655:
---

[~msingh], thanks for taking a look.  The test failures and NPE are not 
related.  We have RATIS-645 for TestRaftStateMachineExceptionWithSimulatedRpc.  
The other two NPE tests are in logservice.

> Change LeaderState and FollowerState and  to use RaftGroupMemberId
> --
>
> Key: RATIS-655
> URL: https://issues.apache.org/jira/browse/RATIS-655
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r655_20190807.patch, r655_20190807b.patch
>
>
> This is the last JIRA split from the huge patch in RATIS-605.





[jira] [Reopened] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-08-08 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze reopened RATIS-645:
---

RATIS-578 did not completely fix the NPE.  Reopening this.

> testRetryOnExceptionDuringReplication may throw NPE.
> 
>
> Key: RATIS-645
> URL: https://issues.apache.org/jira/browse/RATIS-645
> Project: Ratis
>  Issue Type: Bug
>  Components: test
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Minor
>
> RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication has 
> failed in Jenkins with NullPointerException; for example 
> https://builds.apache.org/job/PreCommit-RATIS-Build/915/testReport/org.apache.ratis.server.simulation/TestRaftStateMachineExceptionWithSimulatedRpc/testRetryOnExceptionDuringReplication/
> Just able to reproduce it:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.ratis.server.impl.RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication(RaftStateMachineExceptionTests.java:172)
> {code}





[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902499#comment-16902499
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

[~an...@apache.org], I forgot to mention that the patch does not apply to 
master.  Please update your local git.

> Fix generation LICENSE and NOTICE for third-party dependencies
> --
>
> Key: RATIS-654
> URL: https://issues.apache.org/jira/browse/RATIS-654
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Attachments: RATIS-654.patch
>
>
> Details on licenses, what can be bundled and what can't be as per apache:-
> http://www.apache.org/legal/resolved.html
> Below is the guide on how a dev should be assembling LICENSE and NOTICE:
> http://www.apache.org/dev/licensing-howto.html
> We need to include LICENSE and NOTICE for transitive dependencies as well
> http://www.apache.org/dev/licensing-howto.html#deps-of-deps
> The supplemental model[s1] of maven can help in supplementing missing 
> information of LICENSE and NOTICE in the third-party dependencies in our 
> bundled LICENSE and NOTICE
> [1] 
> https://maven.apache.org/plugins/maven-remote-resources-plugin/supplemental-models.html
> Here, I have copied the resource-bundle created by HBase , so that we don't 
> need to re-write whole logic of generating LICENSE and NOTICE in apache way.





[jira] [Commented] (RATIS-655) Change LeaderState and FollowerState and to use RaftGroupMemberId

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902486#comment-16902486
 ] 

Tsz Wo Nicholas Sze commented on RATIS-655:
---

r655_20190807b.patch: fixes a compilation error.



[jira] [Updated] (RATIS-655) Change LeaderState and FollowerState and to use RaftGroupMemberId

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-655:
--
Attachment: r655_20190807b.patch



[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902480#comment-16902480
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

Some other comments on the patch:
- The patch uses exec-maven-plugin 1.4.  We are already using exec-maven-plugin 
1.3.1 in root/pom.xml.  Let's update the version from 1.3.1 to 1.4.
- The indentation is off in some of the pom.xml changes.



[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902474#comment-16902474
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

If the Hadoop libraries are not included in ratis-examples*.jar, the drawback 
is that the examples won't run with Hadoop RPC unless users install the 
libraries themselves.  That seems fine.



[jira] [Updated] (RATIS-655) Change LeaderState and FollowerState and to use RaftGroupMemberId

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-655:
--
Attachment: r655_20190807.patch

> Change LeaderState and FollowerState and  to use RaftGroupMemberId
> --
>
> Key: RATIS-655
> URL: https://issues.apache.org/jira/browse/RATIS-655
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r655_20190807.patch
>
>
> This is the last JIRA split from the huge patch in RATIS-605.





[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902391#comment-16902391
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

Another way is to include our ratis-hadoop binary but not the dependencies from 
ratis-hadoop.  In this way, users have to install the ratis-hadoop dependencies 
themselves when they use ratis-hadoop.
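
One conventional way to ship the ratis-hadoop jar itself while leaving its Hadoop dependency tree out of the binary tarball is a maven-assembly-plugin dependencySet with include/exclude patterns. A hypothetical excerpt (the group/artifact patterns are illustrative, not taken from the actual Ratis assembly descriptor):

```xml
<!-- Hypothetical assembly-descriptor excerpt: bundle ratis-hadoop itself
     but exclude its transitive Hadoop dependencies from the tarball. -->
<dependencySet>
  <includes>
    <include>org.apache.ratis:ratis-hadoop</include>
  </includes>
  <excludes>
    <exclude>org.apache.hadoop:*</exclude>
  </excludes>
</dependencySet>
```

Users would then supply the excluded Hadoop artifacts on their own classpath when enabling Hadoop RPC.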



[jira] [Created] (RATIS-655) Change LeaderState and FollowerState and to use RaftGroupMemberId

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)
Tsz Wo Nicholas Sze created RATIS-655:
-

 Summary: Change LeaderState and FollowerState and  to use 
RaftGroupMemberId
 Key: RATIS-655
 URL: https://issues.apache.org/jira/browse/RATIS-655
 Project: Ratis
  Issue Type: Improvement
  Components: server
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze


This is the last JIRA split from the huge patch in RATIS-605.





[jira] [Commented] (RATIS-609) Change RaftLog to use RaftGroupMemberId

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902362#comment-16902362
 ] 

Tsz Wo Nicholas Sze commented on RATIS-609:
---

There is one more to go; see RATIS-655.

> Change RaftLog to use RaftGroupMemberId
> ---
>
> Key: RATIS-609
> URL: https://issues.apache.org/jira/browse/RATIS-609
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Minor
>  Labels: ozone
> Attachments: r609_20190708.patch, r609_20190718.patch, 
> r609_20190718b.patch
>
>
> This an effort to reduce the patch size in RATIS-605.  The RaftLog related 
> change for RaftGroupMemberId will be done here.





[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-07 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902335#comment-16902335
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

> you mean we do source release of ratis-hadoop module but not the binary 
> release to avoid unnecessary dependencies?

Yes, we should do a binary release without ratis-hadoop.  If there is a need, 
we may do another bin release with it (or we may release ratis-hadoop 
separately).



[jira] [Commented] (RATIS-635) Add an API to get the min replicated logIndex for a raftGroup in raftServer

2019-08-06 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901513#comment-16901513
 ] 

Tsz Wo Nicholas Sze commented on RATIS-635:
---

The static helper function would look like the one-line method below.  
Honestly, it does not seem very useful.
{code:java}
  static Long min(Collection<CommitInfoProto> commitInfos) {
    return commitInfos.stream().map(CommitInfoProto::getCommitIndex)
        .min(Long::compareTo).orElse(null);
  }
{code}
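
For illustration, the helper above can be exercised as follows. CommitInfo here is a simplified stand-in for the protobuf-generated CommitInfoProto (the real class comes from raft.proto); only the getCommitIndex() accessor is assumed.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class CommitIndexUtil {
  // Simplified stand-in for CommitInfoProto; illustrative only.
  static final class CommitInfo {
    private final long commitIndex;
    CommitInfo(long commitIndex) { this.commitIndex = commitIndex; }
    long getCommitIndex() { return commitIndex; }
  }

  // Min replicated index = smallest commit index reported by any server
  // in the group; null when no commit info is available.
  static Long min(Collection<CommitInfo> commitInfos) {
    return commitInfos.stream().map(CommitInfo::getCommitIndex)
        .min(Long::compareTo).orElse(null);
  }

  public static void main(String[] args) {
    List<CommitInfo> infos = Arrays.asList(
        new CommitInfo(7), new CommitInfo(3), new CommitInfo(5));
    System.out.println(min(infos));                      // prints 3
    System.out.println(min(Collections.emptyList()));    // prints null
  }
}
```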

> Add an API to get the min replicated logIndex for a raftGroup in raftServer
> ---
>
> Key: RATIS-635
> URL: https://issues.apache.org/jira/browse/RATIS-635
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.4.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: ozone
> Fix For: 0.4.0
>
> Attachments: RATIS-635.000.patch
>
>
> This feature is required by Ozone(HDDS-1753) to figure the min replicated 
> index across all servers of a RaftGroup.





[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-06 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901472#comment-16901472
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

Hi [~an...@apache.org], I really appreciate your work on this.

> ...  copied the resource-bundle created by HBase , ...

HBase is a much bigger project than Ratis.  I wonder if we would include a lot 
of unnecessary dependencies.

BTW, we should make Ratis not depend on Hadoop (or at least make it an option) since
# Hadoop has a huge dependency tree.  It unnecessarily taxes the users who are 
not using ratis-hadoop (i.e. Hadoop RPC).
# The ratis-hadoop module is not well maintained; even the unit tests may fail.



[jira] [Commented] (RATIS-654) Fix generation LICENSE and NOTICE for third-party dependencies

2019-08-06 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901474#comment-16901474
 ] 

Tsz Wo Nicholas Sze commented on RATIS-654:
---

[~elserj], would you like to take a look at the patch?



[jira] [Commented] (RATIS-653) Fix LICENSE and NOTICE files for release

2019-08-06 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901414#comment-16901414
 ] 

Tsz Wo Nicholas Sze commented on RATIS-653:
---

Thanks [~an...@apache.org], will check RATIS-654 first.

> Fix LICENSE and NOTICE files for release
> 
>
> Key: RATIS-653
> URL: https://issues.apache.org/jira/browse/RATIS-653
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Fix For: 0.4.0
>
> Attachments: LICENSE, NOTICE, RATIS-653.patch, RATIS-653_v1.patch
>
>






[jira] [Commented] (RATIS-653) Fix LICENSE and NOTICE files for release

2019-08-05 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900397#comment-16900397
 ] 

Tsz Wo Nicholas Sze commented on RATIS-653:
---

[~an...@apache.org], thanks for working on this.  Could you describe how you 
came up with the patch?  It would be very useful for everyone.

> Fix LICENSE and NOTICE files for release
> 
>
> Key: RATIS-653
> URL: https://issues.apache.org/jira/browse/RATIS-653
> Project: Ratis
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.3.0
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Fix For: 0.4.0
>
> Attachments: LICENSE, NOTICE, RATIS-653.patch
>
>






[jira] [Commented] (RATIS-635) Add an API to get the min replicated logIndex for a raftGroup in raftServer

2019-08-02 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899181#comment-16899181
 ] 

Tsz Wo Nicholas Sze commented on RATIS-635:
---

[~shashikant], thanks for working on this.
- We already have a getGroupInfo(..) method in 
AdminProtocol/AdminAsynchronousProtocol.  The GroupInfoReply has all the 
commit infos.  Could we make it work for local calls (it may already be 
working) so that Ozone can use it?




[jira] [Resolved] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-08-02 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze resolved RATIS-645.
---
Resolution: Duplicate

Resolving this as a duplicate of RATIS-578.



[jira] [Assigned] (RATIS-578) Illegal State transition in LeaderElection

2019-08-02 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze reassigned RATIS-578:
-

Resolution: Fixed
  Assignee: Tsz Wo Nicholas Sze  (was: Siddharth Wagle)

Thanks [~shashikant] for reviewing the patch.

I have committed this.

> Illegal State transition in LeaderElection
> --
>
> Key: RATIS-578
> URL: https://issues.apache.org/jira/browse/RATIS-578
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>  Labels: ozone
> Fix For: 0.4.0
>
> Attachments: r578_20190731.patch
>
>
> Illegal State transition in LeaderElection
> {code}
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 3d75e29e-ff2a-47a6-82c4-6408d200876d:group-CB73AD2587F6:LeaderElection13, 
> STARTING -> CLOSED
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 3d75e29e-ff2a-47a6-82c4-6408d200876d:group-CB73AD2587F6:LeaderElection13, 
> CLOSED -> RUNNING
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 37da83b0-33ff-44cf-aeb9-67a102e13468:group-9FC4313E1696:LeaderElection217, 
> RUNNING -> CLOSED
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 95ef0599-6d8a-40f8-a69c-7ba0c956dc6c:group-21734B88A322:LeaderElection265, 
> RUNNING -> CLOSED
> {code}
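
For context, the exceptions above come from a guarded lifecycle state machine that rejects transitions not in its table. A minimal sketch of that pattern follows; the state names and the transition table here are illustrative only, not Ratis's actual LifeCycle definition:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class LifeCycleSketch {
  // Illustrative states; the real implementation defines its own set.
  enum State { NEW, STARTING, RUNNING, CLOSING, CLOSED }

  // Table of legal transitions: anything absent is an ILLEGAL TRANSITION.
  static final Map<State, Set<State>> LEGAL = new EnumMap<>(State.class);
  static {
    LEGAL.put(State.NEW, EnumSet.of(State.STARTING));
    LEGAL.put(State.STARTING, EnumSet.of(State.RUNNING));
    LEGAL.put(State.RUNNING, EnumSet.of(State.CLOSING));
    LEGAL.put(State.CLOSING, EnumSet.of(State.CLOSED));
    LEGAL.put(State.CLOSED, EnumSet.noneOf(State.class));
  }

  /** Throws, like the log messages above, when the transition is not legal. */
  static State transition(State from, State to) {
    if (!LEGAL.get(from).contains(to)) {
      throw new IllegalStateException("ILLEGAL TRANSITION: " + from + " -> " + to);
    }
    return to;
  }

  public static void main(String[] args) {
    State s = transition(State.NEW, State.STARTING);  // legal
    s = transition(s, State.RUNNING);                 // legal
    try {
      transition(State.STARTING, State.CLOSED);       // illegal in this table
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```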





[jira] [Commented] (RATIS-578) Illegal State transition in LeaderElection

2019-08-01 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898366#comment-16898366
 ] 

Tsz Wo Nicholas Sze commented on RATIS-578:
---

The failed test is not related.

> Illegal State transition in LeaderElection
> --
>
> Key: RATIS-578
> URL: https://issues.apache.org/jira/browse/RATIS-578
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: ozone
> Fix For: 0.4.0
>
> Attachments: r578_20190731.patch
>
>
> Illegal State transition in LeaderElection
> {code}
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 3d75e29e-ff2a-47a6-82c4-6408d200876d:group-CB73AD2587F6:LeaderElection13, 
> STARTING -> CLOSED
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 3d75e29e-ff2a-47a6-82c4-6408d200876d:group-CB73AD2587F6:LeaderElection13, 
> CLOSED -> RUNNING
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 37da83b0-33ff-44cf-aeb9-67a102e13468:group-9FC4313E1696:LeaderElection217, 
> RUNNING -> CLOSED
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 95ef0599-6d8a-40f8-a69c-7ba0c956dc6c:group-21734B88A322:LeaderElection265, 
> RUNNING -> CLOSED
> {code}





[jira] [Commented] (RATIS-642) Bump the copyright year of the NOTICE.txt

2019-08-01 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898322#comment-16898322
 ] 

Tsz Wo Nicholas Sze commented on RATIS-642:
---

+1 the 003 patch looks good.

> Bump the copyright year of the NOTICE.txt
> -
>
> Key: RATIS-642
> URL: https://issues.apache.org/jira/browse/RATIS-642
> Project: Ratis
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 0.4.0
>
> Attachments: RATIS-642.001.patch, RATIS-642.002.patch, 
> RATIS-642.003.patch
>
>
> Update ratis notice to reflect latest timestamps i.e. to 2019.
> Thanks [~jghoman] for noticing this during 0.4.0 rc0 vote.





[jira] [Commented] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897495#comment-16897495
 ] 

Tsz Wo Nicholas Sze commented on RATIS-645:
---

Will fix this in RATIS-578.



[jira] [Updated] (RATIS-578) Illegal State transition in LeaderElection

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-578:
--
Attachment: r578_20190731.patch



[jira] [Updated] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-645:
--
Priority: Minor  (was: Major)



[jira] [Updated] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-645:
--
Component/s: test

> testRetryOnExceptionDuringReplication may throw NPE.
> 
>
> Key: RATIS-645
> URL: https://issues.apache.org/jira/browse/RATIS-645
> Project: Ratis
>  Issue Type: Bug
>  Components: test
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>
> RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication has 
> failed in Jenkins with NullPointerException; for example 
> https://builds.apache.org/job/PreCommit-RATIS-Build/915/testReport/org.apache.ratis.server.simulation/TestRaftStateMachineExceptionWithSimulatedRpc/testRetryOnExceptionDuringReplication/
> Just able to reproduce it:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.ratis.server.impl.RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication(RaftStateMachineExceptionTests.java:172)
> {code}





[jira] [Assigned] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze reassigned RATIS-645:
-

Assignee: Tsz Wo Nicholas Sze

> testRetryOnExceptionDuringReplication may throw NPE.
> 
>
> Key: RATIS-645
> URL: https://issues.apache.org/jira/browse/RATIS-645
> Project: Ratis
>  Issue Type: Bug
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>
> RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication has 
> failed in Jenkins with NullPointerException; for example 
> https://builds.apache.org/job/PreCommit-RATIS-Build/915/testReport/org.apache.ratis.server.simulation/TestRaftStateMachineExceptionWithSimulatedRpc/testRetryOnExceptionDuringReplication/
> Just able to reproduce it:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.ratis.server.impl.RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication(RaftStateMachineExceptionTests.java:172)
> {code}





[jira] [Updated] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-645:
--
Description: 
RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication has failed 
in Jenkins with NullPointerException; for example 
https://builds.apache.org/job/PreCommit-RATIS-Build/915/testReport/org.apache.ratis.server.simulation/TestRaftStateMachineExceptionWithSimulatedRpc/testRetryOnExceptionDuringReplication/

Just was able to reproduce it:
{code}
java.lang.NullPointerException
at 
org.apache.ratis.server.impl.RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication(RaftStateMachineExceptionTests.java:172)
{code}


  was:
RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication has failed 
in Jenkins with NullPointerException; for example 
https://builds.apache.org/job/PreCommit-RATIS-Build/915/testReport/org.apache.ratis.server.simulation/TestRaftStateMachineExceptionWithSimulatedRpc/testRetryOnExceptionDuringReplication/

Unfortunately, the stack trace is missing and I cannot reproduce it locally.



> testRetryOnExceptionDuringReplication may throw NPE.
> 
>
> Key: RATIS-645
> URL: https://issues.apache.org/jira/browse/RATIS-645
> Project: Ratis
>  Issue Type: Bug
>Reporter: Tsz Wo Nicholas Sze
>Priority: Major
>
> RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication has 
> failed in Jenkins with NullPointerException; for example 
> https://builds.apache.org/job/PreCommit-RATIS-Build/915/testReport/org.apache.ratis.server.simulation/TestRaftStateMachineExceptionWithSimulatedRpc/testRetryOnExceptionDuringReplication/
> Just was able to reproduce it:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.ratis.server.impl.RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication(RaftStateMachineExceptionTests.java:172)
> {code}





[jira] [Commented] (RATIS-578) Illegal State transition in LeaderElection

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897424#comment-16897424
 ] 

Tsz Wo Nicholas Sze commented on RATIS-578:
---

Just found that the failure can be reproduced by running 
TestRaftStateMachineExceptionWithSimulatedRpc.testRetryOnExceptionDuringReplication
 multiple times.
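
The observation above is the standard way to surface a rare race: re-run the failing test body in a tight loop until the failure appears. Below is a minimal, self-contained Java sketch of that approach; the helper name runRepeatedly and the simulated failing body are illustrative assumptions, not part of the actual Ratis test code.

```java
import java.util.concurrent.Callable;

public class RepeatRunner {
  /** Runs the given test body up to {@code times} times, stopping at the first failure. */
  static int runRepeatedly(int times, Callable<Void> testBody) {
    for (int i = 1; i <= times; i++) {
      try {
        testBody.call();
      } catch (Exception e) {
        System.out.println("Failed on iteration " + i + ": " + e);
        return i;  // the iteration that reproduced the failure
      }
    }
    return -1;  // never failed within the given number of runs
  }

  public static void main(String[] args) {
    // A stand-in test body that fails on its 3rd run, mimicking a flaky test.
    final int[] count = {0};
    int failedAt = runRepeatedly(10, () -> {
      if (++count[0] == 3) {
        throw new NullPointerException("simulated flaky failure");
      }
      return null;
    });
    System.out.println("reproduced at iteration " + failedAt);
  }
}
```

In a real JUnit setup the same effect is usually achieved with a repeated-test annotation or a shell loop around the single test method, rather than a hand-rolled runner.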

> Illegal State transition in LeaderElection
> --
>
> Key: RATIS-578
> URL: https://issues.apache.org/jira/browse/RATIS-578
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: ozone
> Fix For: 0.4.0
>
>
> Illegal State transition in LeaderElection
> {code}
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 3d75e29e-ff2a-47a6-82c4-6408d200876d:group-CB73AD2587F6:LeaderElection13, 
> STARTING -> CLOSED
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 3d75e29e-ff2a-47a6-82c4-6408d200876d:group-CB73AD2587F6:LeaderElection13, 
> CLOSED -> RUNNING
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 37da83b0-33ff-44cf-aeb9-67a102e13468:group-9FC4313E1696:LeaderElection217, 
> RUNNING -> CLOSED
> java.lang.IllegalStateException: ILLEGAL TRANSITION: In 
> 95ef0599-6d8a-40f8-a69c-7ba0c956dc6c:group-21734B88A322:LeaderElection265, 
> RUNNING -> CLOSED
> {code}
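
Errors like the ones quoted above come from a lifecycle guard that validates every state change against a table of allowed transitions. The following is a minimal illustrative sketch of such a guard; the enum values mirror the log, but the class and its transition table are assumptions for illustration, not Ratis's actual LifeCycle implementation.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class LifeCycleSketch {
  enum State { NEW, STARTING, RUNNING, CLOSING, CLOSED }

  // Allowed transitions per state (an assumed table for illustration).
  static final Map<State, EnumSet<State>> VALID = new EnumMap<>(State.class);
  static {
    VALID.put(State.NEW, EnumSet.of(State.STARTING));
    VALID.put(State.STARTING, EnumSet.of(State.RUNNING, State.CLOSING));
    VALID.put(State.RUNNING, EnumSet.of(State.CLOSING));
    VALID.put(State.CLOSING, EnumSet.of(State.CLOSED));
    VALID.put(State.CLOSED, EnumSet.noneOf(State.class));
  }

  private final String name;
  private State current = State.NEW;

  LifeCycleSketch(String name) { this.name = name; }

  /** Moves to the given state, or throws if the transition is not in the table. */
  synchronized void transition(State to) {
    if (!VALID.get(current).contains(to)) {
      throw new IllegalStateException(
          "ILLEGAL TRANSITION: In " + name + ", " + current + " -> " + to);
    }
    current = to;
  }

  public static void main(String[] args) {
    LifeCycleSketch election = new LifeCycleSketch("LeaderElection13");
    election.transition(State.STARTING);
    try {
      election.transition(State.CLOSED);  // STARTING -> CLOSED is rejected here
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The bug report then amounts to the election thread attempting a state change (e.g. STARTING -> CLOSED) that the table does not permit, typically because of a race with shutdown.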





[jira] [Created] (RATIS-645) testRetryOnExceptionDuringReplication may throw NPE.

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)
Tsz Wo Nicholas Sze created RATIS-645:
-

 Summary: testRetryOnExceptionDuringReplication may throw NPE.
 Key: RATIS-645
 URL: https://issues.apache.org/jira/browse/RATIS-645
 Project: Ratis
  Issue Type: Bug
Reporter: Tsz Wo Nicholas Sze


RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication has failed 
in Jenkins with NullPointerException; for example 
https://builds.apache.org/job/PreCommit-RATIS-Build/915/testReport/org.apache.ratis.server.simulation/TestRaftStateMachineExceptionWithSimulatedRpc/testRetryOnExceptionDuringReplication/

Unfortunately, the stack trace is missing and I cannot reproduce it locally.






[jira] [Commented] (RATIS-641) Change RaftPeerId to extend RaftId

2019-07-31 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897417#comment-16897417
 ] 

Tsz Wo Nicholas Sze commented on RATIS-641:
---

The failed tests do not seem related.

> Change RaftPeerId to extend RaftId
> --
>
> Key: RATIS-641
> URL: https://issues.apache.org/jira/browse/RATIS-641
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r641_20190726.patch, r641_20190729.patch, 
> r641_20190730.patch
>
>
> RaftPeerId currently does not extend RaftId, so it is inconsistent with 
> ClientId and RaftGroupId.
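
The proposed change gives RaftPeerId the same base class as the other identifier types. Here is a minimal illustrative sketch of that hierarchy; the class names match Ratis, but the UUID field and methods are assumptions for illustration (the real classes wrap their ids differently).

```java
import java.util.Objects;
import java.util.UUID;

public class RaftIdSketch {
  /** Common base so all identifier types share one representation. */
  static abstract class RaftId {
    private final UUID uuid;
    RaftId(UUID uuid) { this.uuid = Objects.requireNonNull(uuid); }
    UUID getUuid() { return uuid; }
    @Override public String toString() {
      return getClass().getSimpleName() + ":" + uuid;
    }
  }

  static final class ClientId extends RaftId {
    ClientId(UUID uuid) { super(uuid); }
  }

  static final class RaftGroupId extends RaftId {
    RaftGroupId(UUID uuid) { super(uuid); }
  }

  // After the change, RaftPeerId extends RaftId like the other two.
  static final class RaftPeerId extends RaftId {
    RaftPeerId(UUID uuid) { super(uuid); }
  }

  public static void main(String[] args) {
    RaftId peer = new RaftPeerId(
        UUID.fromString("3d75e29e-ff2a-47a6-82c4-6408d200876d"));
    System.out.println(peer);
  }
}
```

With a shared base class, equality, hashing, and serialization logic can live in one place instead of being duplicated per identifier type.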





[jira] [Commented] (RATIS-641) Change RaftPeerId to extend RaftId

2019-07-30 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896523#comment-16896523
 ] 

Tsz Wo Nicholas Sze commented on RATIS-641:
---

r641_20190730.patch: fixes a bug in RaftId.

> Change RaftPeerId to extend RaftId
> --
>
> Key: RATIS-641
> URL: https://issues.apache.org/jira/browse/RATIS-641
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r641_20190726.patch, r641_20190729.patch, 
> r641_20190730.patch
>
>
> RaftPeerId currently does not extend RaftId, so it is inconsistent with 
> ClientId and RaftGroupId.





[jira] [Updated] (RATIS-641) Change RaftPeerId to extend RaftId

2019-07-30 Thread Tsz Wo Nicholas Sze (JIRA)


 [ 
https://issues.apache.org/jira/browse/RATIS-641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated RATIS-641:
--
Attachment: r641_20190730.patch

> Change RaftPeerId to extend RaftId
> --
>
> Key: RATIS-641
> URL: https://issues.apache.org/jira/browse/RATIS-641
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r641_20190726.patch, r641_20190729.patch, 
> r641_20190730.patch
>
>
> RaftPeerId currently does not extend RaftId, so it is inconsistent with 
> ClientId and RaftGroupId.





[jira] [Commented] (RATIS-641) Change RaftPeerId to extend RaftId

2019-07-29 Thread Tsz Wo Nicholas Sze (JIRA)


[ 
https://issues.apache.org/jira/browse/RATIS-641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895621#comment-16895621
 ] 

Tsz Wo Nicholas Sze commented on RATIS-641:
---

r641_20190729.patch: fixes unit test failures.

> Change RaftPeerId to extend RaftId
> --
>
> Key: RATIS-641
> URL: https://issues.apache.org/jira/browse/RATIS-641
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
> Attachments: r641_20190726.patch, r641_20190729.patch
>
>
> RaftPeerId currently does not extend RaftId, so it is inconsistent with 
> ClientId and RaftGroupId.




