[jira] [Updated] (RATIS-1011) Define internal streaming APIs

2020-07-30 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-1011:
--
Description: 
Similar to ratis rpc, ratis streaming should define a set of internal APIs in 
order to support pluggable implementations.

The APIs must support asynchronous, event-driven processing.
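
A minimal sketch of what such an asynchronous, event-driven internal API could 
look like; the interface and method names below are illustrative assumptions, 
not the committed RATIS-1011 API:
{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

/** A pluggable streaming endpoint: every call returns immediately and
 *  completion is signalled through a future, so callers never block. */
interface DataStreamApi {
  /** Asynchronously send one chunk of stream data. */
  CompletableFuture<Long> streamAsync(ByteBuffer data);

  /** Asynchronously close the stream after all chunks are sent. */
  CompletableFuture<Void> closeAsync();
}
{code}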

  was:Similar to ratis rpc, ratis streaming should define a set of internal 
APIs in order to support pluggable implementations.


> Define internal streaming APIs
> --
>
> Key: RATIS-1011
> URL: https://issues.apache.org/jira/browse/RATIS-1011
> Project: Ratis
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tsz-wo Sze
>Assignee: Ansh Khanna
>Priority: Major
>
> Similar to ratis rpc, ratis streaming should define a set of internal APIs in 
> order to support pluggable implementations.
> The APIs must support asynchronous, event-driven processing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-1012) Implement ratis streaming using netty

2020-07-30 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-1012:
-

 Summary: Implement ratis streaming using netty
 Key: RATIS-1012
 URL: https://issues.apache.org/jira/browse/RATIS-1012
 Project: Ratis
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tsz-wo Sze
Assignee: Ansh Khanna


Since we are getting good results from RATIS-1009, we will continue to work on 
the first ratis streaming implementation using netty with zero buffer copying.
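
For illustration, two common Netty idioms for avoiding buffer copies (a sketch 
only, assuming the usual io.netty.buffer classes; not the planned 
implementation):
{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class ZeroCopyDemo {
  public static void main(String[] args) {
    // Wrap an existing ByteBuffer: no bytes are copied.
    ByteBuffer nioBuf = ByteBuffer.wrap("payload".getBytes(StandardCharsets.UTF_8));
    ByteBuf payload = Unpooled.wrappedBuffer(nioBuf);

    // Compose header + payload into one logical buffer, again without copying.
    ByteBuf header = Unpooled.wrappedBuffer(new byte[] {0, 1, 2, 3});
    CompositeByteBuf frame = Unpooled.compositeBuffer();
    frame.addComponents(true, header, payload);

    System.out.println("frame bytes = " + frame.readableBytes());  // 11
    frame.release();  // also releases the two components
  }
}
{code}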



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-1011) Define internal streaming APIs

2020-07-30 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-1011:
-

 Summary: Define internal streaming APIs
 Key: RATIS-1011
 URL: https://issues.apache.org/jira/browse/RATIS-1011
 Project: Ratis
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tsz-wo Sze
Assignee: Ansh Khanna


Similar to ratis rpc, ratis streaming should define a set of internal APIs in 
order to support pluggable implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (RATIS-1009) A simple benchmark achieving zero-copy semantics using Netty.

2020-07-30 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-1009.
---
Fix Version/s: 1.1.0
   Resolution: Fixed

I have merged the pull request.  Thanks, Ansh!

> A simple benchmark achieving zero-copy semantics using Netty.
> -
>
> Key: RATIS-1009
> URL: https://issues.apache.org/jira/browse/RATIS-1009
> Project: Ratis
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Ansh Khanna
>Assignee: Ansh Khanna
>Priority: Major
> Fix For: 1.1.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Patch: [https://github.com/apache/incubator-ratis/pull/155]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-1009) A simple benchmark achieving zero-copy semantics using Netty.

2020-07-30 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-1009:
--
Component/s: Streaming

> A simple benchmark achieving zero-copy semantics using Netty.
> -
>
> Key: RATIS-1009
> URL: https://issues.apache.org/jira/browse/RATIS-1009
> Project: Ratis
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Ansh Khanna
>Assignee: Ansh Khanna
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Patch: [https://github.com/apache/incubator-ratis/pull/155]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-1004) Fix deadlock between grpc-default-executor and SegmentedRaftLogWorker

2020-07-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-1004:
--
Component/s: server

[~yjxxtd], thanks for filing this bug.

Do you know which two locks are causing the deadlock?

BTW, I have noticed that the line numbers shown in the stack traces are 
different from trunk.  Could you see if the deadlock is reproducible on trunk?

> Fix deadlock between grpc-default-executor and SegmentedRaftLogWorker
> -
>
> Key: RATIS-1004
> URL: https://issues.apache.org/jira/browse/RATIS-1004
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Attachments: jstack-deadlock-2.txt, screenshot-1.png
>
>
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-979) Ratis streaming

2020-07-17 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-979:
-
Description: In this JIRA, we design and implement Ratis Streaming with 
zero buffer copying and asynchronous, event-driven processing.  (was: According to the 
[FlatBuffers white 
paper|https://google.github.io/flatbuffers/flatbuffers_white_paper.html],
{quote}
You define your object types in a schema, which can then be compiled to C++ or 
Java for low to zero overhead reading & writing. ...
{quote}
In this JIRA, we investigate if it can be used to achieve zero buffer copying 
in Ratis Streaming.)

> Ratis streaming
> ---
>
> Key: RATIS-979
> URL: https://issues.apache.org/jira/browse/RATIS-979
> Project: Ratis
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Tsz-wo Sze
>Assignee: Ansh Khanna
>Priority: Major
>
> In this JIRA, we design and implement Ratis Streaming with zero buffer 
> copying and asynchronous, event-driven processing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (RATIS-997) Benchmarking Flatbuffers and Protobuffers for GRPC streaming

2020-07-17 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-997.
--
Fix Version/s: 0.6.0
   Resolution: Fixed

I have just merged the pull request.  Thanks, [~ansh.khanna]!

> Benchmarking Flatbuffers and Protobuffers for GRPC streaming
> 
>
> Key: RATIS-997
> URL: https://issues.apache.org/jira/browse/RATIS-997
> Project: Ratis
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Ansh Khanna
>Assignee: Ansh Khanna
>Priority: Major
> Fix For: 0.6.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to the [FlatBuffers white 
> paper|https://google.github.io/flatbuffers/flatbuffers_white_paper.html],
> {quote}
> You define your object types in a schema, which can then be compiled to C++ 
> or Java for low to zero overhead reading & writing. ...
> {quote}
> In this JIRA, we investigate if it can be used to achieve zero buffer copying 
> in Ratis Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-997) Benchmarking Flatbuffers and Protobuffers for GRPC streaming

2020-07-17 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-997:
-
Description: 
According to the [FlatBuffers white 
paper|https://google.github.io/flatbuffers/flatbuffers_white_paper.html],
{quote}
You define your object types in a schema, which can then be compiled to C++ or 
Java for low to zero overhead reading & writing. ...
{quote}
In this JIRA, we investigate if it can be used to achieve zero buffer copying 
in Ratis Streaming.

> Benchmarking Flatbuffers and Protobuffers for GRPC streaming
> 
>
> Key: RATIS-997
> URL: https://issues.apache.org/jira/browse/RATIS-997
> Project: Ratis
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Ansh Khanna
>Assignee: Ansh Khanna
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to the [FlatBuffers white 
> paper|https://google.github.io/flatbuffers/flatbuffers_white_paper.html],
> {quote}
> You define your object types in a schema, which can then be compiled to C++ 
> or Java for low to zero overhead reading & writing. ...
> {quote}
> In this JIRA, we investigate if it can be used to achieve zero buffer copying 
> in Ratis Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-997) Benchmarking Flatbuffers and Protobuffers for GRPC streaming

2020-07-17 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-997:
-
Component/s: Streaming

> Benchmarking Flatbuffers and Protobuffers for GRPC streaming
> 
>
> Key: RATIS-997
> URL: https://issues.apache.org/jira/browse/RATIS-997
> Project: Ratis
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Ansh Khanna
>Assignee: Ansh Khanna
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-979) Ratis streaming

2020-07-17 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-979:
-
Summary: Ratis streaming  (was: FlatBuffers streaming)

> Ratis streaming
> ---
>
> Key: RATIS-979
> URL: https://issues.apache.org/jira/browse/RATIS-979
> Project: Ratis
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Tsz-wo Sze
>Assignee: Ansh Khanna
>Priority: Major
>
> According to the [FlatBuffers white 
> paper|https://google.github.io/flatbuffers/flatbuffers_white_paper.html],
> {quote}
> You define your object types in a schema, which can then be compiled to C++ 
> or Java for low to zero overhead reading & writing. ...
> {quote}
> In this JIRA, we investigate if it can be used to achieve zero buffer copying 
> in Ratis Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-990) Update flatbuffers version

2020-06-29 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148300#comment-17148300
 ] 

Tsz-wo Sze commented on RATIS-990:
--

Thanks [~msingh].

[~ansh.khanna], could you see if builder.createString(..) works in our case?
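
For reference, a small sketch of the two createString overloads in 
FlatBufferBuilder; the ByteBuffer overload takes already UTF-8 encoded bytes, 
which is what matters for avoiding an extra copy.  (Illustrative only; it uses 
the unshaded com.google.flatbuffers package rather than the ratis-thirdparty 
relocation.)
{code:java}
import com.google.flatbuffers.FlatBufferBuilder;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class CreateStringDemo {
  public static void main(String[] args) {
    final FlatBufferBuilder builder = new FlatBufferBuilder(1024);

    // createString(CharSequence) encodes and copies the characters.
    final int copied = builder.createString("hello");

    // createString(ByteBuffer) accepts an already UTF-8 encoded buffer,
    // so the caller keeps control of the bytes.
    final ByteBuffer utf8 =
        ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
    final int fromBuffer = builder.createString(utf8);

    System.out.println("offsets: " + copied + ", " + fromBuffer);
  }
}
{code}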

> Update flatbuffers version
> --
>
> Key: RATIS-990
> URL: https://issues.apache.org/jira/browse/RATIS-990
> Project: Ratis
>  Issue Type: Improvement
>  Components: thirdparty
>Reporter: Tsz-wo Sze
>Assignee: Ansh Khanna
>Priority: Major
>
> https://github.com/apache/incubator-ratis-thirdparty/blob/38b1c0c4201ec0856aed3230fd16ba26cb929e57/pom.xml#L76-L77
> {code}
> <flatbuffers.version>1.11.0</flatbuffers.version>
> {code}
> Currently, the flatbuffers version in ratis-thirdparty is 1.11.0.  However, 
> 1.11.0 does not seem to support the ByteBuffer-based methods, so it cannot 
> achieve zero buffer copying.  We should update it to 1.12.0 or above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-990) Update flatbuffers version

2020-06-29 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-990:
-
Description: 
https://github.com/apache/incubator-ratis-thirdparty/blob/38b1c0c4201ec0856aed3230fd16ba26cb929e57/pom.xml#L76-L77
{code}
<flatbuffers.version>1.11.0</flatbuffers.version>
{code}
Currently, the flatbuffers version in ratis-thirdparty is 1.11.0.  However, 
1.11.0 does not seem to support the ByteBuffer-based methods, so it cannot 
achieve zero buffer copying.  We should update it to 1.12.0 or above.

> Update flatbuffers version
> --
>
> Key: RATIS-990
> URL: https://issues.apache.org/jira/browse/RATIS-990
> Project: Ratis
>  Issue Type: Improvement
>  Components: thirdparty
>Reporter: Tsz-wo Sze
>Assignee: Ansh Khanna
>Priority: Major
>
> https://github.com/apache/incubator-ratis-thirdparty/blob/38b1c0c4201ec0856aed3230fd16ba26cb929e57/pom.xml#L76-L77
> {code}
> <flatbuffers.version>1.11.0</flatbuffers.version>
> {code}
> Currently, the flatbuffers version in ratis-thirdparty is 1.11.0.  However, 
> 1.11.0 does not seem to support the ByteBuffer-based methods, so it cannot 
> achieve zero buffer copying.  We should update it to 1.12.0 or above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-990) Update flatbuffers version

2020-06-29 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-990:


 Summary: Update flatbuffers version
 Key: RATIS-990
 URL: https://issues.apache.org/jira/browse/RATIS-990
 Project: Ratis
  Issue Type: Improvement
  Components: thirdparty
Reporter: Tsz-wo Sze
Assignee: Ansh Khanna






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (RATIS-980) Fix leader election happens too fast

2020-06-18 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-980.
--
Fix Version/s: 0.6.0
   Resolution: Fixed

> Fix leader election happens too fast
> 
>
> Key: RATIS-980
> URL: https://issues.apache.org/jira/browse/RATIS-980
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Fix For: 0.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> 2020-06-18T00:46:43.5446126Z 2020-06-18 00:46:43,543 [Thread-12] INFO  
> impl.FollowerState (FollowerState.java:run(117)) - 
> s2@group-3E7C5CE5BBB6-FollowerState was interrupted: 
> java.lang.InterruptedException: sleep interrupted
> 2020-06-18T00:46:43.6287348Z 2020-06-18 00:46:43,624 [nioEventLoopGroup-5-1] 
> DEBUG impl.RaftServerImpl (RaftServerImpl.java:requestVote(841)) - 
> s1@group-3E7C5CE5BBB6 replies to vote request: s0<-s1#0:OK-t1. Peer's state: 
> s1@group-3E7C5CE5BBB6:t1, leader=null, voted=s0, 
> raftlog=s1@group-3E7C5CE5BBB6-SegmentedRaftLog:OPENED:c-1,f-1,i0, conf=-1: 
> [s0:0.0.0.0:33663, s1:0.0.0.0:42355, s2:0.0.0.0:43021], old=null
> 2020-06-18T00:46:43.6302903Z 2020-06-18 00:46:43,625 [nioEventLoopGroup-9-1] 
> DEBUG impl.RaftServerImpl (RaftServerImpl.java:requestVote(841)) - 
> s2@group-3E7C5CE5BBB6 replies to vote request: s0<-s2#0:OK-t1. Peer's state: 
> s2@group-3E7C5CE5BBB6:t1, leader=null, voted=s0, 
> raftlog=s2@group-3E7C5CE5BBB6-SegmentedRaftLog:OPENED:c-1,f-1,i0, conf=-1: 
> [s0:0.0.0.0:33663, s1:0.0.0.0:42355, s2:0.0.0.0:43021], old=null
> 2020-06-18T00:46:43.6400885Z 2020-06-18 00:46:43,635 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.LeaderElection 
> (LeaderElection.java:logAndReturn(61)) - 
> s0@group-3E7C5CE5BBB6-LeaderElection1: Election PASSED; received 1 
> response(s) [s0<-s1#0:OK-t1] and 0 exception(s); s0@group-3E7C5CE5BBB6:t1, 
> leader=null, voted=s0, 
> raftlog=s0@group-3E7C5CE5BBB6-SegmentedRaftLog:OPENED:c-1,f-1,i0, conf=-1: 
> [s0:0.0.0.0:33663, s1:0.0.0.0:42355, s2:0.0.0.0:43021], old=null
> 2020-06-18T00:46:43.6401898Z 2020-06-18 00:46:43,636 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.RoleInfo 
> (RoleInfo.java:shutdownLeaderElection(134)) - s0: shutdown LeaderElection
> 2020-06-18T00:46:43.6402754Z 2020-06-18 00:46:43,636 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(174)) - s0@group-3E7C5CE5BBB6: changes role from 
> CANDIDATE to LEADER at term 1 for changeToLeader
> {color:#DE350B}2020-06-18T00:46:43.6403983Z 2020-06-18 00:46:43,636 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(255)) - s0@group-3E7C5CE5BBB6: change Leader from 
> null to s0 at term 1 for becomeLeader, leader elected after 618ms{color}
> 2020-06-18T00:46:43.6404833Z 2020-06-18 00:46:43,639 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.staging.catchup.gap = 1000 (default)
> 2020-06-18T00:46:43.6416295Z 2020-06-18 00:46:43,639 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.rpc.sleep.time = 25ms (default)
> 2020-06-18T00:46:43.6440194Z 2020-06-18 00:46:43,643 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  metrics.RatisMetrics 
> (RatisMetrics.java:lambda$create$0(39)) - Creating Metrics Registry : 
> ratis.log_appender.s0@group-3E7C5CE5BBB6
> 2020-06-18T00:46:43.6442875Z 2020-06-18 00:46:43,643 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] WARN  impl.MetricRegistriesImpl 
> (MetricRegistriesImpl.java:lambda$create$1(61)) - First MetricRegistry has 
> been created without registering reporters. You may need to call 
> MetricRegistries.global().addReportRegistration(...) before.
> 2020-06-18T00:46:43.6500584Z 2020-06-18 00:46:43,646 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.write.element-limit = 4096 (default)
> 2020-06-18T00:46:43.6509832Z 2020-06-18 00:46:43,650 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.write.byte-limit = 64MB (=67108864) 
> (default)
> 2020-06-18T00:46:43.6641705Z 2020-06-18 00:46:43,656 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.watch.timeout = 10s (default)
> 2020-06-18T00:46:43.6648171Z 2020-06-18 00:46:43,658 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.watch.timeout.denomination = 1s 
> (default)
> 2020-06-18T00:46:43.6651041Z 2020-06-18 00:46:43,658 
> 

[jira] [Updated] (RATIS-980) Fix leader election happens too fast

2020-06-18 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-980:
-
Component/s: server

Thanks, [~yjxxtd]!  I have just merged the pull request.

> Fix leader election happens too fast
> 
>
> Key: RATIS-980
> URL: https://issues.apache.org/jira/browse/RATIS-980
> Project: Ratis
>  Issue Type: Improvement
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> 2020-06-18T00:46:43.5446126Z 2020-06-18 00:46:43,543 [Thread-12] INFO  
> impl.FollowerState (FollowerState.java:run(117)) - 
> s2@group-3E7C5CE5BBB6-FollowerState was interrupted: 
> java.lang.InterruptedException: sleep interrupted
> 2020-06-18T00:46:43.6287348Z 2020-06-18 00:46:43,624 [nioEventLoopGroup-5-1] 
> DEBUG impl.RaftServerImpl (RaftServerImpl.java:requestVote(841)) - 
> s1@group-3E7C5CE5BBB6 replies to vote request: s0<-s1#0:OK-t1. Peer's state: 
> s1@group-3E7C5CE5BBB6:t1, leader=null, voted=s0, 
> raftlog=s1@group-3E7C5CE5BBB6-SegmentedRaftLog:OPENED:c-1,f-1,i0, conf=-1: 
> [s0:0.0.0.0:33663, s1:0.0.0.0:42355, s2:0.0.0.0:43021], old=null
> 2020-06-18T00:46:43.6302903Z 2020-06-18 00:46:43,625 [nioEventLoopGroup-9-1] 
> DEBUG impl.RaftServerImpl (RaftServerImpl.java:requestVote(841)) - 
> s2@group-3E7C5CE5BBB6 replies to vote request: s0<-s2#0:OK-t1. Peer's state: 
> s2@group-3E7C5CE5BBB6:t1, leader=null, voted=s0, 
> raftlog=s2@group-3E7C5CE5BBB6-SegmentedRaftLog:OPENED:c-1,f-1,i0, conf=-1: 
> [s0:0.0.0.0:33663, s1:0.0.0.0:42355, s2:0.0.0.0:43021], old=null
> 2020-06-18T00:46:43.6400885Z 2020-06-18 00:46:43,635 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.LeaderElection 
> (LeaderElection.java:logAndReturn(61)) - 
> s0@group-3E7C5CE5BBB6-LeaderElection1: Election PASSED; received 1 
> response(s) [s0<-s1#0:OK-t1] and 0 exception(s); s0@group-3E7C5CE5BBB6:t1, 
> leader=null, voted=s0, 
> raftlog=s0@group-3E7C5CE5BBB6-SegmentedRaftLog:OPENED:c-1,f-1,i0, conf=-1: 
> [s0:0.0.0.0:33663, s1:0.0.0.0:42355, s2:0.0.0.0:43021], old=null
> 2020-06-18T00:46:43.6401898Z 2020-06-18 00:46:43,636 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.RoleInfo 
> (RoleInfo.java:shutdownLeaderElection(134)) - s0: shutdown LeaderElection
> 2020-06-18T00:46:43.6402754Z 2020-06-18 00:46:43,636 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:setRole(174)) - s0@group-3E7C5CE5BBB6: changes role from 
> CANDIDATE to LEADER at term 1 for changeToLeader
> {color:#DE350B}2020-06-18T00:46:43.6403983Z 2020-06-18 00:46:43,636 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  impl.RaftServerImpl 
> (ServerState.java:setLeader(255)) - s0@group-3E7C5CE5BBB6: change Leader from 
> null to s0 at term 1 for becomeLeader, leader elected after 618ms{color}
> 2020-06-18T00:46:43.6404833Z 2020-06-18 00:46:43,639 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.staging.catchup.gap = 1000 (default)
> 2020-06-18T00:46:43.6416295Z 2020-06-18 00:46:43,639 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.rpc.sleep.time = 25ms (default)
> 2020-06-18T00:46:43.6440194Z 2020-06-18 00:46:43,643 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  metrics.RatisMetrics 
> (RatisMetrics.java:lambda$create$0(39)) - Creating Metrics Registry : 
> ratis.log_appender.s0@group-3E7C5CE5BBB6
> 2020-06-18T00:46:43.6442875Z 2020-06-18 00:46:43,643 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] WARN  impl.MetricRegistriesImpl 
> (MetricRegistriesImpl.java:lambda$create$1(61)) - First MetricRegistry has 
> been created without registering reporters. You may need to call 
> MetricRegistries.global().addReportRegistration(...) before.
> 2020-06-18T00:46:43.6500584Z 2020-06-18 00:46:43,646 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.write.element-limit = 4096 (default)
> 2020-06-18T00:46:43.6509832Z 2020-06-18 00:46:43,650 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.write.byte-limit = 64MB (=67108864) 
> (default)
> 2020-06-18T00:46:43.6641705Z 2020-06-18 00:46:43,656 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.watch.timeout = 10s (default)
> 2020-06-18T00:46:43.6648171Z 2020-06-18 00:46:43,658 
> [s0@group-3E7C5CE5BBB6-LeaderElection1] INFO  server.RaftServerConfigKeys 
> (ConfUtils.java:logGet(44)) - raft.server.watch.timeout.denomination = 1s 
> (default)
> 2020-06-18T00:46:43.6651041Z 2020-06-18 00:46:43,658 
> 

[jira] [Comment Edited] (RATIS-958) Support multiple requests in a single MessageOutputStream

2020-06-18 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139130#comment-17139130
 ] 

Tsz-wo Sze edited comment on RATIS-958 at 6/18/20, 6:45 AM:


r958_20200617.patch: 1st patch.

I have also created https://github.com/apache/incubator-ratis/pull/132


was (Author: szetszwo):
r958_20200617.patch: 1st patch.

> Support multiple requests in a single MessageOutputStream
> -
>
> Key: RATIS-958
> URL: https://issues.apache.org/jira/browse/RATIS-958
> Project: Ratis
>  Issue Type: Improvement
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r958_20200617.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, MessageOutputStream only supports one request per stream.  In this 
> JIRA, we will change it to support multiple requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-958) Support multiple requests in a single MessageOutputStream

2020-06-18 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139130#comment-17139130
 ] 

Tsz-wo Sze commented on RATIS-958:
--

r958_20200617.patch: 1st patch.

> Support multiple requests in a single MessageOutputStream
> -
>
> Key: RATIS-958
> URL: https://issues.apache.org/jira/browse/RATIS-958
> Project: Ratis
>  Issue Type: Improvement
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r958_20200617.patch
>
>
> Currently, MessageOutputStream only supports one request per stream.  In this 
> JIRA, we will change it to support multiple requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-958) Support multiple requests in a single MessageOutputStream

2020-06-18 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-958:
-
Attachment: r958_20200617.patch

> Support multiple requests in a single MessageOutputStream
> -
>
> Key: RATIS-958
> URL: https://issues.apache.org/jira/browse/RATIS-958
> Project: Ratis
>  Issue Type: Improvement
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r958_20200617.patch
>
>
> Currently, MessageOutputStream only supports one request per stream.  In this 
> JIRA, we will change it to support multiple requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-979) FlatBuffers streaming

2020-06-17 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-979:


 Summary: FlatBuffers streaming
 Key: RATIS-979
 URL: https://issues.apache.org/jira/browse/RATIS-979
 Project: Ratis
  Issue Type: New Feature
  Components: Streaming
Reporter: Tsz-wo Sze
Assignee: Ansh Khanna


According to the [FlatBuffers white 
paper|https://google.github.io/flatbuffers/flatbuffers_white_paper.html],
{quote}
You define your object types in a schema, which can then be compiled to C++ or 
Java for low to zero overhead reading & writing. ...
{quote}
In this JIRA, we investigate if it can be used to achieve zero buffer copying 
in Ratis Streaming.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-960) Add APIs to support streaming state machine data

2020-06-11 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133588#comment-17133588
 ] 

Tsz-wo Sze commented on RATIS-960:
--

Thanks [~shashikant].  I have just created 
https://github.com/apache/incubator-ratis/pull/121

> Add APIs to support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch, r960_20200603.patch, 
> r960_20200610.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-960) Add APIs to support streaming state machine data

2020-06-11 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-960:
-
Summary: Add APIs to support streaming state machine data  (was: Support 
streaming state machine data)

> Add APIs to support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch, r960_20200603.patch, 
> r960_20200610.patch
>
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-960) Support streaming state machine data

2020-06-10 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132750#comment-17132750
 ] 

Tsz-wo Sze commented on RATIS-960:
--

r960_20200610.patch: adds a new DataStream class so that the DataId class in 
the previous patch is no longer needed.

> Support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch, r960_20200603.patch, 
> r960_20200610.patch
>
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-960) Support streaming state machine data

2020-06-10 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-960:
-
Attachment: r960_20200610.patch

> Support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch, r960_20200603.patch, 
> r960_20200610.patch
>
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-960) Support streaming state machine data

2020-06-03 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125373#comment-17125373
 ] 

Tsz-wo Sze commented on RATIS-960:
--

r960_20200603.patch: adds more javadoc.

> Support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch, r960_20200603.patch
>
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-960) Support streaming state machine data

2020-06-03 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-960:
-
Attachment: r960_20200603.patch

> Support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch, r960_20200603.patch
>
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-959) Refactor xxxStateMachineData methods

2020-06-03 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124749#comment-17124749
 ] 

Tsz-wo Sze commented on RATIS-959:
--

Thanks [~shashikant].  I have just created 
https://github.com/apache/incubator-ratis/pull/119

> Refactor xxxStateMachineData methods
> 
>
> Key: RATIS-959
> URL: https://issues.apache.org/jira/browse/RATIS-959
> Project: Ratis
>  Issue Type: Improvement
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r959_20200529.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the StateMachine interface has quite a few methods related to 
> state machine data as below:
> - writeStateMachineData
> - readStateMachineData
> - flushStateMachineData
> - truncateStateMachineData
> We propose moving them to a new DataApi interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-959) Refactor xxxStateMachineData methods

2020-06-03 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-959:
-
Summary: Refactor xxxStateMachineData methods  (was: Refactor. 
xxxStateMachineData methods)

> Refactor xxxStateMachineData methods
> 
>
> Key: RATIS-959
> URL: https://issues.apache.org/jira/browse/RATIS-959
> Project: Ratis
>  Issue Type: Improvement
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r959_20200529.patch
>
>
> Currently, the StateMachine interface has quite a few methods related to 
> state machine data as below:
> - writeStateMachineData
> - readStateMachineData
> - flushStateMachineData
> - truncateStateMachineData
> We propose moving them to a new DataApi interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-960) Support streaming state machine data

2020-05-29 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120065#comment-17120065
 ] 

Tsz-wo Sze commented on RATIS-960:
--

r960_20200529.patch: depends on RATIS-959

> Support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch
>
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-960) Support streaming state machine data

2020-05-29 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-960:
-
Attachment: r960_20200529.patch

> Support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r960_20200529.patch
>
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-960) Support streaming state machine data

2020-05-29 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-960:
-
Component/s: StateMachine

> Support streaming state machine data
> 
>
> Key: RATIS-960
> URL: https://issues.apache.org/jira/browse/RATIS-960
> Project: Ratis
>  Issue Type: New Feature
>  Components: StateMachine
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>
> {code}
> //StateMachine
> CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
> {code}
> In StateMachine, we have writeStateMachineData to write the state machine 
> data in the given log entry.  It is inefficient to process state machine data 
> in a log entry when the data size is large.
> In this JIRA, we add new APIs to support streaming state machine data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-960) Support streaming state machine data

2020-05-29 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-960:


 Summary: Support streaming state machine data
 Key: RATIS-960
 URL: https://issues.apache.org/jira/browse/RATIS-960
 Project: Ratis
  Issue Type: New Feature
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


{code}
//StateMachine
CompletableFuture<Message> writeStateMachineData(LogEntryProto entry)
{code}
In StateMachine, we have writeStateMachineData to write the state machine data 
in the given log entry.  It is inefficient to process state machine data in a 
log entry when the data size is large.

In this JIRA, we add new APIs to support streaming state machine data.
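
As a rough illustration of the direction (the interface and method names below 
are hypothetical, not the actual API added in this JIRA), the data could be 
pushed in chunks instead of arriving in one large log entry:
{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: state machine data is streamed chunk by chunk
 *  and only linked to the log entry once the stream is closed. */
interface StateMachineDataStream {
  /** Asynchronously write the next chunk of state machine data. */
  CompletableFuture<Void> writeAsync(ByteBuffer chunk);

  /** Flush and close; the future completes when all chunks are durable. */
  CompletableFuture<Void> closeAsync();
}
{code}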



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-959) Refactor. xxxStateMachineData methods

2020-05-29 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-959:


 Summary: Refactor. xxxStateMachineData methods
 Key: RATIS-959
 URL: https://issues.apache.org/jira/browse/RATIS-959
 Project: Ratis
  Issue Type: Improvement
  Components: StateMachine
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


Currently, the StateMachine interface has quite a few methods related to state 
machine data as below:
- writeStateMachineData
- readStateMachineData
- flushStateMachineData
- truncateStateMachineData

We propose moving them to a new DataApi interface.
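
A minimal sketch of the proposal, assuming the four methods keep their current 
names; the signatures below are illustrative assumptions, not the committed 
interface:
{code:java}
import java.util.concurrent.CompletableFuture;

/** Groups the state machine data methods behind one pluggable interface.
 *  ENTRY stands in for the log entry type (illustrative only). */
interface DataApi<ENTRY> {
  CompletableFuture<?> writeStateMachineData(ENTRY entry);
  CompletableFuture<ENTRY> readStateMachineData(ENTRY entry);
  CompletableFuture<Void> flushStateMachineData(long logIndex);
  CompletableFuture<Void> truncateStateMachineData(long logIndex);
}
{code}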



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-958) Support multiple requests in a single MessageOutputStream

2020-05-29 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120056#comment-17120056
 ] 

Tsz-wo Sze commented on RATIS-958:
--

We propose adding an endOfRequest parameter to sendAsync, as shown below.
{code:java}
//MessageOutputStream
  CompletableFuture<RaftClientReply> sendAsync(Message message, boolean endOfRequest);
{code}
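
For example, a client could then batch several requests over one stream 
(illustrative fragment only; "out" stands for any MessageOutputStream and 
part1..part3 for arbitrary messages):
{code:java}
//hypothetical client usage
  out.sendAsync(part1, false); // first part of request 1
  out.sendAsync(part2, true);  // endOfRequest: request 1 is complete
  out.sendAsync(part3, true);  // a second request on the same stream
{code}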

> Support multiple requests in a single MessageOutputStream
> -
>
> Key: RATIS-958
> URL: https://issues.apache.org/jira/browse/RATIS-958
> Project: Ratis
>  Issue Type: Improvement
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>
> Currently, MessageOutputStream only supports one request per stream.  In this 
> JIRA, we will change it to support multiple requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-958) Support multiple requests in a single MessageOutputStream

2020-05-29 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-958:


 Summary: Support multiple requests in a single MessageOutputStream
 Key: RATIS-958
 URL: https://issues.apache.org/jira/browse/RATIS-958
 Project: Ratis
  Issue Type: Improvement
  Components: client, server
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


Currently, MessageOutputStream only supports one request per stream.  In this 
JIRA, we will change it to support multiple requests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-840) Memory leak of LogAppender

2020-04-23 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091159#comment-17091159
 ] 

Tsz-wo Sze commented on RATIS-840:
--

[~yjxxtd], the test failures seem related.  Could you take a look?

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png
>
>
> *What's the problem?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  In a 
> heap dump, I found 460710 instances of GrpcLogAppender, but only 6 instances 
> of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. There are also a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon thread when removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why does 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  run so many times?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> GrpcLogAppender then throws an exception when it finds that SegmentedRaftLog 
> was closed. GrpcLogAppender is then 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so it is restarted again ... . This results in 
> an infinite restart of GrpcLogAppender.
> 2. Actually, when a group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from 
> senders?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then blocks in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a new GrpcLogAppender, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to the follower, 
> the follower responds SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> [voterLists.get(0) | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607].
>  {color:#DE350B}The error happens because voterLists.get(0) returns the 
> FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new 
> GrpcLogAppender. {color}
> 5. Because the majority commit computed from the FollowerInfo of the old 
> GrpcLogAppender never changes, even though the follower has appended the 
> entry successfully, the leader cannot update the commit. So the newly created 
> 

[jira] [Commented] (RATIS-841) Remove unnecessary exception checks in OrderedAsync#sendRequest

2020-04-23 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090380#comment-17090380
 ] 

Tsz-wo Sze commented on RATIS-841:
--

I see.  Thanks!

+1 the 002 patch looks good.

> Remove unnecessary exception checks in OrderedAsync#sendRequest
> ---
>
> Key: RATIS-841
> URL: https://issues.apache.org/jira/browse/RATIS-841
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-841.001.patch, RATIS-841.002.patch
>
>
> OrderedAsync#sendRequest does not require exception checks for 
> NotLeaderException as RaftClientReply is already checked for these exceptions 
> in GrpcClientProtocolClient$AsyncStreamObservers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (RATIS-840) Memory leak of LogAppender

2020-04-23 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090357#comment-17090357
 ] 

Tsz-wo Sze edited comment on RATIS-840 at 4/23/20, 7:45 AM:


[~yjxxtd], in this JIRA, let's focus on fixing voterLists.

For the restart problem, we may limit the number of alive GrpcLogAppenders, say 
3.  When creating the 4th GrpcLogAppender, it has to wait until a previous 
appender is dead.  Sounds good?  We should do it in a separate JIRA.
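
A minimal sketch of such a limit, e.g. with a semaphore (illustrative only, 
not the actual Ratis code; the class and method names are made up):
{code:java}
import java.util.concurrent.Semaphore;

class LogAppenderLimiter {
  private final Semaphore alive = new Semaphore(3); // at most 3 alive appenders

  /** Creating the 4th appender blocks until a previous one has died. */
  void beforeStart() throws InterruptedException {
    alive.acquire();
  }

  /** Must be called when an appender thread terminates. */
  void afterDeath() {
    alive.release();
  }
}
{code}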


was (Author: szetszwo):
[~yjxxtd], in this JIRA, let's focus on fixing voterLists.

For the restart problem, we may limit the number of alive GrpcLogAppender, say 
3.  When creating the fourth GrpcLogAppender, it has to wait unit a previous 
appender dead.  Sound good?  We should do it in a separated JIRA.

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, 
> screenshot-1.png, screenshot-2.png
>
>
> *What's the problem?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  In a 
> heap dump, I found 460710 instances of GrpcLogAppender, but only 6 instances 
> of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. There are also a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon thread when removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why does 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  run so many times?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> GrpcLogAppender then throws an exception when it finds that SegmentedRaftLog 
> was closed. GrpcLogAppender is then 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so it is restarted again ... . This results in 
> an infinite restart of GrpcLogAppender.
> 2. Actually, when a group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from 
> senders?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then blocks in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a new GrpcLogAppender, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to the follower, 
> the follower responds SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> 

[jira] [Commented] (RATIS-841) Remove unnecessary exception checks in OrderedAsync#sendRequest

2020-04-23 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090351#comment-17090351
 ] 

Tsz-wo Sze commented on RATIS-841:
--

Why skip client.handleLeaderException?  In RaftServerImpl.checkLeaderState, 
we do put LeaderNotReadyException in a reply.
{code}
//RaftServerImpl.checkLeaderState
  final LeaderNotReadyException lnre = new LeaderNotReadyException(getMemberId());
  final RaftClientReply reply = new RaftClientReply(request, lnre, getCommitInfos());
  return RetryCache.failWithReply(reply, entry);
{code}

> Remove unnecessary exception checks in OrderedAsync#sendRequest
> ---
>
> Key: RATIS-841
> URL: https://issues.apache.org/jira/browse/RATIS-841
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-841.001.patch, RATIS-841.002.patch
>
>
> OrderedAsync#sendRequest does not require exception checks for 
> NotLeaderException as RaftClientReply is already checked for these exceptions 
> in GrpcClientProtocolClient$AsyncStreamObservers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-841) Remove unnecessary exception checks in OrderedAsync#sendRequest

2020-04-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089182#comment-17089182
 ] 

Tsz-wo Sze commented on RATIS-841:
--

Hi [~ljain], the 001 patch does not apply.

> Remove unnecessary exception checks in OrderedAsync#sendRequest
> ---
>
> Key: RATIS-841
> URL: https://issues.apache.org/jira/browse/RATIS-841
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-841.001.patch
>
>
> OrderedAsync#sendRequest does not require exception checks for 
> NotLeaderException as RaftClientReply is already checked for these exceptions 
> in GrpcClientProtocolClient$AsyncStreamObservers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (RATIS-840) Memory leak of LogAppender

2020-04-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088969#comment-17088969
 ] 

Tsz-wo Sze edited comment on RATIS-840 at 4/21/20, 6:47 PM:


> ...  It passed because it forget to stop the old GrpcLogAppender, the old 
> GrpcLogAppender still work, and the new GprcLogAppender does not work at all.

There may be a bug but the correct fix is not to pollute SenderList since the 
bug is not in SenderList.  I am fine with adding the stop call outside.

> ...  How can I avoid GrpcLogAppender infinite restart when log is close ?

As long as GrpcLogAppender could stop, it can be infinitely restarted.  We 
don't have to wait until it becomes dead.


was (Author: szetszwo):
> ...  It passed because it forget to stop the old GrpcLogAppender, the old 
> GrpcLogAppender still work, and the new GprcLogAppender does not work at all.

There may be a bug but the correct fix is not to pollute SenderList since the 
bug is not in SenderList.  I am fine to add the stop call outside.

> ...  How can I avoid GrpcLogAppender infinite restart when log is close ?

As long as GrpcLogAppender could stop, it can be infinitely restarted.  We 
don't have to wait it becoming dead so that we don't have to call join().

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, 
> screenshot-1.png, screenshot-2.png
>
>
> *What's the problem?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  In a 
> heap dump, I found 460710 instances of GrpcLogAppender, but only 6 instances 
> of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. There are also a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon thread when removed from senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why does 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  run so many times?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> GrpcLogAppender then throws an exception when it finds that SegmentedRaftLog 
> was closed. GrpcLogAppender is then 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so it is restarted again ... . This results in 
> an infinite restart of GrpcLogAppender.
> 2. Actually, when a group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from 
> senders?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then blocks in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a new GrpcLogAppender, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new 
> 

[jira] [Commented] (RATIS-840) Memory leak of LogAppender

2020-04-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088969#comment-17088969
 ] 

Tsz-wo Sze commented on RATIS-840:
--

> ...  It passed because it forgot to stop the old GrpcLogAppender; the old 
> GrpcLogAppender still works, and the new GrpcLogAppender does not work at all.

There may be a bug but the correct fix is not to pollute SenderList since the 
bug is not in SenderList.  I am fine with adding the stop call outside.

> ...  How can I avoid the infinite restart of GrpcLogAppender when the log is closed?

As long as GrpcLogAppender can stop, it can be restarted infinitely.  We 
don't have to wait for it to become dead, so we don't have to call join().

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, 
> screenshot-1.png, screenshot-2.png
>
>
> *What's the problem ?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  When 
> dumping the heap, I found there are 460710 instances of GrpcLogAppender, but 
> only 6 instances of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon threads when removed from the senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why is 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  called so many times ?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> then GrpcLogAppender throws an exception when it finds that SegmentedRaftLog 
> was closed. Then GrpcLogAppender is 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so GrpcLogAppender is restarted again ... 
> . This results in an infinite restart of GrpcLogAppender.
> 2. Actually, when the group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, the GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from senders 
> ?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then blocks in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally ?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a new GrpcLogAppender, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to a follower, the 
> follower responds SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> [voterLists.get(0) | 
> 

[jira] [Commented] (RATIS-840) Memory leak of LogAppender

2020-04-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088236#comment-17088236
 ] 

Tsz-wo Sze commented on RATIS-840:
--

> ...  because someone may forget to startAppender/stopAppender ...

It probably won't help.  How could we predict who would forget what?

> ...  I think we should block stop() until the LogAppender thread really 
> stops. ...

If this statement is correct, why not call join() without a timeout?  The idea 
is to interrupt the thread.  It is unimportant to wait for the thread to become 
dead.
 
Adding server.isAlive() is a bug, IMO.

> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, 
> screenshot-1.png, screenshot-2.png
>
>
> *What's the problem ?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  When 
> dumping the heap, I found there are 460710 instances of GrpcLogAppender, but 
> only 6 instances of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon threads when removed from the senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why is 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  called so many times ?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> then GrpcLogAppender throws an exception when it finds that SegmentedRaftLog 
> was closed. Then GrpcLogAppender is 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so GrpcLogAppender is restarted again ... 
> . This results in an infinite restart of GrpcLogAppender.
> 2. Actually, when the group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, the GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from senders 
> ?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then blocks in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally ?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a new GrpcLogAppender, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to a follower, the 
> follower responds SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> [voterLists.get(0) | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607].
>  {color:#DE350B}Error happens because 

[jira] [Commented] (RATIS-840) Memory leak of LogAppender

2020-04-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088015#comment-17088015
 ] 

Tsz-wo Sze commented on RATIS-840:
--

[~yjxxtd], thanks for the patch!  Some comments:
- SenderList is simply a list.  Let's not put the startAppender/stopAppender 
there.
- Let's add a getFollowerInfos as below to convert RaftPeerId to FollowerInfo.  
Then
{code}
  private List<FollowerInfo> getFollowerInfos(List<RaftPeerId> followerIDs) {
...
  }
{code}
- In AppenderDaemon.stop(), join() should be called in a separate thread; see 
the sketch after this comment.  Otherwise, it will slow down stop().  Since it 
is only for logging purposes, how about changing this in a separate JIRA?

- For all the error messages, please put the name at the beginning and use the 
member id instead of the peer id.
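
A minimal sketch of the join-off-the-stop-path idea (hypothetical class and 
field names, not the actual AppenderDaemon code): stop() interrupts the 
appender thread and returns immediately, while the join() used only for 
logging runs on its own short-lived thread.
{code:java}
class AppenderDaemonSketch {
  private final Thread daemon;

  AppenderDaemonSketch(Runnable appenderLoop) {
    this.daemon = new Thread(appenderLoop, "appender-daemon");
    this.daemon.setDaemon(true);
  }

  void start() {
    daemon.start();
  }

  void stop() {
    daemon.interrupt(); // ask the loop to exit; do not block the caller
    // join() is only for logging when the thread actually dies, so run it
    // on a separate thread instead of slowing down stop().
    new Thread(() -> {
      try {
        daemon.join();
        System.out.println(daemon.getName() + " stopped");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "appender-daemon-join").start();
  }
}
{code}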


> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, 
> screenshot-1.png, screenshot-2.png
>
>
> *What's the problem ?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  When 
> dumping the heap, I found there are 460710 instances of GrpcLogAppender, but 
> only 6 instances of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon threads when removed from the senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why is 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  called so many times ?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> then GrpcLogAppender throws an exception when it finds that SegmentedRaftLog 
> was closed. Then GrpcLogAppender is 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so GrpcLogAppender is restarted again ... 
> . This results in an infinite restart of GrpcLogAppender.
> 2. Actually, when the group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, the GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from senders 
> ?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then blocks in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally ?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a new GrpcLogAppender, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to a follower, the 
> follower responds SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> [voterLists.get(0) | 
> 

[jira] [Comment Edited] (RATIS-840) Memory leak of LogAppender

2020-04-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088015#comment-17088015
 ] 

Tsz-wo Sze edited comment on RATIS-840 at 4/20/20, 6:51 PM:


[~yjxxtd], thanks for the patch! Some comments:
 - SenderList is simply a list. Let's not put the startAppender/stopAppender 
there.
 - Let's add a getFollowerInfos as below to convert RaftPeerId to FollowerInfo. 
Then
{code:java}
  private List<FollowerInfo> getFollowerInfos(List<RaftPeerId> followerIDs) {
...
  }
{code}

 - In AppenderDaemon.stop(), join() should be called in a separate thread. 
Otherwise, it will slow down stop(). Since it is only for logging purposes, how 
about changing it in a separate JIRA?

 - For all the error messages, please put the name at the beginning and use the 
member id instead of the peer id.


was (Author: szetszwo):
[~yjxxtd], thanks for the patch!  Some comments:
- SenderList is simply a list.  Let's not put the startAppender/stopAppender 
there.
- Let's add a getFollowerInfos as below to convert RaftPeerId to FollowerInfo.  
Then
{code}
  private List<FollowerInfo> getFollowerInfos(List<RaftPeerId> followerIDs) {
...
  }
{code}
- In AppenderDaemon.stop(), join() should be called in a separate thread.  
Otherwise, it will slow down stop().  Since it is only for logging purposes, how 
about changing this in a separate JIRA?

- For all the error messages, please put the name at the beginning and use the 
member id instead of the peer id.


> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
> Attachments: RATIS-840.001.patch, RATIS-840.002.patch, 
> image-2020-04-06-14-27-28-485.png, image-2020-04-06-14-27-39-582.png, 
> screenshot-1.png, screenshot-2.png
>
>
> *What's the problem ?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  When 
> dumping the heap, I found there are 460710 instances of GrpcLogAppender, but 
> only 6 instances of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon threads when removed from the senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why is 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  called so many times ?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> then GrpcLogAppender throws an exception when it finds that SegmentedRaftLog 
> was closed. Then GrpcLogAppender is 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so GrpcLogAppender is restarted again ... 
> . This results in an infinite restart of GrpcLogAppender.
> 2. Actually, when the group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, the GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from senders 
> ?*
>  I found a lot of GrpcLogAppender threads blocked inside log4j. I think 
> GrpcLogAppender restarts too fast and then blocks in log4j.
>  !screenshot-2.png! 
> *Can the new GrpcLogAppender work normally ?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a 

[jira] [Updated] (RATIS-857) Thread unsafe RaftServerMetrics::metricsMap HashMap in multi thread

2020-04-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-857:
-
Component/s: metrics
Summary: Thread unsafe RaftServerMetrics::metricsMap HashMap in multi 
thread  (was: Thread unsafe HashMap in multi thread)

> Thread unsafe RaftServerMetrics::metricsMap HashMap in multi thread
> ---
>
> Key: RATIS-857
> URL: https://issues.apache.org/jira/browse/RATIS-857
> Project: Ratis
>  Issue Type: Bug
>  Components: metrics
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Attachments: RATIS-857.001.patch
>
>
> *What's the problem ?*
> The {color:#DE350B}static{color} variable 
> [RaftServerMetrics::metricsMap|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L71]
>  is of type HashMap, which is not thread-safe. But entries will be 
> [put|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L76]
>  into metricsMap by different threads when each RaftServerImpl instance is 
> created.
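
A minimal sketch of the fix direction (illustrative names; the real map holds 
metric objects keyed per server): make the shared map a ConcurrentHashMap so 
that concurrent RaftServerImpl construction is safe.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RaftServerMetricsSketch {
  // Thread-safe replacement for the static HashMap: several threads may
  // register metrics concurrently while creating RaftServerImpl instances.
  private static final Map<String, RaftServerMetricsSketch> METRICS_MAP =
      new ConcurrentHashMap<>();

  static RaftServerMetricsSketch register(String serverId) {
    // computeIfAbsent is atomic on ConcurrentHashMap, so two threads cannot
    // both insert an entry for the same server id.
    return METRICS_MAP.computeIfAbsent(serverId,
        id -> new RaftServerMetricsSketch());
  }
}
{code}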



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-835) Include exception based attempt count in raft client request

2020-04-17 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085904#comment-17085904
 ] 

Tsz-wo Sze commented on RATIS-835:
--

The 004 patch looks good.  Just a minor comment:
- In RaftClientImpl.sendRequestWithRetry, the dummy pending local variable is 
not needed.  Just use 1 directly:
{code}
  final int exceptionCount = ioe != null ? 1 : 0;
  final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, ioe);
{code}


> Include exception based attempt count in raft client request
> 
>
> Key: RATIS-835
> URL: https://issues.apache.org/jira/browse/RATIS-835
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-835.001.patch, RATIS-835.002.patch, 
> RATIS-835.003.patch, RATIS-835.004.patch
>
>
> The client needs to maintain an exception-based attempt count for using an 
> exception-dependent retry policy. An exception-dependent policy helps in 
> specifying individual policies for different exception types.
> Currently, a policy takes the number of attempts as an argument. Therefore the 
> individual policies require attempt counts for the particular exception while 
> handling a retry event. This is particularly important for using the 
> MultipleLinearRandomRetry policy, which increases the sleep interval based on 
> the number of attempts made by the client. The Raft client can therefore use 
> this policy for ResourceUnavailableException and increase the sleep interval 
> for subsequent retries of the request on the same exception.
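
A sketch of the per-exception counting described above (hypothetical, 
simplified types; not the actual Ratis retry API): attempts are counted per 
exception class so a linear policy can grow its sleep interval on repeated 
ResourceUnavailableException.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PerExceptionRetrySketch {
  // Attempt counts keyed by exception type.
  private final Map<Class<? extends Throwable>, Integer> counts =
      new ConcurrentHashMap<>();

  // Record one more occurrence of this exception type and return the count.
  int incrementCount(Throwable t) {
    return counts.merge(t.getClass(), 1, Integer::sum);
  }

  // Sleep grows linearly with the number of attempts for this exception
  // type, in the spirit of MultipleLinearRandomRetry.
  long sleepMillis(Throwable t, long baseMillis) {
    return baseMillis * counts.getOrDefault(t.getClass(), 0);
  }
}
{code}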



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-840) Memory leak of LogAppender

2020-04-16 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085273#comment-17085273
 ] 

Tsz-wo Sze commented on RATIS-840:
--

> ... Error happens because voterLists.get(0) returns the FollowerInfo of the 
> old GrpcLogAppender, not the FollowerInfo of the new GrpcLogAppender. ...

Good catch!  It seems better to change voterLists to use RaftPeerId, i.e.
{code}
List<List<RaftPeerId>> voterLists;
{code}
and then get the FollowerInfo from SenderList.

Looking forward to seeing your patch.  Thanks a lot.


> Memory leak of LogAppender
> --
>
> Key: RATIS-840
> URL: https://issues.apache.org/jira/browse/RATIS-840
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Attachments: image-2020-04-06-14-27-28-485.png, 
> image-2020-04-06-14-27-39-582.png, screenshot-1.png
>
>
> *What's the problem ?*
>  After running hadoop-ozone for 4 days, the datanode leaked memory.  When 
> dumping the heap, I found there are 460710 instances of GrpcLogAppender, but 
> only 6 instances of SenderList, and each SenderList contains 1-2 instances of 
> GrpcLogAppender. And there are a lot of logs related to 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428].
>  {code:java}INFO impl.RaftServerImpl: 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: 
> Restarting GrpcLogAppender for 
> 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code}
>  
>  So there are a lot of GrpcLogAppender instances that did not stop their 
> daemon threads when removed from the senders. 
>  !image-2020-04-06-14-27-28-485.png! 
>  !image-2020-04-06-14-27-39-582.png! 
>  
> *Why is 
> [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]
>  called so many times ?*
> 1. As the image shows, when a group is removed, SegmentedRaftLog is closed; 
> then GrpcLogAppender throws an exception when it finds that SegmentedRaftLog 
> was closed. Then GrpcLogAppender is 
> [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94],
>  and the new GrpcLogAppender throws the exception again when it finds that 
> SegmentedRaftLog was closed, so GrpcLogAppender is restarted again ... 
> . This results in an infinite restart of GrpcLogAppender.
> 2. Actually, when the group is removed, GrpcLogAppender is stopped: 
> RaftServerImpl::shutdown -> 
> [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266]
>  -> LeaderState::stop -> LogAppender::stopAppender; then SegmentedRaftLog 
> is closed:  RaftServerImpl::shutdown -> 
> [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271]
>  ... . Though RoleInfo::shutdownLeaderState is called before 
> ServerState:close, the GrpcLogAppender is stopped asynchronously. So the 
> infinite restart of GrpcLogAppender happens when GrpcLogAppender stops after 
> SegmentedRaftLog closes.
>  !screenshot-1.png! 
> *Why did GrpcLogAppender not stop the daemon thread when removed from senders 
> ?*
> {color:#DE350B}Still working. The previous patch has some problems, and I will 
> submit it again.{color}
> *Can the new GrpcLogAppender work normally ?*
> 1. Even without the above problem, the newly created GrpcLogAppender still 
> cannot work normally. 
> 2. When creating a new GrpcLogAppender, a new FollowerInfo is also created: 
> LeaderState::addAndStartSenders -> 
> LeaderState::addSenders -> RaftServerImpl::newLogAppender -> [new 
> FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129]
> 3. When the newly created GrpcLogAppender appends an entry to a follower, the 
> follower responds SUCCESS.
> 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599]
>  -> 
> [voterLists.get(0) | 
> https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607].
>  {color:#DE350B}Error happens because voterLists.get(0) returns the 
> FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new 
> GrpcLogAppender. {color}
> 5. Because the majority commit obtained from the FollowerInfo of the old 
> 

[jira] [Commented] (RATIS-848) Failed UT: TestRaftSnapshotWithGrpc.testBasicInstallSnapshot

2020-04-14 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083510#comment-17083510
 ] 

Tsz-wo Sze commented on RATIS-848:
--

[~yjxxtd], thanks a lot for filing and working on this.

[~avijayan], could you take a look?

> Failed UT: TestRaftSnapshotWithGrpc.testBasicInstallSnapshot
> 
>
> Key: RATIS-848
> URL: https://issues.apache.org/jira/browse/RATIS-848
> Project: Ratis
>  Issue Type: Bug
>  Components: test
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Minor
> Attachments: RATIS-848.001.patch
>
>
> *Why does the unit test TestRaftSnapshotWithGrpc.testBasicInstallSnapshot 
> fail?*
> In the test, the leader takes a snapshot, then two followers are added, then 
> the [leader 
> restarts|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/test/java/org/apache/ratis/statemachine/RaftSnapshotBaseTest.java#L234],
>  then 
> [verifyTakeSnapshotMetric|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/test/java/org/apache/ratis/statemachine/RaftSnapshotBaseTest.java#L236]
>  runs. 
> It must fail when verifyTakeSnapshotMetric runs after restarting the leader, 
> because restarting the leader creates a new instance of RaftServer, so the 
> take-snapshot metric is cleared.
> *How to fix ?*
> Run verifyTakeSnapshotMetric after the leader takes the snapshot.
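
A sketch of the reordering (stand-in helpers; the real test lives in 
RaftSnapshotBaseTest): assert the metric while the original RaftServer 
instance is still running, since a restart replaces the server and resets its 
metrics.
{code:java}
class SnapshotMetricTestSketch {
  // Stand-ins for the real helpers in RaftSnapshotBaseTest.
  void takeSnapshot() {}
  void addFollowers(int n) {}
  void restartLeader() {} // creates a new RaftServer; metrics reset to zero
  void verifyTakeSnapshotMetric() {}

  // The fix: verify the metric before the restart wipes it out.
  void testBasicInstallSnapshot() {
    takeSnapshot();
    verifyTakeSnapshotMetric(); // moved before the restart
    addFollowers(2);
    restartLeader();
  }
}
{code}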



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-848) Failed UT: TestRaftSnapshotWithGrpc.testBasicInstallSnapshot

2020-04-14 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083505#comment-17083505
 ] 

Tsz-wo Sze commented on RATIS-848:
--

This bug is also reported in RATIS-819.

> Failed UT: TestRaftSnapshotWithGrpc.testBasicInstallSnapshot
> 
>
> Key: RATIS-848
> URL: https://issues.apache.org/jira/browse/RATIS-848
> Project: Ratis
>  Issue Type: Bug
>  Components: test
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Minor
> Attachments: RATIS-848.001.patch
>
>
> *Why does the unit test TestRaftSnapshotWithGrpc.testBasicInstallSnapshot 
> fail?*
> In the test, the leader takes a snapshot, then two followers are added, then 
> the [leader 
> restarts|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/test/java/org/apache/ratis/statemachine/RaftSnapshotBaseTest.java#L234],
>  then 
> [verifyTakeSnapshotMetric|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/test/java/org/apache/ratis/statemachine/RaftSnapshotBaseTest.java#L236]
>  runs. 
> It must fail when verifyTakeSnapshotMetric runs after restarting the leader, 
> because restarting the leader creates a new instance of RaftServer, so the 
> take-snapshot metric is cleared.
> *How to fix ?*
> Run verifyTakeSnapshotMetric after the leader takes the snapshot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-848) Failed UT: TestRaftSnapshotWithGrpc.testBasicInstallSnapshot

2020-04-14 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-848:
-
Component/s: test
   Priority: Minor  (was: Major)

> Failed UT: TestRaftSnapshotWithGrpc.testBasicInstallSnapshot
> 
>
> Key: RATIS-848
> URL: https://issues.apache.org/jira/browse/RATIS-848
> Project: Ratis
>  Issue Type: Bug
>  Components: test
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Minor
> Attachments: RATIS-848.001.patch
>
>
> *Why does the unit test TestRaftSnapshotWithGrpc.testBasicInstallSnapshot 
> fail?*
> In the test, the leader takes a snapshot, then two followers are added, then 
> the [leader 
> restarts|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/test/java/org/apache/ratis/statemachine/RaftSnapshotBaseTest.java#L234],
>  then 
> [verifyTakeSnapshotMetric|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/test/java/org/apache/ratis/statemachine/RaftSnapshotBaseTest.java#L236]
>  runs. 
> It must fail when verifyTakeSnapshotMetric runs after restarting the leader, 
> because restarting the leader creates a new instance of RaftServer, so the 
> take-snapshot metric is cleared.
> *How to fix ?*
> Run verifyTakeSnapshotMetric after the leader takes the snapshot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-819) testBasicInstallSnapshot is failing with metric

2020-04-14 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-819:
-
Parent: RATIS-812
Issue Type: Sub-task  (was: Bug)

> testBasicInstallSnapshot is failing with metric
> ---
>
> Key: RATIS-819
> URL: https://issues.apache.org/jira/browse/RATIS-819
> Project: Ratis
>  Issue Type: Sub-task
>  Components: test
>Reporter: Tsz-wo Sze
>Assignee: Aravindan Vijayan
>Priority: Minor
> Attachments: RATIS-819-000.patch
>
>
> {code}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.verifyTakeSnapshotMetric(RaftSnapshotBaseTest.java:257)
>   at 
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.testBasicInstallSnapshot(RaftSnapshotBaseTest.java:236)
>   ...
> {code}
> It can be reproduced by running TestRaftSnapshotWithSimulatedRpc a few times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (RATIS-835) Include exception based attempt count in raft client request

2020-04-13 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082511#comment-17082511
 ] 

Tsz-wo Sze edited comment on RATIS-835 at 4/13/20, 5:35 PM:


Thanks [~ljain].
- In RaftClientImpl.PendingClientRequest, let's rename exceptionAttemptCount to 
exceptionCounts and the corresponding methods.  Also it should use 
Class<? extends Throwable>, i.e.
{code}
private final Map<Class<? extends Throwable>, Integer> exceptionCounts = new 
ConcurrentHashMap<>();
{code}

- In ClientRetryEvent, the exceptionAttemptCount is counting the number of 
occurrences of the cause.  So, let's rename it to causeCount and the 
corresponding methods.
-* Let's also change the order of the parameter in ClientRetryEvent(..) to
{code}
  public ClientRetryEvent(int attemptCount, RaftClientRequest request, int 
causeCount, Throwable cause) {
{code}
so that it is more clear that the attemptCount is counting the request and the 
causeCount is counting the cause.

- PendingClientRequest.incrementExceptionCount (after rename) should return the 
new value, i.e.
{code}
int incrementExceptionCount(Throwable t) {
  return exceptionCounts.compute(t.getClass(), (k, v) -> v != null ? v + 1 
: 1);
}
{code}
Then the code can be simplified as below:
-# OrderedAsync
{code}
final int exceptionCount = pending.incrementExceptionCount(e);
final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, e);
{code}
-# RaftClientImpl 
{code}
  final int exceptionCount = ioe != null? 
pending.incrementExceptionCount(ioe): 0;
  final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, ioe);

{code}
-# UnorderedAsync 
{code}
final Throwable cause = replyException != null ? replyException : e;
final int causeCount = pending.incrementExceptionCount(cause);
final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, causeCount, cause)
{code}
getExceptionCount should check null.
{code}
int getExceptionCount(Throwable t) {
  return Optional.ofNullable(exceptionCounts.get(t.getClass())).orElse(0);
}
{code}



was (Author: szetszwo):
Thanks [~ljain].
- In RaftClientImpl.PendingClientRequest, let's rename exceptionAttemptCount to 
exceptionCounts and the corresponding methods.  Also it should use 
Class<? extends Throwable>, i.e.
{code}
private final Map<Class<? extends Throwable>, Integer> exceptionCounts = new 
ConcurrentHashMap<>();
{code}

- In ClientRetryEvent, the exceptionAttemptCount is counting the number of 
occurrences of the cause.  So, let's rename it to causeCount and the 
corresponding methods.
-* Let's also change the order of the parameter in ClientRetryEvent(..) to
{code}
  public ClientRetryEvent(int attemptCount, RaftClientRequest request, int 
causeCount, Throwable cause) {
{code}
so that it is more clear that the attemptCount is counting the request and the 
causeCount is counting the cause.

- PendingClientRequest.incrementExceptionCount (after rename) should return the 
new value, i.e.
{code}
int incrementExceptionCount(Throwable t) {
  return exceptionCounts.compute(t.getClass(), (k, v) -> v != null ? v + 1 
: 1);
}
{code}
Then the code can be simplified as below:
-# OrderedAsync
{code}
final int exceptionCount = pending.incrementExceptionCount(e);
final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, e);
{code}
-# RaftClientImpl 
{code}
  final int exceptionCount = ioe != null? 
pending.incrementExceptionCount(ioe): 0;
  final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, ioe);

{code}
-# UnorderedAsync 
{code}
final Throwable cause = replyException != null ? replyException : e;
final int causeCount = pending.incrementExceptionCount(cause);
final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, causeCount, cause)
{code}
getExceptionCount should check null.
{code}
int getExceptionCount(Throwable t) {
  return Optional.ofNullable(exceptionCounts.get(t.getClass())).orElse(0);
}
{code}


> Include exception based attempt count in raft client request
> 
>
> Key: RATIS-835
> URL: https://issues.apache.org/jira/browse/RATIS-835
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-835.001.patch, RATIS-835.002.patch, 
> RATIS-835.003.patch
>
>
> The client needs to maintain an exception-based attempt count for using an 
> exception-dependent retry policy. An exception-dependent policy helps in 
> specifying individual policies for different exception types.
> Currently, a policy takes the number of attempts as an argument. Therefore the 
> individual policies require attempt 

[jira] [Commented] (RATIS-835) Include exception based attempt count in raft client request

2020-04-13 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082511#comment-17082511
 ] 

Tsz-wo Sze commented on RATIS-835:
--

Thanks [~ljain].
- In RaftClientImpl.PendingClientRequest, let's rename exceptionAttemptCount to 
exceptionCounts and the corresponding methods.  Also it should use 
Class<? extends Throwable>, i.e.
{code}
private final Map<Class<? extends Throwable>, Integer> exceptionCounts = new 
ConcurrentHashMap<>();
{code}

- In ClientRetryEvent, the exceptionAttemptCount is counting the number of 
occurrences of the cause.  So, let's rename it to causeCount and the 
corresponding methods.
-* Let's also change the order of the parameter in ClientRetryEvent(..) to
{code}
  public ClientRetryEvent(int attemptCount, RaftClientRequest request, int 
causeCount, Throwable cause) {
{code}
so that it is more clear that the attemptCount is counting the request and the 
causeCount is counting the cause.

- PendingClientRequest.incrementExceptionCount (after rename) should return the 
new value, i.e.
{code}
int incrementExceptionCount(Throwable t) {
  return exceptionCounts.compute(t.getClass(), (k, v) -> v != null ? v + 1 
: 1);
}
{code}
Then the code can be simplified as below:
-# OrderedAsync
{code}
final int exceptionCount = pending.incrementExceptionCount(e);
final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, e);
{code}
-# RaftClientImpl 
{code}
  final int exceptionCount = ioe != null? 
pending.incrementExceptionCount(ioe): 0;
  final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, exceptionCount, ioe);

{code}
-# UnorderedAsync 
{code}
final Throwable cause = replyException != null ? replyException : e;
final int causeCount = pending.incrementExceptionCount(cause);
final ClientRetryEvent event = new ClientRetryEvent(attemptCount, 
request, causeCount, cause)
{code}
getExceptionCount should check null.
{code}
int getExceptionCount(Throwable t) {
  return Optional.ofNullable(exceptionCounts.get(t.getClass())).orElse(0);
}
{code}


> Include exception based attempt count in raft client request
> 
>
> Key: RATIS-835
> URL: https://issues.apache.org/jira/browse/RATIS-835
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-835.001.patch, RATIS-835.002.patch, 
> RATIS-835.003.patch
>
>
> The client needs to maintain an exception-based attempt count for using an 
> exception-dependent retry policy. An exception-dependent policy helps in 
> specifying individual policies for different exception types.
> Currently, a policy takes the number of attempts as an argument. Therefore the 
> individual policies require attempt counts for the particular exception while 
> handling a retry event. This is particularly important for using the 
> MultipleLinearRandomRetry policy, which increases the sleep interval based on 
> the number of attempts made by the client. The Raft client can therefore use 
> this policy for ResourceUnavailableException and increase the sleep interval 
> for subsequent retries of the request on the same exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-772) Fix checkstyle violations in ratis-grpc

2020-04-01 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072927#comment-17072927
 ] 

Tsz-wo Sze commented on RATIS-772:
--

Thanks a lot for the update.

Let's not add @SuppressWarnings in general, since it just "hides" the warnings 
instead of fixing them.
- For GrpcClientProtocolClient,
{code}
+@SuppressWarnings("linelength")
 private final StreamObserver<RaftClientReplyProto> replyStreamObserver = 
new StreamObserver<RaftClientReplyProto>() {
{code}
we may change it to
{code}
private final StreamObserver<RaftClientReplyProto> replyStreamObserver
    = new StreamObserver<RaftClientReplyProto>() {
{code}
- For the parameternumber in GrpcService, let's just ignore it for now.


> Fix checkstyle violations in ratis-grpc
> ---
>
> Key: RATIS-772
> URL: https://issues.apache.org/jira/browse/RATIS-772
> Project: Ratis
>  Issue Type: Sub-task
>  Components: gRPC
>Reporter: Dinesh Chitlangia
>Assignee: Dinesh Chitlangia
>Priority: Major
> Attachments: RATIS-772.001.patch, RATIS-772.002.patch, 
> RATIS-772.003.patch
>
>
> Fix checkstyle violations in ratis-grpc module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-835) Include exception based attempt count in raft client request

2020-04-01 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072574#comment-17072574
 ] 

Tsz-wo Sze commented on RATIS-835:
--

[~ljain], thanks for working on this.

It seems there are quite a few bug fixes in the patch.  How about separating 
them into a new JIRA?

> Include exception based attempt count in raft client request
> 
>
> Key: RATIS-835
> URL: https://issues.apache.org/jira/browse/RATIS-835
> Project: Ratis
>  Issue Type: Bug
>  Components: client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Attachments: RATIS-835.001.patch, RATIS-835.002.patch
>
>
> Client needs to maintain exception based attempt count for using Exception 
> Dependent retry policy. Exception dependent policy helps in specifying 
> individual policies for different exception types.
> Currently policy takes number of attempts as argument. Therefore the 
> individual policies require attempt counts for the particular exception while 
> handling retry event. This is particularly important for using 
> MulipleLinearRandomRetry policy which increases sleep interval based on 
> number of attempts made by the client. Raft Client can therefore use this 
> policy for ResourceUnavailableException and increase sleep interval for 
> subsequent retries of the request on the same exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-772) Fix checkstyle violations in ratis-grpc

2020-03-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063498#comment-17063498
 ] 

Tsz-wo Sze commented on RATIS-772:
--

[~dineshchitlangia], the 002 patch no longer applies.  Could you update it?

> Fix checkstyle violations in ratis-grpc
> ---
>
> Key: RATIS-772
> URL: https://issues.apache.org/jira/browse/RATIS-772
> Project: Ratis
>  Issue Type: Sub-task
>  Components: gRPC
>Reporter: Dinesh Chitlangia
>Assignee: Dinesh Chitlangia
>Priority: Major
> Attachments: RATIS-772.001.patch, RATIS-772.002.patch
>
>
> Fix checkstyle violations in ratis-grpc module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-826) Support stream to send large AppendEntriesRequestProto

2020-03-09 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-826:


 Summary: Support stream to send large AppendEntriesRequestProto
 Key: RATIS-826
 URL: https://issues.apache.org/jira/browse/RATIS-826
 Project: Ratis
  Issue Type: Improvement
  Components: server
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


The size of an AppendEntriesRequestProto message can be large since it contains 
user data.  In such cases, the leader has to send large append-entry messages 
to followers.  Large messages may increase memory load and processing time.  
It is more efficient to stream large messages.
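
A minimal sketch of the chunking side of such streaming (illustrative only; 
protobuf and gRPC framing are omitted): a large payload is split into bounded 
chunks so no single message needs to be buffered whole.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkingSketch {
  // Split a large payload into chunks of at most chunkSize bytes, so each
  // streamed message stays small and bounded in memory.
  static List<byte[]> toChunks(byte[] payload, int chunkSize) {
    final List<byte[]> chunks = new ArrayList<>();
    for (int offset = 0; offset < payload.length; offset += chunkSize) {
      final int end = Math.min(payload.length, offset + chunkSize);
      chunks.add(Arrays.copyOfRange(payload, offset, end));
    }
    return chunks;
  }
}
{code}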



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-686) Enforce standard getter and setter names in config keys

2020-03-09 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-686:
-
Attachment: r686_20200309.patch

> Enforce standard getter and setter names in config keys
> ---
>
> Key: RATIS-686
> URL: https://issues.apache.org/jira/browse/RATIS-686
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r686_20190919.patch, r686_20190919b.patch, 
> r686_20190919c.patch, r686_20191023.patch, r686_20200109.patch, 
> r686_20200218.patch, r686_20200309.patch
>
>
> In the conf keys, some getter/setter methods are missing.  Some of the method 
> names are not using the standard naming convention.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-824) [thirdparty] periodic dependency update

2020-03-06 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-824:
-
Fix Version/s: thirdparty-0.4.0

> [thirdparty] periodic dependency update
> ---
>
> Key: RATIS-824
> URL: https://issues.apache.org/jira/browse/RATIS-824
> Project: Ratis
>  Issue Type: Improvement
>  Components: thirdparty
>Affects Versions: thirdparty-0.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
> Fix For: thirdparty-0.4.0
>
> Attachments: RATIS-824.thirdparty.001.patch
>
>
> Ran an OWASP dependency check and a number of dependencies should be updated:
> guava: 24.1-jre --> 28.2-jre
> hadoop: 3.1.1 --> 3.1.3 (to update guava)
> netty: 4.1.38.Final --> 4.1.46.Final



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-824) [thirdparty] periodic dependency update

2020-03-05 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052504#comment-17052504
 ] 

Tsz-wo Sze commented on RATIS-824:
--

+1 patch looks good. 

> [thirdparty] periodic dependency update
> ---
>
> Key: RATIS-824
> URL: https://issues.apache.org/jira/browse/RATIS-824
> Project: Ratis
>  Issue Type: Improvement
>  Components: thirdparty
>Affects Versions: thirdparty-0.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
> Attachments: RATIS-824.thirdparty.001.patch
>
>
> Ran an OWASP dependency check and a number of dependencies should be updated:
> guava: 24.1-jre --> 28.2-jre
> hadoop: 3.1.1 --> 3.1.3 (to update guava)
> netty: 4.1.38.Final --> 4.1.46.Final



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-03-03 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050702#comment-17050702
 ] 

Tsz-wo Sze commented on RATIS-815:
--

Thanks [~ljain] for committing the patch.

I have just found a bug in flush(); filed RATIS-822.

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Fix For: 0.6.0
>
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch, r815_20200302.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is .
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is .
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> .
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-822) BufferedWriteChannel may not flush correctly

2020-03-03 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-822:
-
Attachment: r822_20200303.patch

> BufferedWriteChannel may not flush correctly
> 
>
> Key: RATIS-822
> URL: https://issues.apache.org/jira/browse/RATIS-822
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r822_20200303.patch
>
>
> Suppose the buffer in BufferedWriteChannel has capacity n and is initially 
> empty.  Calling write(..) with n-byte data will trigger flushInternal().  
> Then, the buffer becomes empty again.  If flush() is called, it won't trigger 
> FileChannel.force(..) since the buffer is empty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-822) BufferedWriteChannel may not flush correctly

2020-03-03 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-822:


 Summary: BufferedWriteChannel may not flush correctly
 Key: RATIS-822
 URL: https://issues.apache.org/jira/browse/RATIS-822
 Project: Ratis
  Issue Type: Bug
  Components: server
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


Suppose the buffer in BufferedWriteChannel has capacity n and is initially 
empty.  Calling write(..) with n-byte data will trigger flushInternal().  Then, 
the buffer becomes empty again.  If flush() is called, it won't trigger 
FileChannel.force(..) since the buffer is empty.
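
A sketch of the failure mode (illustrative field names, not the actual 
ratis-server code, and the fix shown is just one possibility): after a full 
write(..) empties the buffer via flushInternal(), a flush() that forces the 
channel only when the buffer is non-empty will skip force(); tracking 
written-but-unforced data is one way out.
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class BufferedWriteChannelSketch {
  private final FileChannel channel;
  private final ByteBuffer buffer;
  private boolean forced = true; // is everything written also forced to disk?

  BufferedWriteChannelSketch(FileChannel channel, int capacity) {
    this.channel = channel;
    this.buffer = ByteBuffer.allocate(capacity);
  }

  void write(byte[] data) throws IOException {
    buffer.put(data); // assumes data fits into the remaining buffer space
    if (!buffer.hasRemaining()) {
      flushInternal(); // the buffer becomes empty, but force() was not called
    }
  }

  private void flushInternal() throws IOException {
    buffer.flip();
    channel.write(buffer);
    buffer.clear();
    forced = false; // remember there is written-but-unforced data
  }

  void flush() throws IOException {
    if (buffer.position() > 0) {
      flushInternal();
    }
    // Guarding force() on a non-empty buffer alone would skip this call right
    // after a full write(); the forced flag covers that case.
    if (!forced) {
      channel.force(false);
      forced = true;
    }
  }
}
{code}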



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-814) High CPU usage by TimeoutScheduler due to JDK bug JDK-8129861

2020-03-02 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-814:
-
Component/s: common

> High CPU usage by TimeoutScheduler due to JDK bug JDK-8129861
> -
>
> Key: RATIS-814
> URL: https://issues.apache.org/jira/browse/RATIS-814
> Project: Ratis
>  Issue Type: Bug
>  Components: common
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: RATIS-814.001.patch, flamegraph.svg
>
>
> TimeoutScheduler creates an instance of ScheduledThreadPoolExecutor with 0 
> core pool threads. There is a bug in JDK 
> [https://bugs.openjdk.java.net/browse/JDK-8129861] which causes high CPU 
> usage if ScheduledThreadPoolExecutor is instantiated with 0 core pool 
> threads. The bug was fixed in Java 9.
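
A minimal sketch of the problematic pattern and a common workaround (keep at 
least one core thread and let it time out); the class below is only a demo, 
not the actual TimeoutScheduler code:
{code}
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CorePoolSizeDemo {
  public static void main(String[] args) {
    // Problematic on Java 8: zero core threads hit JDK-8129861, where an
    // idle worker can spin on the delayed queue and burn CPU.
    ScheduledThreadPoolExecutor buggy = new ScheduledThreadPoolExecutor(0);
    buggy.schedule(() -> { }, 100, TimeUnit.MILLISECONDS);
    buggy.shutdown();

    // Workaround: use one core thread and allow it to time out when idle.
    ScheduledThreadPoolExecutor fixed = new ScheduledThreadPoolExecutor(1);
    fixed.setKeepAliveTime(10, TimeUnit.SECONDS);
    fixed.allowCoreThreadTimeOut(true);
    fixed.schedule(() -> { }, 100, TimeUnit.MILLISECONDS);
    fixed.shutdown();
  }
}
{code}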



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-03-02 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049382#comment-17049382
 ] 

Tsz-wo Sze commented on RATIS-815:
--

r815_20200302.patch: addresses [~ljain]'s comments.

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch, r815_20200302.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-815) Log entry corrupted with 0 checksum

2020-03-02 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-815:
-
Component/s: server

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>  Components: server
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch, r815_20200302.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-815) Log entry corrupted with 0 checksum

2020-03-02 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-815:
-
Attachment: r815_20200302.patch

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch, r815_20200302.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-03-02 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049380#comment-17049380
 ] 

Tsz-wo Sze commented on RATIS-815:
--

> ... Regarding the latest change outstanding and remaining are very similar. ...

Will rename them.  Thanks.

> ... We do not do it during writes as well as during preallocate. ...

Write and preallocate do not change the file metadata, so they should not need 
to force the metadata.
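
For context, a minimal sketch of the java.nio distinction (standard JDK 
semantics, not Ratis code): force(false) flushes only the file content, while 
force(true) also flushes the file metadata.
{code}
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ForceDemo {
  public static void main(String[] args) throws Exception {
    try (FileChannel ch = FileChannel.open(Paths.get("demo.log"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap("entry".getBytes(StandardCharsets.UTF_8)));
      ch.force(false); // flush file content only
      ch.force(true);  // flush file content and metadata
    }
  }
}
{code}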

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-28 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-815:
-
Attachment: (was: r815_20200228.patch)

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-28 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-815:
-
Attachment: r815_20200228.patch

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-28 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-815:
-
Attachment: r815_20200228.patch

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-28 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048111#comment-17048111
 ] 

Tsz-wo Sze commented on RATIS-815:
--

r815_20200228.patch: addresses [~ljain]'s comments.

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch, r815_20200228.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-28 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048105#comment-17048105
 ] 

Tsz-wo Sze commented on RATIS-815:
--

> The log worker should never write over the max size. ...

Indeed, the worker may write over the max size according to 
SegmentedRaftLog.isSegmentFull(..).  Let me change the code.

{code}
  final long entrySize = LogSegment.getEntrySize(entry);
  // If the entry itself is larger than the max segment size, write it
  // directly into the current segment, i.e. do not report the segment as full.
  return entrySize <= segmentMaxSize &&
      segment.getTotalSize() + entrySize > segmentMaxSize;
{code}


> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (RATIS-813) Add streamAsync(..)

2020-02-25 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-813:
-
Attachment: r813_20200225.patch

> Add streamAsync(..)
> ---
>
> Key: RATIS-813
> URL: https://issues.apache.org/jira/browse/RATIS-813
> Project: Ratis
>  Issue Type: New Feature
>  Components: client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r813_20200219.patch, r813_20200225.patch
>
>
> This is a follow-up of RATIS-759.  Will add streamAsync(..) here.
> {code}
> /** Send the given message using a stream. */
> CompletableFuture<RaftClientReply> streamAsync(Message message);
> {code}
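
A hypothetical usage sketch of the proposed API; the client/builder wiring 
below is an assumption for illustration, not taken from the patch:
{code}
import org.apache.ratis.client.RaftClient;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftGroup;

// Hypothetical caller of the proposed streamAsync(..).
public final class StreamAsyncExample {
  static void send(RaftGroup group) throws Exception {
    try (RaftClient client = RaftClient.newBuilder()
        .setRaftGroup(group)
        .build()) {
      Message message = Message.valueOf("large-payload");
      client.streamAsync(message)
          .thenAccept(reply -> System.out.println("success? " + reply.isSuccess()))
          .join();
    }
  }
}
{code}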



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-813) Add streamAsync(..)

2020-02-25 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045051#comment-17045051
 ] 

Tsz-wo Sze commented on RATIS-813:
--

r813_20200225.patch: fixes checkstyle warnings

> Add streamAsync(..)
> ---
>
> Key: RATIS-813
> URL: https://issues.apache.org/jira/browse/RATIS-813
> Project: Ratis
>  Issue Type: New Feature
>  Components: client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r813_20200219.patch, r813_20200225.patch
>
>
> This is a follow-up of RATIS-759.  Will add streamAsync(..) here.
> {code}
> /** Send the given message using a stream. */
> CompletableFuture<RaftClientReply> streamAsync(Message message);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-21 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042202#comment-17042202
 ] 

Tsz-wo Sze commented on RATIS-815:
--

[~adoroszlai], thanks a lot for testing it.

[~ljain], could you review the patch?

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041422#comment-17041422
 ] 

Tsz-wo Sze commented on RATIS-815:
--

r815_20200220.patch: fixes some bugs and simplifies code.

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-815:
-
Attachment: r815_20200220.patch

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz, 
> r815_20200220.patch
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041244#comment-17041244
 ] 

Tsz-wo Sze commented on RATIS-815:
--

There is a bug in the code when the log entry size is greater than the 
preallocate size.  Also, SegmentedRaftLogOutputStream accesses the FileChannel 
in two different ways.  Let's take this chance to clean up the code.

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz
>
>
> After writing a few large keys (128MB) with a very small chunk size (64KB) in 
> Ozone, Ratis reports log entry corruption due to a checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  Rolled log segment from 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379
>  to 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker:
>  created new log segment 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - 
> e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon
>  failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: 
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated 
> checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
>   at 
> org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
>   at 
> org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
>   at 
> org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: 
> Calculated checksum is CDFED097 but read checksum is 00000000.
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>   at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
>   at 
> org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
>   ... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
> {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
> {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has an invalid entry (according to the 
> log dump utility):
> {code}
> Processing Raft Log file: 
> /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509
>  size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log 
> entry corrupted: Calculated checksum is 926127AE but read checksum is 
> 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-815:


Assignee: Tsz-wo Sze

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Tsz-wo Sze
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041239#comment-17041239
 ] 

Tsz-wo Sze commented on RATIS-815:
--

I see.  Could you also test setting the preallocate size to the same as the 
segment size?
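
For reference, a hedged sketch of that test configuration; the setter names are 
assumed from RaftServerConfigKeys.Log and should be verified against the Ratis 
version in use.
{code}
// Hedged sketch of the suggested test configuration: segment size and
// preallocate size both set to 4MB.  Setter names are assumptions.
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;
import org.apache.ratis.util.SizeInBytes;

final class PreallocateConfigSketch {
  static RaftProperties newProperties() {
    final RaftProperties properties = new RaftProperties();
    RaftServerConfigKeys.Log.setSegmentSizeMax(properties, SizeInBytes.valueOf("4MB"));
    RaftServerConfigKeys.Log.setPreallocatedSize(properties, SizeInBytes.valueOf("4MB"));
    return properties;
  }
}
{code}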

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-819) testBasicInstallSnapshot is failing with metric

2020-02-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-819:
-
Description: 
{code}
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.ratis.statemachine.RaftSnapshotBaseTest.verifyTakeSnapshotMetric(RaftSnapshotBaseTest.java:257)
at 
org.apache.ratis.statemachine.RaftSnapshotBaseTest.testBasicInstallSnapshot(RaftSnapshotBaseTest.java:236)
...
{code}
It can be reproduced by running TestRaftSnapshotWithSimulatedRpc a few times.

  was:
{code}
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.ratis.statemachine.RaftSnapshotBaseTest.verifyTakeSnapshotMetric(RaftSnapshotBaseTest.java:257)
at 
org.apache.ratis.statemachine.RaftSnapshotBaseTest.testBasicInstallSnapshot(RaftSnapshotBaseTest.java:236)
...
{code}


> testBasicInstallSnapshot is failing with metric
> ---
>
> Key: RATIS-819
> URL: https://issues.apache.org/jira/browse/RATIS-819
> Project: Ratis
>  Issue Type: Bug
>  Components: test
>Reporter: Tsz-wo Sze
>Assignee: Aravindan Vijayan
>Priority: Minor
>
> {code}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.verifyTakeSnapshotMetric(RaftSnapshotBaseTest.java:257)
>   at 
> org.apache.ratis.statemachine.RaftSnapshotBaseTest.testBasicInstallSnapshot(RaftSnapshotBaseTest.java:236)
>   ...
> {code}
> It can be reproduced by running TestRaftSnapshotWithSimulatedRpc a few times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-819) testBasicInstallSnapshot is failing with metric

2020-02-20 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-819:


 Summary: testBasicInstallSnapshot is failing with metric
 Key: RATIS-819
 URL: https://issues.apache.org/jira/browse/RATIS-819
 Project: Ratis
  Issue Type: Bug
  Components: test
Reporter: Tsz-wo Sze
Assignee: Aravindan Vijayan


{code}
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.ratis.statemachine.RaftSnapshotBaseTest.verifyTakeSnapshotMetric(RaftSnapshotBaseTest.java:257)
at 
org.apache.ratis.statemachine.RaftSnapshotBaseTest.testBasicInstallSnapshot(RaftSnapshotBaseTest.java:236)
...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-20 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041216#comment-17041216
 ] 

Tsz-wo Sze commented on RATIS-815:
--

Thanks a lot for the update.

> 1. Adding a sync in preallocate method does not help with corruption. 

Do you mean that "fc.force(false)" does not work?

> 2. Avoided preallocate during log entry write in the log segment. ...

This confirms the race condition between preallocate() and write().

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: RATIS-815.temp.patch, dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-19 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040462#comment-17040462
 ] 

Tsz-wo Sze commented on RATIS-815:
--

{quote}
Other operations, in particular those that take an explicit position, may 
proceed concurrently; whether they in fact do so is dependent upon the 
underlying implementation and is therefore unspecified. 
{quote}
According to the FileChannel javadoc quoted above 
(https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html), 
the race condition between preallocate() and write() does seem to be happening.


> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-19 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040452#comment-17040452
 ] 

Tsz-wo Sze commented on RATIS-815:
--

In Ozone, we have a log segment size of 4MB but a preallocate size of only 1MB.
- We should set the preallocate size to 4MB as well.

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-813) Add streamAsync(..)

2020-02-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-813:
-
Component/s: (was: server)

> Add streamAsync(..)
> ---
>
> Key: RATIS-813
> URL: https://issues.apache.org/jira/browse/RATIS-813
> Project: Ratis
>  Issue Type: New Feature
>  Components: client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r813_20200219.patch
>
>
> This is a followup of RATIS-759.  Will add streamAsync(..) here.
> {code}
>  /** Send the given message using a stream. */
>   CompletableFuture streamAsync(Message message);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-813) Add streamAsync(..)

2020-02-19 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-813:
-
Attachment: r813_20200219.patch

> Add streamAsync(..)
> ---
>
> Key: RATIS-813
> URL: https://issues.apache.org/jira/browse/RATIS-813
> Project: Ratis
>  Issue Type: New Feature
>  Components: client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r813_20200219.patch
>
>
> This is a followup of RATIS-759.  Will add streamAsync(..) here.
> {code}
>  /** Send the given message using a stream. */
>   CompletableFuture streamAsync(Message message);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-19 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040415#comment-17040415
 ] 

Tsz-wo Sze commented on RATIS-815:
--

Discussed with [~ljain]; it looks like there is a race condition between 
preallocate() and write().  We could test the following (see the sketch after 
this list):
# Add
{code}
  fc.force(false);
{code}
to preallocate() to see if it can fix the bug.
# If the above does not work, run the test with preallocate() removed or commented out.
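
A minimal sketch of where the suggested call would sit, assuming a simplified 
preallocate() (not the actual Ratis method):
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class PreallocateSketch {
  // Simplified stand-in for the preallocation step; names are assumptions.
  static void preallocate(FileChannel fc, long position, int size) throws IOException {
    fc.write(ByteBuffer.allocate(size), position);  // positional write of zeros
    fc.force(false);  // flush file content (not metadata) before entry writes proceed
  }
}
{code}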

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-686) Enforce standard getter and setter names in config keys

2020-02-18 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039349#comment-17039349
 ] 

Tsz-wo Sze commented on RATIS-686:
--

r686_20200218.patch: some more changes.

> Enforce standard getter and setter names in config keys
> ---
>
> Key: RATIS-686
> URL: https://issues.apache.org/jira/browse/RATIS-686
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r686_20190919.patch, r686_20190919b.patch, 
> r686_20190919c.patch, r686_20191023.patch, r686_20200109.patch, 
> r686_20200218.patch
>
>
> In the conf keys, some getter/setter methods are missing.  Some of the method 
> names are not using the standard naming convention.
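
For illustration, a hedged sketch of the convention being enforced; the key name 
is an example only, and a plain Map stands in for the real RaftProperties.
{code}
import java.util.Map;

// Example only; not an actual Ratis conf key.
final class ExampleConfigKeys {
  static final String PREALLOCATED_SIZE_KEY = "example.log.preallocated.size";
  static final long PREALLOCATED_SIZE_DEFAULT = 4L << 20;  // 4MB

  // Getter named after the key (no "get" prefix), per the conf key convention.
  static long preallocatedSize(Map<String, Long> properties) {
    return properties.getOrDefault(PREALLOCATED_SIZE_KEY, PREALLOCATED_SIZE_DEFAULT);
  }

  // Setter with the standard "set" prefix.
  static void setPreallocatedSize(Map<String, Long> properties, long size) {
    properties.put(PREALLOCATED_SIZE_KEY, size);
  }
}
{code}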



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-686) Enforce standard getter and setter names in config keys

2020-02-18 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-686:
-
Attachment: r686_20200218.patch

> Enforce standard getter and setter names in config keys
> ---
>
> Key: RATIS-686
> URL: https://issues.apache.org/jira/browse/RATIS-686
> Project: Ratis
>  Issue Type: Improvement
>  Components: conf
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r686_20190919.patch, r686_20190919b.patch, 
> r686_20190919c.patch, r686_20191023.patch, r686_20200109.patch, 
> r686_20200218.patch
>
>
> In the conf keys, some getter/setter methods are missing.  Some of the method 
> names are not using the standard naming convention.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-18 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039272#comment-17039272
 ] 

Tsz-wo Sze commented on RATIS-815:
--

[~adoroszlai], thanks for testing it with RATIS-767 reverted.

I wonder if the test kills any datanodes.  If it does, it is normal for the log 
to be corrupted at the end, since a datanode could be killed while writing the 
log.

> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-815) Log entry corrupted with 0 checksum

2020-02-14 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037385#comment-17037385
 ] 

Tsz-wo Sze commented on RATIS-815:
--

[~adoroszlai], thanks for reporting the bug.  Is it easy to reproduce?

> Corruption related to change in RATIS-767. ...

[~ljain], thanks for checking it.  Let's try testing it with RATIS-767 reverted.



> Log entry corrupted with 0 checksum
> ---
>
> Key: RATIS-815
> URL: https://issues.apache.org/jira/browse/RATIS-815
> Project: Ratis
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Priority: Blocker
> Attachments: dumps.tar.gz, logs.tar.gz
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-813) Add streamAsync(..)

2020-02-07 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-813:
-
Component/s: server
 client

> Add streamAsync(..)
> ---
>
> Key: RATIS-813
> URL: https://issues.apache.org/jira/browse/RATIS-813
> Project: Ratis
>  Issue Type: New Feature
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>
> This is a followup of RATIS-759.  Will add streamAsync(..) here.
> {code}
>  /** Send the given message using a stream. */
>   CompletableFuture streamAsync(Message message);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (RATIS-759) Support stream APIs to send large messages

2020-02-07 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032710#comment-17032710
 ] 

Tsz-wo Sze edited comment on RATIS-759 at 2/7/20 11:17 PM:
---

Filed RATIS-813 to add to the other method.


was (Author: szetszwo):
Filed RATIS-813 to add to other method.

> Support stream APIs to send large messages
> --
>
> Key: RATIS-759
> URL: https://issues.apache.org/jira/browse/RATIS-759
> Project: Ratis
>  Issue Type: New Feature
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: r759_20200115.patch, r759_20200123.patch, 
> r759_20200204.patch, r759_20200206.patch
>
>
> It is inefficient to send a large message using 
> send(Message)/sendAsync(Message) in RaftClient.  We already have 
> RaftOutputStream implemented with sendAsync(..).  We propose adding the 
> following new APIs
> {code}
>   /** Create a stream to send a large message. */
>   MessageOutputStream stream();
>   /** Send the given message using a stream. */
>   CompletableFuture streamAsync(Message message);
> {code} 
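
As a usage illustration, a hedged sketch of how a client might call the proposed 
method; the builder calls follow the existing RaftClient API, while the group, 
message, and reply type are assumptions.
{code}
import java.util.concurrent.CompletableFuture;
import org.apache.ratis.client.RaftClient;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;
import org.apache.ratis.protocol.RaftGroup;

final class StreamAsyncSketch {
  // 'group' and 'message' are assumed to be built elsewhere.
  static CompletableFuture<RaftClientReply> send(RaftGroup group, Message message) {
    final RaftClient client = RaftClient.newBuilder()
        .setProperties(new RaftProperties())
        .setRaftGroup(group)
        .build();
    // Send one large message through a stream instead of a single sendAsync(..).
    return client.streamAsync(message);
  }
}
{code}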



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-759) Support stream APIs to send large messages

2020-02-07 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032710#comment-17032710
 ] 

Tsz-wo Sze commented on RATIS-759:
--

Filed RATIS-813 to add to other method.

> Support stream APIs to send large messages
> --
>
> Key: RATIS-759
> URL: https://issues.apache.org/jira/browse/RATIS-759
> Project: Ratis
>  Issue Type: New Feature
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: r759_20200115.patch, r759_20200123.patch, 
> r759_20200204.patch, r759_20200206.patch
>
>
> It is inefficient to send a large message using 
> send(Message)/sendAsync(Message) in RaftClient.  We already have 
> RaftOutputStream implemented with sendAsync(..).  We propose adding the 
> following new APIs
> {code}
>   /** Create a stream to send a large message. */
>   MessageOutputStream stream();
>   /** Send the given message using a stream. */
>   CompletableFuture streamAsync(Message message);
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (RATIS-813) Add streamAsync(..)

2020-02-07 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created RATIS-813:


 Summary: Add streamAsync(..)
 Key: RATIS-813
 URL: https://issues.apache.org/jira/browse/RATIS-813
 Project: Ratis
  Issue Type: New Feature
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


This is a followup of RATIS-759.  Will add streamAsync(..) here.
{code}
 /** Send the given message using a stream. */
  CompletableFuture streamAsync(Message message);
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RATIS-759) Support stream APIs to send large messages

2020-02-06 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/RATIS-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032074#comment-17032074
 ] 

Tsz-wo Sze commented on RATIS-759:
--

[~shashikant], thanks for reviewing the new patch.

r759_20200206: fixes checkstyle warnings.

> Support stream APIs to send large messages
> --
>
> Key: RATIS-759
> URL: https://issues.apache.org/jira/browse/RATIS-759
> Project: Ratis
>  Issue Type: New Feature
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r759_20200115.patch, r759_20200123.patch, 
> r759_20200204.patch, r759_20200206.patch
>
>
> It is inefficient to send a large message using 
> send(Message)/sendAsync(Message) in RaftClient.  We already have 
> RaftOutputStream implemented with sendAsync(..).  We propose adding the 
> following new APIs
> {code}
>   /** Create a stream to send a large message. */
>   MessageOutputStream stream();
>   /** Send the given message using a stream. */
>   CompletableFuture streamAsync(Message message);
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (RATIS-759) Support stream APIs to send large messages

2020-02-06 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/RATIS-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze updated RATIS-759:
-
Attachment: r759_20200206.patch

> Support stream APIs to send large messages
> --
>
> Key: RATIS-759
> URL: https://issues.apache.org/jira/browse/RATIS-759
> Project: Ratis
>  Issue Type: New Feature
>  Components: client, server
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: r759_20200115.patch, r759_20200123.patch, 
> r759_20200204.patch, r759_20200206.patch
>
>
> It is inefficient to send a large message using 
> send(Message)/sendAsync(Message) in RaftClient.  We already have 
> RaftOutputStream implemented with sendAsync(..).  We propose adding the 
> following new APIs
> {code}
>   /** Create a stream to send a large message. */
>   MessageOutputStream stream();
>   /** Send the given message using a stream. */
>   CompletableFuture streamAsync(Message message);
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

