[jira] [Commented] (RATIS-924) rename raft group dir on disk when remove group is invoked
[ https://issues.apache.org/jira/browse/RATIS-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157133#comment-17157133 ] Shashikant Banerjee commented on RATIS-924: --- [~cyrusjackson25], can you open a PR for the same? > rename raft group dir on disk when remove group is invoked > -- > > Key: RATIS-924 > URL: https://issues.apache.org/jira/browse/RATIS-924 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-924.001.patch, screenshot-1.png > > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-833) Add metrics for raft log cache count and size in bytes
[ https://issues.apache.org/jira/browse/RATIS-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-833. --- Resolution: Fixed > Add metrics for raft log cache count and size in bytes > -- > > Key: RATIS-833 > URL: https://issues.apache.org/jira/browse/RATIS-833 > Project: Ratis > Issue Type: Sub-task > Components: server >Reporter: Shashikant Banerjee >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: RATIS-833.001.patch > > Time Spent: 50m > Remaining Estimate: 0h
[jira] [Resolved] (RATIS-987) Fix Infinite install snapshot
[ https://issues.apache.org/jira/browse/RATIS-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-987. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix Infinite install snapshot > - > > Key: RATIS-987 > URL: https://issues.apache.org/jira/browse/RATIS-987 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Critical > Fix For: 0.6.0 > > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > This happens in Ozone production. > 1. The leader endlessly notifies the follower to install snapshot-(t:3, i:999697) > !screenshot-1.png! > 2. The follower endlessly installs the snapshot > !screenshot-2.png!
[jira] [Resolved] (RATIS-903) Fix Failed UT: RaftSnapshotBaseTest.testBasicInstallSnapshot
[ https://issues.apache.org/jira/browse/RATIS-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-903. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix Failed UT: RaftSnapshotBaseTest.testBasicInstallSnapshot > > > Key: RATIS-903 > URL: https://issues.apache.org/jira/browse/RATIS-903 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 1h 20m > Remaining Estimate: 0h > > !screenshot-1.png! > !screenshot-2.png!
[jira] [Resolved] (RATIS-982) Fix RaftServerImpl illegal transition from RUNNING to RUNNING
[ https://issues.apache.org/jira/browse/RATIS-982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-982. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix RaftServerImpl illegal transition from RUNNING to RUNNING > - > > Key: RATIS-982 > URL: https://issues.apache.org/jira/browse/RATIS-982 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This happens in tests, but it may also happen in production. > For example, the leader is s3 and the follower is s4. > 1. Kill s4, then restart it. > {code:java} > 2020-06-19 07:03:18,095 [Thread-6194] INFO ratis.MiniRaftCluster > (MiniRaftCluster.java:killServer(458)) - killServer s4 > 2020-06-19 07:03:18,095 [Thread-6194] INFO ratis.MiniRaftCluster > (MiniRaftCluster.java:newRaftServer(330)) - newRaftServer: s4, > group-5BD7E8A01610:[s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, > s1:0.0.0.0:33783, s2:0.0.0.0:40473], format? false > {code} > 2. s4 starts and sets its configuration from storage at > [setRaftConf(raftConf.getLogEntryIndex(), raftConf) > |https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L170] > and s4 then changes to RUNNING at > [lifeCycle.transition(RUNNING)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L213] > {code:java} > 2020-06-19 07:03:18,127 [pool-16-thread-1] INFO impl.RaftServerImpl > (ServerState.java:setRaftConf(356)) - s4@group-5BD7E8A01610: set > configuration 0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, > s1:0.0.0.0:33783, s2:0.0.0.0:40473], old=null at 0 > 2020-06-19 07:03:18,153 [Thread-6194] INFO impl.RaftServerImpl > (RaftServerImpl.java:start(185)) - s4@group-5BD7E8A01610: start as a > follower, conf=0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, > s1:0.0.0.0:33783, s2:0.0.0.0:40473], old=null > 2020-06-19 07:03:18,153 [Thread-6194] INFO impl.RaftServerImpl > (RaftServerImpl.java:setRole(174)) - s4@group-5BD7E8A01610: changes role from > null to FOLLOWER at term 1 for startAsFollower > {code} > 3. s3 sends an appendEntries request to s4, and s4 changes to RUNNING at > [lifeCycle.compareAndTransition(STARTING, > RUNNING)|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1003] > {code:java} > 2020-06-19 07:03:18,162 [nioEventLoopGroup-59-1] DEBUG impl.RaftServerImpl > (RaftServerImpl.java:logAppendEntries(918)) - s4@group-5BD7E8A01610: receive > appendEntries(s3, 1, (t:1, i:0), 0, false, commits[s3:c0, s4:c0, s0:c0, > s1:c0, s2:c0], entries: (t:1, i:1), STATEMACHINELOGENTRY, > client-9414EC4E73DA, cid=3000 > {code} > 4. If the change to RUNNING in step 3 happens before the one in step 2, then step 2 throws an > exception. 
> {code:java} > 2020-06-19 07:03:18,169 [Thread-6194] INFO impl.RoleInfo > (RoleInfo.java:updateAndGet(143)) - s4: start FollowerState > 2020-06-19 07:03:18,174 [Thread-6194] ERROR netty.TestRaftWithNetty > (ExitUtils.java:terminate(133)) - Terminating with exit status -1: Failed to > kill/restart server: s4 > 2020-06-19T07:03:18.1918474Z java.lang.IllegalStateException: ILLEGAL > TRANSITION: In s4, RUNNING -> RUNNING > 2020-06-19T07:03:18.1918899Z at > org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:63) > 2020-06-19T07:03:18.1919240Z at > org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:115) > 2020-06-19T07:03:18.1919558Z at > org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:155) > 2020-06-19T07:03:18.1919878Z at > org.apache.ratis.server.impl.RaftServerImpl.startAsFollower(RaftServerImpl.java:214) > 2020-06-19T07:03:18.1920206Z at > org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:186) > 2020-06-19T07:03:18.1920520Z at > java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) > 2020-06-19T07:03:18.1920839Z at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > 2020-06-19T07:03:18.1921330Z at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > 2020-06-19T07:03:18.1921639Z at > java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290) > 2020-06-19T07:03:18.1921951Z at > java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) > 2020-06-19T07:03:18.1922261Z at > java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) > 2020-06-19T07:03:18.1922575Z at > java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) > 2020-06-19T07:03:18.1922885Z at >
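The race described in steps 2 to 4 of RATIS-982 can be sketched in isolation. This is a minimal sketch under stated assumptions, not the Ratis `LifeCycle` class and not the committed fix: `LifeCycleSketch`, its `State` enum and both methods are hypothetical and only borrow the names `transition` and `compareAndTransition` quoted in the description. It shows why an unconditional transition fails once another thread has already moved the state to RUNNING, while a conditional one simply reports that it lost the race.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified life cycle; NOT the org.apache.ratis.util.LifeCycle class. */
final class LifeCycleSketch {
  enum State { NEW, STARTING, RUNNING }

  private final AtomicReference<State> current = new AtomicReference<>(State.NEW);

  State get() {
    return current.get();
  }

  /** Unconditional transition: throws if the state already equals the target. */
  void transition(State to) {
    current.updateAndGet(from -> {
      if (from == to) {
        throw new IllegalStateException("ILLEGAL TRANSITION: " + from + " -> " + to);
      }
      return to;
    });
  }

  /** Conditional transition: returns false, without throwing, if the state is not the expected one. */
  boolean compareAndTransition(State expected, State to) {
    return current.compareAndSet(expected, to);
  }
}
```

In this sketch, if the request-handling thread calls `compareAndTransition(STARTING, RUNNING)` first, a later `transition(RUNNING)` from the start-up thread throws, whereas a second `compareAndTransition(STARTING, RUNNING)` would just return `false`.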
[jira] [Resolved] (RATIS-983) Check follower state before ask for votes
[ https://issues.apache.org/jira/browse/RATIS-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-983. --- Fix Version/s: 0.6.0 Resolution: Fixed > Check follower state before ask for votes > - > > Key: RATIS-983 > URL: https://issues.apache.org/jira/browse/RATIS-983 > Project: Ratis > Issue Type: Improvement >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Time Spent: 40m > Remaining Estimate: 0h > > 1. There are three servers, s0, s1 and s2, and all of them start leader election, but s2 has not > yet started askForVotes. > {code:java} > 2020-06-21 03:46:27,958 [Thread-7] INFO impl.RoleInfo > (RoleInfo.java:updateAndGet(143)) - s0: start LeaderElection > 2020-06-21 03:46:27,963 [s0@group-D88B65C78887-LeaderElection1] INFO > impl.LeaderElection (LeaderElection.java:askForVotes(206)) - > s0@group-D88B65C78887-LeaderElection1: begin an election at term 1 for -1: > [s0:0.0.0.0:40443, s1:0.0.0.0:46669, s2:0.0.0.0:41589], old=null > {code} > {code:java} > 2020-06-21 03:46:27,990 [Thread-8] INFO impl.RoleInfo > (RoleInfo.java:updateAndGet(143)) - s1: start LeaderElection > 2020-06-21 03:46:27,998 [s1@group-D88B65C78887-LeaderElection2] INFO > impl.LeaderElection (LeaderElection.java:askForVotes(206)) - > s1@group-D88B65C78887-LeaderElection2: begin an election at term 1 for -1: > [s0:0.0.0.0:40443, s1:0.0.0.0:46669, s2:0.0.0.0:41589], old=null > {code} > {code:java} > 2020-06-21 03:46:28,064 [Thread-9] INFO impl.RoleInfo > (RoleInfo.java:updateAndGet(143)) - s2: start LeaderElection > {code} > 2. s0 was elected as leader > {code:java} > 2020-06-21 03:46:28,093 [s0@group-D88B65C78887-LeaderElection1] INFO > impl.LeaderElection (LeaderElection.java:logAndReturn(61)) - > s0@group-D88B65C78887-LeaderElection1: Election PASSED; received 2 > response(s) [s0<-s1#0:FAIL-t1, s0<-s2#0:OK-t1] and 0 exception(s); > s0@group-D88B65C78887:t1, leader=null, voted=s0, > raftlog=s0@group-D88B65C78887-SegmentedRaftLog:OPENED:c-1,f-1,i0, conf=-1: > [s0:0.0.0.0:40443, s1:0.0.0.0:46669, s2:0.0.0.0:41589], old=null > 2020-06-21 03:46:28,093 [s0@group-D88B65C78887-LeaderElection1] INFO > impl.RoleInfo (RoleInfo.java:shutdownLeaderElection(134)) - s0: shutdown > LeaderElection > 2020-06-21T03:46:28.0975768Z 2020-06-21 03:46:28,094 > [s0@group-D88B65C78887-LeaderElection1] INFO impl.RaftServerImpl > (RaftServerImpl.java:setRole(174)) - s0@group-D88B65C78887: changes role from > CANDIDATE to LEADER at term 1 for changeToLeader > 2020-06-21 03:46:28,094 [s0@group-D88B65C78887-LeaderElection1] INFO > impl.RaftServerImpl (ServerState.java:setLeader(255)) - > s0@group-D88B65C78887: change Leader from null to s0 at term 1 for > becomeLeader, leader elected after 474ms > {code} > 3. s2 starts askForVotes, which had not started in step 1. Then a new leader > election happens. 
> {code:java} > 2020-06-21 03:46:28,096 [s2@group-D88B65C78887-LeaderElection3] INFO > impl.LeaderElection (LeaderElection.java:askForVotes(206)) - > s2@group-D88B65C78887-LeaderElection3: begin an election at term 2 for -1: > [s0:0.0.0.0:40443, s1:0.0.0.0:46669, s2:0.0.0.0:41589], old=null > {code} > The full log is as follows: > {code:java} > 2020-06-21T03:46:27.9598769Z 2020-06-21 03:46:27,958 [Thread-7] INFO > impl.RoleInfo (RoleInfo.java:updateAndGet(143)) - s0: start LeaderElection > 2020-06-21T03:46:27.9637021Z 2020-06-21 03:46:27,963 > [s0@group-D88B65C78887-LeaderElection1] INFO impl.LeaderElection > (LeaderElection.java:askForVotes(206)) - > s0@group-D88B65C78887-LeaderElection1: begin an election at term 1 for -1: > [s0:0.0.0.0:40443, s1:0.0.0.0:46669, s2:0.0.0.0:41589], old=null > 2020-06-21T03:46:27.9912697Z 2020-06-21 03:46:27,990 [Thread-8] INFO > impl.FollowerState (FollowerState.java:run(108)) - > s1@group-D88B65C78887-FollowerState: change to CANDIDATE, lastRpcTime:244ms, > electionTimeout:243ms > 2020-06-21T03:46:27.9918514Z 2020-06-21 03:46:27,990 [Thread-8] INFO > impl.RoleInfo (RoleInfo.java:shutdownFollowerState(121)) - s1: shutdown > FollowerState > 2020-06-21T03:46:27.9919033Z 2020-06-21 03:46:27,990 [Thread-8] INFO > impl.RaftServerImpl (RaftServerImpl.java:setRole(174)) - > s1@group-D88B65C78887: changes role from FOLLOWER to CANDIDATE at term 0 for > changeToCandidate > 2020-06-21T03:46:27.9920005Z 2020-06-21 03:46:27,990 [Thread-8] INFO > impl.RoleInfo (RoleInfo.java:updateAndGet(143)) - s1: start LeaderElection > 2020-06-21T03:46:27.9994968Z 2020-06-21 03:46:27,998 > [s1@group-D88B65C78887-LeaderElection2] INFO impl.LeaderElection > (LeaderElection.java:askForVotes(206)) - > s1@group-D88B65C78887-LeaderElection2: begin an election at term 1 for -1: > [s0:0.0.0.0:40443, s1:0.0.0.0:46669, s2:0.0.0.0:41589], old=null >
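The check named in the RATIS-983 summary can be illustrated with a small sketch. It is an illustration under stated assumptions, not the Ratis `LeaderElection` code: `ElectionSketch`, its `Role` enum, the `SKIPPED` sentinel and all method signatures are hypothetical. The idea is that a server which has already changed to follower (as s2 did when it granted its vote to s0 at term 1) should not go on to begin a new election at a higher term.

```java
/** Hypothetical, simplified election state; NOT the Ratis LeaderElection class. */
final class ElectionSketch {
  enum Role { FOLLOWER, CANDIDATE }

  /** Returned by askForVotes() when the election is skipped. */
  static final long SKIPPED = -1L;

  private Role role = Role.CANDIDATE;
  private long currentTerm;

  ElectionSketch(long initialTerm) {
    this.currentTerm = initialTerm;
  }

  /** Called when the server learns of a leader or grants its vote in the given term. */
  synchronized void changeToFollower(long term) {
    role = Role.FOLLOWER;
    currentTerm = Math.max(currentTerm, term);
  }

  /** Begins an election only if this server is still a candidate; returns the new term or SKIPPED. */
  synchronized long askForVotes() {
    if (role != Role.CANDIDATE) {
      return SKIPPED; // already a follower: do not disturb the elected leader
    }
    return ++currentTerm;
  }

  synchronized long getCurrentTerm() {
    return currentTerm;
  }
}
```

With this guard, the delayed `askForVotes` of s2 in step 3 would return `SKIPPED` and leave the term at 1 instead of beginning an election at term 2.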
[jira] [Updated] (RATIS-983) Check follower state before ask for votes
[ https://issues.apache.org/jira/browse/RATIS-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-983: -- Summary: Check follower state before ask for votes (was: Check follower status before ask for votes) > Check follower state before ask for votes > - > > Key: RATIS-983 > URL: https://issues.apache.org/jira/browse/RATIS-983 > Project: Ratis > Issue Type: Improvement >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h
[jira] [Updated] (RATIS-983) Check follower status before ask for votes
[ https://issues.apache.org/jira/browse/RATIS-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-983: -- Summary: Check follower status before ask for votes (was: Check changed to follower before ask for votes) > Check follower status before ask for votes > -- > > Key: RATIS-983 > URL: https://issues.apache.org/jira/browse/RATIS-983 > Project: Ratis > Issue Type: Improvement >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h
[jira] [Resolved] (RATIS-895) Fix Failed UT: runTestRetryOnStateMachineException
[ https://issues.apache.org/jira/browse/RATIS-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-895. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix Failed UT: runTestRetryOnStateMachineException > -- > > Key: RATIS-895 > URL: https://issues.apache.org/jira/browse/RATIS-895 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 40m > Remaining Estimate: 0h > > !screenshot-1.png! > !screenshot-2.png!
[jira] [Resolved] (RATIS-958) Support multiple requests in a single MessageOutputStream
[ https://issues.apache.org/jira/browse/RATIS-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-958. --- Fix Version/s: 0.6.0 Resolution: Fixed > Support multiple requests in a single MessageOutputStream > - > > Key: RATIS-958 > URL: https://issues.apache.org/jira/browse/RATIS-958 > Project: Ratis > Issue Type: Improvement > Components: client, server >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Fix For: 0.6.0 > > Attachments: r958_20200617.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, MessageOutputStream supports only one request per stream. In this > JIRA, we will change it to support multiple requests.
[jira] [Resolved] (RATIS-904) Failed UT: testFileStoreAsync
[ https://issues.apache.org/jira/browse/RATIS-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-904. --- Fix Version/s: 0.6.0 Resolution: Fixed Thanks [~yjxxtd] for working on this. I have committed this. > Failed UT: testFileStoreAsync > - > > Key: RATIS-904 > URL: https://issues.apache.org/jira/browse/RATIS-904 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > !screenshot-1.png!
[jira] [Resolved] (RATIS-975) Fix failed UT: testRaftLogMetrics
[ https://issues.apache.org/jira/browse/RATIS-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-975. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix failed UT: testRaftLogMetrics > - > > Key: RATIS-975 > URL: https://issues.apache.org/jira/browse/RATIS-975 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 40m > Remaining Estimate: 0h > > !screenshot-1.png!
[jira] [Resolved] (RATIS-960) Add APIs to support streaming state machine data
[ https://issues.apache.org/jira/browse/RATIS-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-960. --- Fix Version/s: 0.6.0 Resolution: Fixed Thanks [~szetszwo] for the contribution. I have committed this. > Add APIs to support streaming state machine data > > > Key: RATIS-960 > URL: https://issues.apache.org/jira/browse/RATIS-960 > Project: Ratis > Issue Type: New Feature > Components: StateMachine >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Fix For: 0.6.0 > > Attachments: r960_20200529.patch, r960_20200603.patch, > r960_20200610.patch > > Time Spent: 20m > Remaining Estimate: 0h > > {code} > //StateMachine > CompletableFuture writeStateMachineData(LogEntryProto entry) > {code} > In StateMachine, we have writeStateMachineData to write the state machine > data in the given log entry. It is inefficient to process state machine data > in a log entry when the data size is large. > In this JIRA, we add new APIs to support streaming state machine data.
[jira] [Resolved] (RATIS-921) Fix resource leak by closing thousands of gRPC clients after use
[ https://issues.apache.org/jira/browse/RATIS-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-921. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix resource leak by closing thousands of gRPC clients after use > > > Key: RATIS-921 > URL: https://issues.apache.org/jira/browse/RATIS-921 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 1h 20m > Remaining Estimate: 0h > > !screenshot-1.png!
[jira] [Commented] (RATIS-967) Provide an api to transition leader state from a member to another one
[ https://issues.apache.org/jira/browse/RATIS-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133021#comment-17133021 ] Shashikant Banerjee commented on RATIS-967: --- Thanks [~maobaolong] for filing. Can you please explain the idea in some detail here? I think we can build something to recommend that one datanode become the leader, but it would be difficult to force a datanode to become the leader of a raft group. > Provide an api to transition leader state from a member to another one > -- > > Key: RATIS-967 > URL: https://issues.apache.org/jira/browse/RATIS-967 > Project: Ratis > Issue Type: New Feature > Components: raft-group >Affects Versions: 0.5.0 >Reporter: maobaolong >Priority: Major > > With this API, we can transfer the leader state to a specified member for datanodes > in the same pipeline, OM group and SCM group.
[jira] [Resolved] (RATIS-966) Add metric for different types of log entries for a raft server impl
[ https://issues.apache.org/jira/browse/RATIS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-966. --- Fix Version/s: 0.6.0 Resolution: Fixed > Add metric for different types of log entries for a raft server impl > > > Key: RATIS-966 > URL: https://issues.apache.org/jira/browse/RATIS-966 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Assignee: Ansh Khanna >Priority: Major > Fix For: 0.6.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, a raft log entry can be one of several > types: > 1) Configuration > 2) MetaData > 3) StateMachine > > The idea here is to track the count of each type for a given raft server impl.
[jira] [Commented] (RATIS-966) Add metric for different types of log entries for a raft server impl
[ https://issues.apache.org/jira/browse/RATIS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128818#comment-17128818 ] Shashikant Banerjee commented on RATIS-966: --- If the count is incremented after the commit index is updated, it should be just fine. > Add metric for different types of log entries for a raft server impl > > > Key: RATIS-966 > URL: https://issues.apache.org/jira/browse/RATIS-966 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Assignee: Ansh Khanna >Priority: Major > > Currently, a raft log entry can be one of several > types: > 1) Configuration > 2) MetaData > 3) StateMachine > > The idea here is to track the count of each type for a given raft server impl.
[jira] [Commented] (RATIS-966) Add metric for different types of log entries for a raft server impl
[ https://issues.apache.org/jira/browse/RATIS-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126962#comment-17126962 ] Shashikant Banerjee commented on RATIS-966: --- Thanks [~ansh.khanna]. {code:java} Should the count be incremented when an entry is appended or committed? (since uncommitted entries can be potentially discarded) {code} The count should change only when an entry is committed. {code:java} Should the count be decremented in case they are discarded? {code} Yes. The idea is to know the count of each type of entry in the raft log, and at any point in time the metric should reflect what the raft log has. > Add metric for different types of log entries for a raft server impl > > > Key: RATIS-966 > URL: https://issues.apache.org/jira/browse/RATIS-966 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Assignee: Ansh Khanna >Priority: Major > > Currently, a raft log entry can be one of several > types: > 1) Configuration > 2) MetaData > 3) StateMachine > > The idea here is to track the count of each type for a given raft server impl.
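The counting semantics discussed in this comment (count on commit, decrement on discard, one count per entry type) could look roughly like the following. It is a sketch under stated assumptions only: `LogEntryTypeCounts`, its `EntryType` enum and the `onCommitted`/`onDiscarded` hooks are hypothetical names, and nothing here is taken from the Ratis metrics code or from the patch for this issue.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-type counters for raft log entries; NOT the Ratis implementation. */
final class LogEntryTypeCounts {
  /** The three entry types listed in the issue description. */
  enum EntryType { CONFIGURATION, METADATA, STATE_MACHINE }

  private final Map<EntryType, AtomicLong> counts = new EnumMap<>(EntryType.class);

  LogEntryTypeCounts() {
    // Pre-populate so that lookups never return null and the map is never resized later.
    for (EntryType type : EntryType.values()) {
      counts.put(type, new AtomicLong());
    }
  }

  /** Assumed hook: called once per entry after the commit index has advanced past it. */
  void onCommitted(EntryType type) {
    counts.get(type).incrementAndGet();
  }

  /** Assumed hook: called when a previously counted entry is removed from the log. */
  void onDiscarded(EntryType type) {
    counts.get(type).decrementAndGet();
  }

  long get(EntryType type) {
    return counts.get(type).get();
  }
}
```

Entries that are appended but never committed are simply never counted in this sketch, which matches the answer to the first question above.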
[jira] [Created] (RATIS-966) Add metric for different types of log entries for a raft server impl
Shashikant Banerjee created RATIS-966: - Summary: Add metric for different types of log entries for a raft server impl Key: RATIS-966 URL: https://issues.apache.org/jira/browse/RATIS-966 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee Currently, a raft log entry can be one of several types: 1) Configuration 2) MetaData 3) StateMachine The idea here is to track the count of each type for a given raft server impl.
[jira] [Created] (RATIS-965) Add a metric for raftServer impl groups for a raft server
Shashikant Banerjee created RATIS-965: - Summary: Add a metric for raftServer impl groups for a raft server Key: RATIS-965 URL: https://issues.apache.org/jira/browse/RATIS-965 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee Currently, a single raft server instance can contain multiple raftServerImpl instances belonging to different raft groups. The idea here is to track the number of RaftGroups a raft server is part of.
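One way to picture the proposed metric is a gauge that reads the size of the server's group registry on demand. The sketch below is hypothetical throughout: `GroupCountSketch`, the use of string group ids and the `IntSupplier` gauge are assumptions made for illustration and do not describe how Ratis registers its metrics.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntSupplier;

/** Hypothetical registry of the raft groups one server belongs to; NOT the Ratis implementation. */
final class GroupCountSketch {
  private final Set<String> groupIds = ConcurrentHashMap.newKeySet();

  /** A gauge-style supplier: each call reports the current number of groups. */
  IntSupplier gauge() {
    return groupIds::size;
  }

  /** Returns true if the group was not already present. */
  boolean addGroup(String groupId) {
    return groupIds.add(groupId);
  }

  /** Returns true if the group was present and has been removed. */
  boolean removeGroup(String groupId) {
    return groupIds.remove(groupId);
  }
}
```

Reading the size lazily means the metric cannot drift from the registry, at the cost of the value being computed on every read.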
[jira] [Resolved] (RATIS-939) Failed UT: testRaftServerMetrics
[ https://issues.apache.org/jira/browse/RATIS-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-939. --- Fix Version/s: 0.6.0 Resolution: Fixed > Failed UT: testRaftServerMetrics > > > Key: RATIS-939 > URL: https://issues.apache.org/jira/browse/RATIS-939 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 1h > Remaining Estimate: 0h > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-964) Fix failed UT: testRestartLogAppender
[ https://issues.apache.org/jira/browse/RATIS-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-964. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix failed UT: testRestartLogAppender > - > > Key: RATIS-964 > URL: https://issues.apache.org/jira/browse/RATIS-964 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-867) TestMetaServer#testListLogs
[ https://issues.apache.org/jira/browse/RATIS-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-867. --- Fix Version/s: 0.6.0 Resolution: Fixed > TestMetaServer#testListLogs > --- > > Key: RATIS-867 > URL: https://issues.apache.org/jira/browse/RATIS-867 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: image.png > > Time Spent: 1h > Remaining Estimate: 0h > > > The issue was observed here: > [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.logservice.server/TestMetaServer/testListLogs/] > {code:java} > java.lang.AssertionError: expected:<19> but was:<20> > at > org.apache.ratis.logservice.server.TestMetaServer.testJMXCount(TestMetaServer.java:339) > at > org.apache.ratis.logservice.server.TestMetaServer.testListLogs(TestMetaServer.java:331) > {code} > > The reason is: > 1. Creating a log calls > [RaftClientImpl::sendRequestWithRetry|https://github.com/apache/incubator-ratis/blob/master/ratis-client/src/main/java/org/apache/ratis/client/impl/RaftClientImpl.java#L285], > which, if a TimeoutIOException is thrown, retries at [final RaftClientReply reply = > sendRequest(request)|https://github.com/apache/incubator-ratis/blob/master/ratis-client/src/main/java/org/apache/ratis/client/impl/RaftClientImpl.java#L296]. > Each retry increments the JMX count again at [timerContext = > metricRegistry.timer(type.name()).time()|https://github.com/apache/incubator-ratis/blob/master/ratis-logservice/src/main/java/org/apache/ratis/logservice/server/MetaStateMachine.java#L224], > so the JMX count (20) ends up not equal to createCount (19). > 2. The TimeoutIOException is as follows: > {code:java} > org.apache.ratis.protocol.TimeoutIOException: deadline exceeded after > 2.77899s. 
[buffered_nanos=1460409, remote_addr=localhost/127.0.0.1:9001] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
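The mismatch described above can be reproduced in miniature: a counter bumped on every server-side attempt drifts away from the number of logical requests as soon as one request is retried. The sketch below uses made-up names (RetryCountSketch, handle, sendWithRetry) and a simulated timeout; it does not use the Ratis client or the metrics registry.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the client/server pair; not Ratis code.
final class RetryCountSketch {
  // Plays the role of the timer count that the test reads over JMX.
  final AtomicInteger attempts = new AtomicInteger();

  /** One server-side handling of a request; the counter is bumped on every attempt. */
  boolean handle(boolean timesOut) {
    attempts.incrementAndGet();
    return !timesOut;
  }

  /** Sends one logical request, retrying after each simulated timeout. Returns the attempts used. */
  int sendWithRetry(int timeoutsBeforeSuccess) {
    int tries = 0;
    while (true) {
      tries++;
      if (handle(tries <= timeoutsBeforeSuccess)) {
        return tries;
      }
    }
  }
}
```

With 19 logical requests of which one times out once, the attempt counter reaches 20, which matches the shape of the assertion failure in the report.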
[jira] [Resolved] (RATIS-935) Fix memory leak by ungister metrics
[ https://issues.apache.org/jira/browse/RATIS-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-935. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix memory leak by ungister metrics > --- > > Key: RATIS-935 > URL: https://issues.apache.org/jira/browse/RATIS-935 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-959) Refactor. xxxStateMachineData methods
[ https://issues.apache.org/jira/browse/RATIS-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123536#comment-17123536 ] Shashikant Banerjee commented on RATIS-959: --- Thanks [~szetszwo] for working on this. The changes look good. Can you submit a PR for this, as we now follow the Ratis PR model? > Refactor. xxxStateMachineData methods > - > > Key: RATIS-959 > URL: https://issues.apache.org/jira/browse/RATIS-959 > Project: Ratis > Issue Type: Improvement > Components: StateMachine >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Attachments: r959_20200529.patch > > > Currently, the StateMachine interface has quite a few methods related to > state machine data as below: > - writeStateMachineData > - readStateMachineData > - flushStateMachineData > - truncateStateMachineData > We propose moving them to a new DataApi interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-892) Failed UT: TestMetaServer.testReadWritetoLog
[ https://issues.apache.org/jira/browse/RATIS-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-892. --- Fix Version/s: 0.6.0 Resolution: Fixed > Failed UT: TestMetaServer.testReadWritetoLog > > > Key: RATIS-892 > URL: https://issues.apache.org/jira/browse/RATIS-892 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-947) RequestTypeDependentRetryPolicy should have timeout per request type
[ https://issues.apache.org/jira/browse/RATIS-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-947. --- Fix Version/s: 0.6.0 Resolution: Fixed > RequestTypeDependentRetryPolicy should have timeout per request type > > > Key: RATIS-947 > URL: https://issues.apache.org/jira/browse/RATIS-947 > Project: Ratis > Issue Type: Bug > Components: client >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Fix For: 0.6.0 > > > RequestTypeDependentRetryPolicy currently has single timeout for all request > types. The Jira aims to add timeout for every request type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
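As a rough illustration of "a timeout for every request type", the sketch below keeps one timeout per type with a fallback default. The names (RequestKind, PerTypeTimeouts, timeoutFor, shouldRetry) are invented for this example and do not reflect the actual RequestTypeDependentRetryPolicy API.

```java
import java.time.Duration;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical request types; the real set in Ratis may differ.
enum RequestKind { WRITE, READ, WATCH }

// Hypothetical holder of per-type timeouts; not the Ratis retry policy.
final class PerTypeTimeouts {
  private final Map<RequestKind, Duration> timeouts = new EnumMap<>(RequestKind.class);
  private final Duration defaultTimeout;

  PerTypeTimeouts(Duration defaultTimeout) {
    this.defaultTimeout = defaultTimeout;
  }

  /** Sets the timeout for one request type and returns this object for chaining. */
  PerTypeTimeouts set(RequestKind kind, Duration timeout) {
    timeouts.put(kind, timeout);
    return this;
  }

  /** Returns the timeout for the type, or the default when none was set. */
  Duration timeoutFor(RequestKind kind) {
    return timeouts.getOrDefault(kind, defaultTimeout);
  }

  /** A retry is allowed only while the elapsed time is below that type's timeout. */
  boolean shouldRetry(RequestKind kind, Duration elapsed) {
    return elapsed.compareTo(timeoutFor(kind)) < 0;
  }
}
```

A long-running watch request can then be given a larger budget than a write without changing the behavior for other types.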
[jira] [Resolved] (RATIS-942) Fix can not create raftLogMetrics in multi-raft
[ https://issues.apache.org/jira/browse/RATIS-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-942. --- Fix Version/s: 0.6.0 Resolution: Fixed > Fix can not create raftLogMetrics in multi-raft > --- > > Key: RATIS-942 > URL: https://issues.apache.org/jira/browse/RATIS-942 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-900) Failed UT: RaftExceptionBaseTest.testHandleNotLeaderAndIOException
[ https://issues.apache.org/jira/browse/RATIS-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-900. --- Fix Version/s: 0.6.0 Resolution: Fixed > Failed UT: RaftExceptionBaseTest.testHandleNotLeaderAndIOException > -- > > Key: RATIS-900 > URL: https://issues.apache.org/jira/browse/RATIS-900 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: image.png, screenshot-1.png, screenshot-2.png > > Time Spent: 40m > Remaining Estimate: 0h > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-930) Failed to remove RaftStorageDirectory
[ https://issues.apache.org/jira/browse/RATIS-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109907#comment-17109907 ] Shashikant Banerjee commented on RATIS-930: --- Thanks [~yjxxtd] for working on this. I have committed this. > Failed to remove RaftStorageDirectory > - > > Key: RATIS-930 > URL: https://issues.apache.org/jira/browse/RATIS-930 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > One thread move and another thread delete at the same time, then both fail. > !screenshot-1.png! > !screenshot-2.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-930) Failed to remove RaftStorageDirectory
[ https://issues.apache.org/jira/browse/RATIS-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-930. --- Fix Version/s: 0.6.0 Resolution: Fixed > Failed to remove RaftStorageDirectory > - > > Key: RATIS-930 > URL: https://issues.apache.org/jira/browse/RATIS-930 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png, screenshot-2.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > One thread move and another thread delete at the same time, then both fail. > !screenshot-1.png! > !screenshot-2.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-845) Memory leak of RaftServerImpl for no unregister from reporter
[ https://issues.apache.org/jira/browse/RATIS-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-845. --- Fix Version/s: 0.6.0 Resolution: Fixed Thanks [~yjxxtd] for working on this. I have committed this. > Memory leak of RaftServerImpl for no unregister from reporter > - > > Key: RATIS-845 > URL: https://issues.apache.org/jira/browse/RATIS-845 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-10.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png, > screenshot-8.png, screenshot-9.png > > Time Spent: 4h 40m > Remaining Estimate: 0h > > *What's the problem?* > As the image shows, there are 1885 instances of RaftServerImpl. Most of them > are closed and should have been garbage collected, but were not. The image shows that > 1513 RaftServerImpl were held by > ManagementFactory->JmxMBeanServer->HashMap, and 372 RaftServerImpl were held by > Datanode ReportManager Thread -> prometheus -> HashMap. So 1513 > RaftServerImpl leak in Ratis, and 372 leak in Ozone. If a RaftServerImpl cannot > be garbage collected, many related resources cannot be collected either, such as the > [DirectByteBuffer|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/raftlog/segmented/SegmentedRaftLogWorker.java#L150] > in SegmentedRaftLogWorker, which results in a 1GB memory leak outside the heap. > h3. *{color:#DE350B}1. 1885 instances of RaftServerImpl {color}* > !screenshot-4.png! > h3. *{color:#DE350B}2. 1513 RaftServerImpl were held by > ManagementFactory->JmxMBeanServer->HashMap, 372 RaftServerImpl were held by > Datanode ReportManager Thread -> prometheus -> HashMap{color}* > !screenshot-5.png! > h3. *{color:#DE350B}3. 1513 RaftServerImpl were held by > ManagementFactory->JmxMBeanServer->HashMap{color}* > !screenshot-6.png! > h3. 
*{color:#DE350B}4. 372 RaftServerImpl were held by Datanode ReportManager > Thread -> prometheus -> HashMap{color}* > !screenshot-7.png! > h3. *{color:#DE350B}5. 2038 DirectByteBuffer, and 1885 held by > RaftServerImpl.{color}* > !screenshot-8.png! > !screenshot-9.png! > h3. *{color:#DE350B}6. 1033 DirectByteBuffer were held by ManagementFactory, > 802 DirectByteBuffer were held by Datanode ReportManager Thread, total > 1885.{color}* > !screenshot-10.png! > h3. *{color:#DE350B}7. RaftServerImpl is held by > ManagementFactory->JmxMBeanServer->HashMap because Ratis starts the > [JmxReporter|https://github.com/apache/incubator-ratis/blob/master/ratis-metrics/src/main/java/org/apache/ratis/metrics/MetricsReporting.java#L47], > but does not stop it. {color}* > h3. *{color:#DE350B}8. RaftServerImpl is held by Datanode > ReportManager Thread -> prometheus -> HashMap because Ozone calls the Ratis > function to > [register|https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java#L189] > the metric in prometheus, but does not unregister it.{color}* -- This message was sent by Atlassian Jira (v8.3.4#803005)
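The leak pattern in points 7 and 8 comes down to a long-lived registry holding a strong reference to every object that registered with it and never being told to let go. A minimal sketch of the pattern and its remedy follows; ReporterRegistry and ServerSketch are invented stand-ins, not the JMX, Prometheus, or Ratis classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a long-lived reporter such as a JMX or Prometheus registry.
final class ReporterRegistry {
  private final Map<String, Object> registered = new ConcurrentHashMap<>();

  void register(String name, Object source) {
    registered.put(name, source);
  }

  void unregister(String name) {
    registered.remove(name);
  }

  int size() {
    return registered.size();
  }
}

// Hypothetical stand-in for a server object that publishes metrics.
final class ServerSketch implements AutoCloseable {
  private final String name;
  private final ReporterRegistry registry;

  ServerSketch(String name, ReporterRegistry registry) {
    this.name = name;
    this.registry = registry;
    // From here on the registry keeps this object strongly reachable.
    registry.register(name, this);
  }

  @Override
  public void close() {
    // Without this call a closed server stays reachable through the registry's map.
    registry.unregister(name);
  }
}
```

If close() skipped the unregister call, every closed ServerSketch would stay in the registry's map, which is the same shape as the retained RaftServerImpl instances in the heap dump.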
[jira] [Commented] (RATIS-846) Create replicated counter example
[ https://issues.apache.org/jira/browse/RATIS-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107453#comment-17107453 ] Shashikant Banerjee commented on RATIS-846: --- Thanks [~esa.hekmat] for working on this. I have committed this. > Create replicated counter example > - > > Key: RATIS-846 > URL: https://issues.apache.org/jira/browse/RATIS-846 > Project: Ratis > Issue Type: Improvement > Components: examples >Reporter: Isa Hekmatizadeh >Assignee: Isa Hekmatizadeh >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > Create a very very simple example that just maintains a counter value across > the cluster to illustrate "How to use Ratis" in the simplest way. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-901) Failed UT: WatchRequestTests.testWatchRequestClientTimeout
[ https://issues.apache.org/jira/browse/RATIS-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-901. --- Fix Version/s: 0.6.0 Resolution: Fixed > Failed UT: WatchRequestTests.testWatchRequestClientTimeout > -- > > Key: RATIS-901 > URL: https://issues.apache.org/jira/browse/RATIS-901 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 50m > Remaining Estimate: 0h > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-929) Shutdown EventLoopGroup faster
[ https://issues.apache.org/jira/browse/RATIS-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-929. --- Fix Version/s: 0.6.0 Resolution: Fixed > Shutdown EventLoopGroup faster > -- > > Key: RATIS-929 > URL: https://issues.apache.org/jira/browse/RATIS-929 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-924) rename raft group dir on disk when remove group is invoked
[ https://issues.apache.org/jira/browse/RATIS-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100792#comment-17100792 ] Shashikant Banerjee commented on RATIS-924: --- Currently, even if a raft group is removed, the raft log directory is not cleaned up. As a result, when the datanode restarts, it still reinitializes the raft group because it finds the raft group dir intact. I think we should do the following: 1) Add a config to selectively delete/retain the raft group dir on group remove 2) If the deleteDirectory config is set to false, it should rename the dir so that reinitialisation doesn't happen on the next restart > rename raft group dir on disk when remove group is invoked > -- > > Key: RATIS-924 > URL: https://issues.apache.org/jira/browse/RATIS-924 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
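The two options proposed in this comment could be sketched as follows, using only java.nio.file. The class name GroupDirCleanup, the method onGroupRemove, and the ".removed" suffix are assumptions made for this example; the comment does not specify how the renamed directory should be named or what the config key is called.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of Ratis.
final class GroupDirCleanup {
  private GroupDirCleanup() {}

  /**
   * Deletes the group directory when deleteDirectory is true; otherwise renames it
   * so that a restart no longer finds it under its original name.
   * Returns the new path when the directory was renamed.
   */
  static Optional<Path> onGroupRemove(Path groupDir, boolean deleteDirectory) throws IOException {
    if (deleteDirectory) {
      List<Path> paths;
      try (Stream<Path> walk = Files.walk(groupDir)) {
        // Reverse order puts children before their parent directories.
        paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      }
      for (Path p : paths) {
        Files.delete(p);
      }
      return Optional.empty();
    }
    // The ".removed" suffix is an assumption made for this sketch.
    Path renamed = groupDir.resolveSibling(groupDir.getFileName() + ".removed");
    return Optional.of(Files.move(groupDir, renamed));
  }
}
```

Renaming within the same parent directory keeps the data available for inspection while making sure a scan for group directories by their original name no longer picks it up.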
[jira] [Updated] (RATIS-924) rename raft group dir on disk when remove group is invoked
[ https://issues.apache.org/jira/browse/RATIS-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-924: -- Summary: rename raft group dir on disk when remove group is invoked (was: rename group on disk when remove group ) > rename raft group dir on disk when remove group is invoked > -- > > Key: RATIS-924 > URL: https://issues.apache.org/jira/browse/RATIS-924 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-927) Improve the log of remove group
[ https://issues.apache.org/jira/browse/RATIS-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-927. --- Fix Version/s: 0.6.0 Resolution: Fixed Thanks [~yjxxtd] for the contribution. I have committed this. > Improve the log of remove group > --- > > Key: RATIS-927 > URL: https://issues.apache.org/jira/browse/RATIS-927 > Project: Ratis > Issue Type: Improvement >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-05-06-15-13-57-035.png > > Time Spent: 10m > Remaining Estimate: 0h > > !image-2020-05-06-15-13-57-035.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-576) NullPointerException at the ratis client while running Freon benchmark
[ https://issues.apache.org/jira/browse/RATIS-576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-576. --- Fix Version/s: 0.6.0 Resolution: Not A Problem > NullPointerException at the ratis client while running Freon benchmark > -- > > Key: RATIS-576 > URL: https://issues.apache.org/jira/browse/RATIS-576 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.4.0 >Reporter: Rakesh Radhakrishnan >Assignee: Nanda kumar >Priority: Blocker > Labels: ozone > Fix For: 0.6.0 > > Attachments: NPE-logs.tar.gz > > > Hits NPE during Freon benchmark test run. Below is the exception logged at > the client side output log message. > {code} > SEVERE: Exception while executing runnable > org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@6c585536 > java.lang.NullPointerException > at > org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.completeReplyExceptionally(GrpcClientProtocolClient.java:320) > at > org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.access$000(GrpcClientProtocolClient.java:245) > at > org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onError(GrpcClientProtocolClient.java:269) > at > org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434) > at > org.apache.ratis.thirdparty.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) > at > org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) > at > org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) > at > org.apache.ratis.thirdparty.io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:678) > at > 
org.apache.ratis.thirdparty.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) > at > org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) > at > org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) > at > org.apache.ratis.thirdparty.io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) > at > org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) > at > org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63) > at > org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546) > at > org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467) > at > org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584) > at > org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) > at > org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-780) Pipeline reports LeaderNotReadyException after written a bunch of data
[ https://issues.apache.org/jira/browse/RATIS-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-780. --- Fix Version/s: 0.6.0 Resolution: Cannot Reproduce > Pipeline reports LeaderNotReadyException after written a bunch of data > -- > > Key: RATIS-780 > URL: https://issues.apache.org/jira/browse/RATIS-780 > Project: Ratis > Issue Type: Bug >Reporter: Sammi Chen >Assignee: Shashikant Banerjee >Priority: Blocker > Fix For: 0.6.0 > > Attachments: client.log, leader-metrics.log > > > The pipeline failed to serve write requests after a bunch of data had been written. > There are no WARN or ERROR messages in the datanode log file. > Client log attached. > Leader node metrics attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-914) Failed UT: Can not mock final class
[ https://issues.apache.org/jira/browse/RATIS-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-914. --- Fix Version/s: 0.6.0 Resolution: Fixed Thanks [~yjxxtd] for working on this and [~adoroszlai] for the review. I have committed this. > Failed UT: Can not mock final class > > > Key: RATIS-914 > URL: https://issues.apache.org/jira/browse/RATIS-914 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 10m > Remaining Estimate: 0h > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-911) Failed UT: testRestartLogAppender
[ https://issues.apache.org/jira/browse/RATIS-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-911. --- Fix Version/s: 0.6.0 Resolution: Fixed Thanks [~yjxxtd] for working on this. I have committed this. > Failed UT: testRestartLogAppender > - > > Key: RATIS-911 > URL: https://issues.apache.org/jira/browse/RATIS-911 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > > Can not elect a leader for a long time. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (RATIS-845) Memory leak of RaftServerImpl
[ https://issues.apache.org/jira/browse/RATIS-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096244#comment-17096244 ] Shashikant Banerjee edited comment on RATIS-845 at 4/30/20, 7:07 AM: - [~avijayan]/[~elek] can you plz review it as the fix seem to add a prometheus sink in ratis? was (Author: shashikant): [~avijayan]/[~elek] can you plz review it as it adds a prometheus sink in ratis? > Memory leak of RaftServerImpl > - > > Key: RATIS-845 > URL: https://issues.apache.org/jira/browse/RATIS-845 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: screenshot-10.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png, > screenshot-8.png, screenshot-9.png > > Time Spent: 20m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (RATIS-845) Memory leak of RaftServerImpl
[ https://issues.apache.org/jira/browse/RATIS-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096244#comment-17096244 ] Shashikant Banerjee edited comment on RATIS-845 at 4/30/20, 7:06 AM: - [~avijayan]/[~elek] can you plz review it as it adds a prometheus sink in ratis? was (Author: shashikant): [~avijayan], can you plz review it? > Memory leak of RaftServerImpl > - > > Key: RATIS-845 > URL: https://issues.apache.org/jira/browse/RATIS-845 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: screenshot-10.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png, > screenshot-8.png, screenshot-9.png > > Time Spent: 20m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-845) Memory leak of RaftServerImpl
[ https://issues.apache.org/jira/browse/RATIS-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096244#comment-17096244 ] Shashikant Banerjee commented on RATIS-845: --- [~avijayan], can you plz review it? > Memory leak of RaftServerImpl > - > > Key: RATIS-845 > URL: https://issues.apache.org/jira/browse/RATIS-845 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: screenshot-10.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png, > screenshot-8.png, screenshot-9.png > > Time Spent: 20m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (RATIS-912) Netty tests fail with RejectedExecutionException on channel close
[ https://issues.apache.org/jira/browse/RATIS-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee reassigned RATIS-912: - Assignee: Lokesh Jain > Netty tests fail with RejectedExecutionException on channel close > - > > Key: RATIS-912 > URL: https://issues.apache.org/jira/browse/RATIS-912 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: Lokesh Jain >Priority: Major > Attachments: screenshot-1.png > > Time Spent: 10m > Remaining Estimate: 0h > > It looks like RATIS-910 introduced a new unit test failure. This type of failure > did not occur in previous commits. > https://github.com/apache/incubator-ratis/runs/625933249 > https://github.com/apache/incubator-ratis/runs/626750611 > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-912) Netty tests fail with RejectedExecutionException on channel close
[ https://issues.apache.org/jira/browse/RATIS-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-912. --- Fix Version/s: 0.6.0 Resolution: Fixed > Netty tests fail with RejectedExecutionException on channel close > - > > Key: RATIS-912 > URL: https://issues.apache.org/jira/browse/RATIS-912 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: Lokesh Jain >Priority: Major > Fix For: 0.6.0 > > Attachments: screenshot-1.png > > Time Spent: 10m > Remaining Estimate: 0h > > It looks like RATIS-910 introduced a new unit test failure. This type of failure > did not occur in previous commits. > https://github.com/apache/incubator-ratis/runs/625933249 > https://github.com/apache/incubator-ratis/runs/626750611 > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-912) Failed UT: RejectedExecutionException: event executor terminated
[ https://issues.apache.org/jira/browse/RATIS-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-912: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Failed UT: RejectedExecutionException: event executor terminated > > > Key: RATIS-912 > URL: https://issues.apache.org/jira/browse/RATIS-912 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Priority: Major > Attachments: screenshot-1.png > > > It looks like RATIS-910 introduced a new unit test failure. This type of failure > did not occur in previous commits. > https://github.com/apache/incubator-ratis/runs/625933249 > https://github.com/apache/incubator-ratis/runs/626750611 > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-911) Failed UT: testRestartLogAppender
[ https://issues.apache.org/jira/browse/RATIS-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-911: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Failed UT: testRestartLogAppender > - > > Key: RATIS-911 > URL: https://issues.apache.org/jira/browse/RATIS-911 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: screenshot-1.png > > > A leader cannot be elected for a long time. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-889) Fix TestRaftSnapshotWithGrpc
[ https://issues.apache.org/jira/browse/RATIS-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-889: -- Attachment: unit.zip > Fix TestRaftSnapshotWithGrpc > > > Key: RATIS-889 > URL: https://issues.apache.org/jira/browse/RATIS-889 > Project: Ratis > Issue Type: Sub-task > Components: snapshot >Affects Versions: 0.6.0 >Reporter: Shashikant Banerjee >Priority: Major > Fix For: 0.6.0 > > Attachments: unit.zip > > > {code:java} > Test set: org.apache.ratis.grpc.TestRaftSnapshotWithGrpc > --- > Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.84 s <<< > FAILURE! - in org.apache.ratis.grpc.TestRaftSnapshotWithGrpc > testBasicInstallSnapshot(org.apache.ratis.grpc.TestRaftSnapshotWithGrpc) > Time elapsed: 2.308 s <<< FAILURE! > java.lang.AssertionError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-890) Fix TestMetaServer timeout
[ https://issues.apache.org/jira/browse/RATIS-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-890: -- Summary: Fix TestMetaServer timeout (was: Fix TestMetaServer) > Fix TestMetaServer timeout > -- > > Key: RATIS-890 > URL: https://issues.apache.org/jira/browse/RATIS-890 > Project: Ratis > Issue Type: Sub-task > Components: LogService >Reporter: Shashikant Banerjee >Priority: Major > Fix For: 0.6.0 > > Attachments: unit.zip > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-890) Fix TestMetaServer
[ https://issues.apache.org/jira/browse/RATIS-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-890: -- Attachment: unit.zip > Fix TestMetaServer > -- > > Key: RATIS-890 > URL: https://issues.apache.org/jira/browse/RATIS-890 > Project: Ratis > Issue Type: Sub-task > Components: LogService >Reporter: Shashikant Banerjee >Priority: Major > Fix For: 0.6.0 > > Attachments: unit.zip > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-888) Fix TestLogAppenderWithGrpc#testRestartLogAppender
[ https://issues.apache.org/jira/browse/RATIS-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-888: -- Attachment: org.apache.ratis.grpc.TestLogAppenderWithGrpc-output.txt > Fix TestLogAppenderWithGrpc#testRestartLogAppender > -- > > Key: RATIS-888 > URL: https://issues.apache.org/jira/browse/RATIS-888 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Priority: Major > Attachments: org.apache.ratis.grpc.TestLogAppenderWithGrpc-output.txt > > > {code:java} > testRestartLogAppender(org.apache.ratis.grpc.TestLogAppenderWithGrpc) Time > elapsed: 2.817 s <<< FAILURE! > testRestartLogAppender(org.apache.ratis.grpc.TestLogAppenderWithGrpc) > Time elapsed: 2.817 s <<< FAILURE! > java.lang.AssertionError: expected:<1> > but was:<2> > at > org.apache.ratis.grpc.TestLogAppenderWithGrpc.runTestRestartLogAppender(TestLogAppenderWithGrpc.java:129) > at > org.apache.ratis.grpc.TestLogAppenderWithGrpc.testRestartLogAppender(TestLogAppenderWithGrpc.java:96) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-890) Fix TestMetaServer
Shashikant Banerjee created RATIS-890: - Summary: Fix TestMetaServer Key: RATIS-890 URL: https://issues.apache.org/jira/browse/RATIS-890 Project: Ratis Issue Type: Sub-task Components: LogService Reporter: Shashikant Banerjee Fix For: 0.6.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-889) Fix TestRaftSnapshotWithGrpc
Shashikant Banerjee created RATIS-889: - Summary: Fix TestRaftSnapshotWithGrpc Key: RATIS-889 URL: https://issues.apache.org/jira/browse/RATIS-889 Project: Ratis Issue Type: Sub-task Components: snapshot Affects Versions: 0.6.0 Reporter: Shashikant Banerjee Fix For: 0.6.0 {code:java} Test set: org.apache.ratis.grpc.TestRaftSnapshotWithGrpc --- Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.84 s <<< FAILURE! - in org.apache.ratis.grpc.TestRaftSnapshotWithGrpc testBasicInstallSnapshot(org.apache.ratis.grpc.TestRaftSnapshotWithGrpc) Time elapsed: 2.308 s <<< FAILURE! java.lang.AssertionError {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-888) Fix TestLogAppenderWithGrpc#testRestartLogAppender
Shashikant Banerjee created RATIS-888: - Summary: Fix TestLogAppenderWithGrpc#testRestartLogAppender Key: RATIS-888 URL: https://issues.apache.org/jira/browse/RATIS-888 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee {code:java} testRestartLogAppender(org.apache.ratis.grpc.TestLogAppenderWithGrpc) Time elapsed: 2.817 s <<< FAILURE! testRestartLogAppender(org.apache.ratis.grpc.TestLogAppenderWithGrpc) Time elapsed: 2.817 s <<< FAILURE! java.lang.AssertionError: expected:<1> but was:<2> at org.apache.ratis.grpc.TestLogAppenderWithGrpc.runTestRestartLogAppender(TestLogAppenderWithGrpc.java:129) at org.apache.ratis.grpc.TestLogAppenderWithGrpc.testRestartLogAppender(TestLogAppenderWithGrpc.java:96) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-849) Failed UT: GroupManagementBaseTest.runMultiGroupTest
[ https://issues.apache.org/jira/browse/RATIS-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-849. --- Fix Version/s: 0.6.0 Resolution: Fixed Thanks [~yjxxtd] for the contribution. I have committed this. > Failed UT: GroupManagementBaseTest.runMultiGroupTest > > > Key: RATIS-849 > URL: https://issues.apache.org/jira/browse/RATIS-849 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0 > > Attachments: image-2020-04-14-21-07-42-831.png, screenshot-1.png > > Time Spent: 10m > Remaining Estimate: 0h > > *What's the problem?* > !image-2020-04-14-21-07-42-831.png! > *What's the reason?* > I tested with the patch, and the unit test failure no longer occurs. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-876) Introduce max timeout in RequestTypeDependentRetryPolicy
[ https://issues.apache.org/jira/browse/RATIS-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093128#comment-17093128 ] Shashikant Banerjee commented on RATIS-876: --- [~ljain], can you please rebase ? > Introduce max timeout in RequestTypeDependentRetryPolicy > > > Key: RATIS-876 > URL: https://issues.apache.org/jira/browse/RATIS-876 > Project: Ratis > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Attachments: RATIS-876.001.patch, RATIS-876.002.patch > > > This Jira aims to add a max timeout in RequestTypeDependentRetryPolicy. If a > timeout of 1 minute is configured then all retries after 1 minute of request > creation will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005)
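The rule described in RATIS-876 (no retries once a configured time has elapsed since the request was created) can be sketched as below. This is a minimal illustration under stated assumptions: the class `MaxTimeoutRetrySketch` and its `shouldRetry` method are hypothetical names and do not correspond to the Ratis `RetryPolicy` API or to the attached patches. The sketch also assumes that a retry exactly at the timeout boundary is rejected, which the issue text does not specify.

```java
import java.time.Duration;

// Hypothetical sketch; names do not correspond to the Ratis retry policy API.
class MaxTimeoutRetrySketch {
  private final Duration maxTimeout;

  MaxTimeoutRetrySketch(Duration maxTimeout) {
    this.maxTimeout = maxTimeout;
  }

  /**
   * Returns true if a retry attempted at nowNanos is still allowed for a
   * request that was created at createdNanos.
   */
  boolean shouldRetry(long createdNanos, long nowNanos) {
    return nowNanos - createdNanos < maxTimeout.toNanos();
  }

  public static void main(String[] args) {
    MaxTimeoutRetrySketch policy = new MaxTimeoutRetrySketch(Duration.ofMinutes(1));
    long created = System.nanoTime();
    // 30 seconds after creation: still inside the 1 minute window.
    System.out.println(policy.shouldRetry(created, created + Duration.ofSeconds(30).toNanos()));
    // 61 seconds after creation: outside the window, so the retry is refused.
    System.out.println(policy.shouldRetry(created, created + Duration.ofSeconds(61).toNanos()));
  }
}
```

Measuring from request creation rather than from the last attempt is what makes the limit a bound on the total time a request may spend retrying.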
[jira] [Resolved] (RATIS-865) Fix TestRaftWithGrpc#testStateMachineMetrics
[ https://issues.apache.org/jira/browse/RATIS-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-865. --- Fix Version/s: 0.6.0 Resolution: Duplicate > Fix TestRaftWithGrpc#testStateMachineMetrics > > > Key: RATIS-865 > URL: https://issues.apache.org/jira/browse/RATIS-865 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Priority: Major > Fix For: 0.6.0 > > > The failure was observed here: > [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.grpc/TestRaftWithGrpc/testStateMachineMetrics/] > {code:java} > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.ratis.RaftBasicTests.checkFollowerCommitLagsLeader(RaftBasicTests.java:494) > at > org.apache.ratis.RaftBasicTests.testStateMachineMetrics(RaftBasicTests.java:469) > at > org.apache.ratis.grpc.TestRaftWithGrpc.lambda$testStateMachineMetrics$1(TestRaftWithGrpc.java:65) > at > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) > at > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) > at > org.apache.ratis.grpc.TestRaftWithGrpc.testStateMachineMetrics(TestRaftWithGrpc.java:64) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > 2ND INSTANCE > --- > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) > java.util.NoSuchElementException > at java.util.TreeMap.key(TreeMap.java:1327) > at java.util.TreeMap.firstKey(TreeMap.java:290) > at > java.util.Collections$UnmodifiableSortedMap.firstKey(Collections.java:1808) > at > org.apache.ratis.server.impl.RaftServerMetrics.getPeerCommitIndexGauge(RaftServerMetrics.java:159) > at > org.apache.ratis.RaftBasicTests.checkFollowerCommitLagsLeader(RaftBasicTests.java:487) > at > org.apache.ratis.RaftBasicTests.testStateMachineMetrics(RaftBasicTests.java:458) > at > org.apache.ratis.grpc.TestRaftWithGrpc.lambda$testStateMachineMetrics$1(TestRaftWithGrpc.java:65) > at > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) > at > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) > at > org.apache.ratis.grpc.TestRaftWithGrpc.testStateMachineMetrics(TestRaftWithGrpc.java:64) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at
[jira] [Updated] (RATIS-884) Failed UT: TestRaftLogMetrics.testRaftLogMetrics
[ https://issues.apache.org/jira/browse/RATIS-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-884: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Failed UT: TestRaftLogMetrics.testRaftLogMetrics > > > Key: RATIS-884 > URL: https://issues.apache.org/jira/browse/RATIS-884 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-884.001.patch, screenshot-1.png, screenshot-2.png > > > *What's the problem?* > !screenshot-1.png! > *What's the reason?* > As the image shows, flushCount.incrementAndGet() happens in the 1st > statement: stateMachine.flushStateMachineData(lastWrittenIndex), while > ratisMetricRegistry.get(RAFT_LOG_FLUSH_TIME) increases only after the 2nd > statement: timerContext.stop(). If the test's > Assert.assertEquals(expectedFlush, tm.getCount()) runs between the 1st and 2nd > statements, expectedFlush will be tm.getCount() + 1, so the test fails. > !screenshot-2.png! > *How to fix?* > Retry the check expectedFlush == tm.getCount(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
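The proposed fix, retrying the equality check instead of asserting once, can be sketched as follows. This is a hedged illustration rather than the content of RATIS-884.001.patch: `awaitEquals`, `sleepQuietly`, and the two `AtomicLong` counters standing in for the flush counter and the timer count are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch of "retry the check" for two counters updated at different times.
class RetryEqualsSketch {
  static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Polls actual until it equals expected; gives up after maxAttempts polls. */
  static boolean awaitEquals(long expected, LongSupplier actual, int maxAttempts, long sleepMillis) {
    for (int i = 0; i < maxAttempts; i++) {
      if (actual.getAsLong() == expected) {
        return true;
      }
      sleepQuietly(sleepMillis);
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicLong flushCount = new AtomicLong();  // stands in for the flush counter
    AtomicLong timerCount = new AtomicLong();  // stands in for the timer's count

    flushCount.incrementAndGet();              // "1st statement": counter already bumped
    Thread worker = new Thread(() -> {
      sleepQuietly(50);                        // window in which a one-shot assert would fail
      timerCount.incrementAndGet();            // "2nd statement": timer stopped
    });
    worker.start();

    boolean matched = awaitEquals(flushCount.get(), timerCount::get, 100, 10);
    worker.join();
    System.out.println("counts matched after retrying: " + matched);
  }
}
```

A one-shot assertion placed right after the first increment would observe 1 versus 0 here; polling tolerates the window between the two updates.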
[jira] [Updated] (RATIS-885) Failed UT because error use of attemptRepeatedly to check boolean condition
[ https://issues.apache.org/jira/browse/RATIS-885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-885: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Failed UT because error use of attemptRepeatedly to check boolean condition > --- > > Key: RATIS-885 > URL: https://issues.apache.org/jira/browse/RATIS-885 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-885.001.patch, screenshot-1.png, screenshot-2.png > > > *What's the problem?* > !screenshot-1.png! > *What's the reason?* > I think the author of the following code wanted to retry for up to 10 seconds until > followerState.getLastAppliedIndex() >= leaderLastIndex, but > JavaUtils.attemptRepeatedly does not retry unless the statement throws an > exception, as the image shows. > {code:java} > // make sure the restarted follower can catchup > final ServerState followerState = > cluster.getRaftServerImpl(followerId).getState(); > JavaUtils.attemptRepeatedly(() -> followerState.getLastAppliedIndex() >= leaderLastIndex, > 10, ONE_SECOND, "follower catchup", LOG); > {code} > !screenshot-2.png! > *How to fix?* > I fixed all the incorrect uses of JavaUtils.attemptRepeatedly that check a boolean > condition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
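The pitfall in RATIS-885 can be reproduced with a simplified stand-in for a retry helper. In the sketch below, `attemptRepeatedly` is a hypothetical reimplementation that, like the behaviour described in the issue, retries only when the supplied operation throws; it is not the signature of `JavaUtils.attemptRepeatedly`. The `mustBeTrue` wrapper is likewise an assumption about one way to fix the call sites, not necessarily what RATIS-885.001.patch does.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical stand-in for a helper that retries only when the operation throws.
class AttemptRepeatedlySketch {
  static <T> T attemptRepeatedly(Supplier<T> op, int numAttempts, long sleepMillis) {
    for (int i = 1; ; i++) {
      try {
        return op.get();                       // any returned value, even false, ends the loop
      } catch (RuntimeException e) {
        if (i >= numAttempts) {
          throw e;                             // out of attempts: surface the last failure
        }
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw e;
        }
      }
    }
  }

  /** Turns a false condition into an exception so that the helper actually retries. */
  static Supplier<Boolean> mustBeTrue(BooleanSupplier condition, String name) {
    return () -> {
      if (!condition.getAsBoolean()) {
        throw new IllegalStateException(name + " not yet satisfied");
      }
      return true;
    };
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    // The condition becomes true on its 3rd evaluation.
    BooleanSupplier caughtUp = () -> calls.incrementAndGet() >= 3;

    Supplier<Boolean> bare = caughtUp::getAsBoolean;
    Boolean wrong = attemptRepeatedly(bare, 10, 10);
    System.out.println("bare boolean: result=" + wrong + ", calls=" + calls.get());
    // prints "bare boolean: result=false, calls=1"

    calls.set(0);
    Boolean right = attemptRepeatedly(mustBeTrue(caughtUp, "follower catchup"), 10, 10);
    System.out.println("wrapped: result=" + right + ", calls=" + calls.get());
    // prints "wrapped: result=true, calls=3"
  }
}
```

The bare lambda returns `false` on the first call and the helper treats that as success, which matches the symptom described in the issue: the test proceeds before the follower has caught up.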
[jira] [Updated] (RATIS-881) Failed unit test because test before MiniRaftCluster ready
[ https://issues.apache.org/jira/browse/RATIS-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-881: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Failed unit test because test before MiniRaftCluster ready > -- > > Key: RATIS-881 > URL: https://issues.apache.org/jira/browse/RATIS-881 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-881.001.patch, screenshot-1.png > > > The > [TestRaftWithGrpc::testStateMachineMetrics|https://builds.apache.org/job/PreCommit-RATIS-Build/1305/testReport/org.apache.ratis.grpc/TestRaftWithGrpc/testStateMachineMetrics/] > test fails because > [RaftServerMetrics::getPeerCommitIndexGauge|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L141] > runs before > [RaftServerMetrics::addPeerCommitIndexGauge|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L122]. > > When a RaftServerImpl calls > [setRole(RaftPeerRole.LEADER, "changeToLeader")|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L345], > the statement > [waitForLeader|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/test/java/org/apache/ratis/RaftBasicTests.java#L446] > succeeds in getting the leader and the test begins, but the chain > [role.startLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L349] > -> > [new LeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RoleInfo.java#L94] > -> > [LeaderState::addSenders|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L409] > -> > [RaftServerMetrics::addFollower|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L106] > -> > [RaftServerMetrics::addPeerCommitIndexGauge|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L122] > has not finished yet. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
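The ordering problem can be illustrated with a plain sorted map: reading the first key before any entry has been added throws, which is consistent with the `TreeMap.firstKey` frame in the `NoSuchElementException` trace quoted under RATIS-865. The sketch below assumes the gauge lookup behaves like a first-key read on a sorted map; the `GaugeLookupSketch` class, the `firstGaugeOrNull` helper, and the gauge name are hypothetical and not taken from `RaftServerMetrics`.

```java
import java.util.NoSuchElementException;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: reading a gauge before it has been registered.
class GaugeLookupSketch {
  /** Returns the first gauge value, or null if no gauge has been registered yet. */
  static Long firstGaugeOrNull(SortedMap<String, Long> gauges) {
    return gauges.isEmpty() ? null : gauges.get(gauges.firstKey());
  }

  public static void main(String[] args) {
    SortedMap<String, Long> gauges = new TreeMap<>();

    // "get" before "add": the map is still empty, so firstKey() throws.
    try {
      gauges.firstKey();
    } catch (NoSuchElementException e) {
      System.out.println("lookup before registration failed: " + e.getClass().getSimpleName());
    }

    // A caller that tolerates the race can check for absence and try again later.
    System.out.println("guarded lookup before registration: " + firstGaugeOrNull(gauges));

    gauges.put("s1_peerCommitIndex", 7L);      // "add" finally happens
    System.out.println("guarded lookup after registration: " + firstGaugeOrNull(gauges));
  }
}
```

A guarded lookup only hides the symptom in a test; the issue's point is that the test should not start reading metrics until the leader state has finished initializing.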
[jira] [Updated] (RATIS-883) Failed UT: testStateMachineMetrics.checkFollowerCommitLagsLeader
[ https://issues.apache.org/jira/browse/RATIS-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-883: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Failed UT: testStateMachineMetrics.checkFollowerCommitLagsLeader > > > Key: RATIS-883 > URL: https://issues.apache.org/jira/browse/RATIS-883 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-883.001.patch, screenshot-1.png > > > *What's the problem?* > !screenshot-1.png! > *What's the reason?* > The follower updates commitInfoCache after the leader does. > The call stack for the follower's commitInfoCache update is: > RaftServerImpl::appendEntriesAsync > -> state.updateStateMachine > -> StateMachineUpdater::applyLog > -> RaftServerImpl::applyLogToStateMachine > -> RaftServerImpl::replyPendingRequest > -> RaftServerImpl::getCommitInfos > -> infos.add(commitInfoCache.update(getPeer(), > state.getLog().getLastCommittedIndex())) > -> CommitInfoCache::update. > The call stack for the leader's commitInfoCache update is: > the follower finishes RaftServerImpl::appendEntriesAsync and returns its reply > -> GrpcLogAppender::runAppenderImpl > -> GrpcLogAppender::appendLog > -> LogAppender::createRequest > -> LeaderState::newAppendEntriesRequestProto > -> RaftServerImpl::getCommitInfos > -> LeaderState::updateFollowerCommitInfos > -> CommitInfoCache::update. > Because the follower must notify the StateMachineUpdater thread to update > CommitInfoCache, we cannot ensure that the follower updates CommitInfoCache before > the leader does. > *How to fix?* > The follower updates CommitInfoCache before returning its reply to the leader. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-849) Failed UT: GroupManagementBaseTest.runMultiGroupTest
[ https://issues.apache.org/jira/browse/RATIS-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-849: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Failed UT: GroupManagementBaseTest.runMultiGroupTest > > > Key: RATIS-849 > URL: https://issues.apache.org/jira/browse/RATIS-849 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-849.001.patch, image-2020-04-14-21-07-42-831.png, > screenshot-1.png > > > *What's the problem?* > !image-2020-04-14-21-07-42-831.png! > *What's the reason?* > I tested with the patch, and the unit test failure no longer occurs. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-865) Fix TestRaftWithGrpc#testStateMachineMetrics
[ https://issues.apache.org/jira/browse/RATIS-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-865: -- Description: The failure was observed here: [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.grpc/TestRaftWithGrpc/testStateMachineMetrics/] {code:java} org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.ratis.RaftBasicTests.checkFollowerCommitLagsLeader(RaftBasicTests.java:494) at org.apache.ratis.RaftBasicTests.testStateMachineMetrics(RaftBasicTests.java:469) at org.apache.ratis.grpc.TestRaftWithGrpc.lambda$testStateMachineMetrics$1(TestRaftWithGrpc.java:65) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) at org.apache.ratis.grpc.TestRaftWithGrpc.testStateMachineMetrics(TestRaftWithGrpc.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) 2ND INSTANCE --- org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) java.util.NoSuchElementException at java.util.TreeMap.key(TreeMap.java:1327) at java.util.TreeMap.firstKey(TreeMap.java:290) at java.util.Collections$UnmodifiableSortedMap.firstKey(Collections.java:1808) at org.apache.ratis.server.impl.RaftServerMetrics.getPeerCommitIndexGauge(RaftServerMetrics.java:159) at org.apache.ratis.RaftBasicTests.checkFollowerCommitLagsLeader(RaftBasicTests.java:487) at org.apache.ratis.RaftBasicTests.testStateMachineMetrics(RaftBasicTests.java:458) at org.apache.ratis.grpc.TestRaftWithGrpc.lambda$testStateMachineMetrics$1(TestRaftWithGrpc.java:65) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) at org.apache.ratis.grpc.TestRaftWithGrpc.testStateMachineMetrics(TestRaftWithGrpc.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} was: The failure was observed here: [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.grpc/TestRaftWithGrpc/testStateMachineMetrics/] {code:java} org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at
[jira] [Created] (RATIS-879) Fix TestRaftAsyncWithGrpc#testNoRetryWaitOnNotLeaderException
Shashikant Banerjee created RATIS-879: - Summary: Fix TestRaftAsyncWithGrpc#testNoRetryWaitOnNotLeaderException Key: RATIS-879 URL: https://issues.apache.org/jira/browse/RATIS-879 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee Fix For: 0.6.0 {code:java} java.lang.AssertionError: Failed to get async resultjava.lang.AssertionError: Failed to get async result at org.apache.ratis.RaftAsyncTests.runTestNoRetryWaitOnNotLeaderException(RaftAsyncTests.java:435) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) at org.apache.ratis.RaftAsyncTests.testNoRetryWaitOnNotLeaderException(RaftAsyncTests.java:407) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748)Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) at 
org.apache.ratis.util.TimeDuration.apply(TimeDuration.java:289) at org.apache.ratis.RaftAsyncTests.runTestNoRetryWaitOnNotLeaderException(RaftAsyncTests.java:433) ... 16 morejava.lang.IllegalStateException: Failed: first exception was set at org.apache.ratis.BaseTest.assertNoFailures(BaseTest.java:72) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.IllegalStateException: Unexpected getSleepTime: ClientRetryEvent:attempt=1,request=RaftClientRequest:client-7928BDA9D90A->s0@group-970F01270564, cid=2857, seq=1*, RW, abc,cause=org.apache.ratis.protocol.NotLeaderException: Server s0@group-970F01270564 is not the leader s1:0.0.0.0:57704 at org.apache.ratis.RaftAsyncTests.lambda$null$15(RaftAsyncTests.java:426) at org.apache.ratis.client.impl.OrderedAsync.scheduleWithTimeout(OrderedAsync.java:214) at org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequestWithRetry$6(OrderedAsync.java:200) at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.completeReplyExceptionally(GrpcClientProtocolClient.java:358) at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.access$000(GrpcClientProtocolClient.java:264) at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:278) at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:269) at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:429) at
[jira] [Updated] (RATIS-801) Ratis snapshot should consider stateMachine#appliedIndex for triggering snapshot
[ https://issues.apache.org/jira/browse/RATIS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-801: -- Description: Currently, while triggering snapshot, snapshotUpdater#appliedIndex is taken into account to decide whether it has exceeded the snapshot threshold from the last snapshotIndex. This may lead to creating more snapshots than usual as stateMachineUpdater#appliedIndex is updated as soon as the applyTransaction call happens. Ideally, Ratis snapshot should not be triggered taking stateMachine's applied index into account. (was: Currently, while triggering snapshot, snapshotUpdater#appliedIndex is taken into account to decide whether it has exceeded the snapshot threshold from the last snapshotIndex. This may lead to creating more snapshots than usual as stateMachineUpdater#appliedIndex is updated as soon as the applyTransaction call happens. Ideally, Ratis snapshot should nbe triggered taking stateMachine's applied index into account.) > Ratis snapshot should consider stateMachine#appliedIndex for triggering > snapshot > > > Key: RATIS-801 > URL: https://issues.apache.org/jira/browse/RATIS-801 > Project: Ratis > Issue Type: Improvement >Affects Versions: 0.5.0 >Reporter: Shashikant Banerjee >Assignee: Bharat Viswanadham >Priority: Major > Fix For: 0.6.0 > > > Currently, while triggering snapshot, snapshotUpdater#appliedIndex is taken > into account to decide whether it has exceeded the snapshot threshold from > the last snapshotIndex. This may lead to creating more snapshots than usual > as stateMachineUpdater#appliedIndex is updated as soon as the > applyTransaction call happens. Ideally, Ratis snapshot should not be > triggered taking stateMachine's applied index into account. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-877) StateMachineUpdater#takeSnapshot should check stateMachine lastAppliedIndex
[ https://issues.apache.org/jira/browse/RATIS-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee resolved RATIS-877. --- Fix Version/s: 0.6.0 Resolution: Duplicate > StateMachineUpdater#takeSnapshot should check stateMachine lastAppliedIndex > --- > > Key: RATIS-877 > URL: https://issues.apache.org/jira/browse/RATIS-877 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Lokesh Jain >Priority: Major > Fix For: 0.6.0 > > > Currently StateMachineUpdater#takeSnapshot checks whether the index of the snapshot > taken is greater than lastAppliedIndex. It should ideally check > stateMachineLastAppliedIndex, which reflects the index up to which the state > machine has already applied log entries. -- This message was sent by Atlassian Jira (v8.3.4#803005)
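The check discussed in RATIS-877 could be sketched roughly as follows. This is only an illustration under assumptions: the class SnapshotTrigger, the method shouldTakeSnapshot and its parameter names are hypothetical and are not taken from the Ratis source; the point is only that the decision is based on the index the state machine itself reports as applied.

```java
// Hypothetical sketch; the class, method, and parameter names are illustrative
// and are not taken from the Ratis source.
final class SnapshotTrigger {
  private SnapshotTrigger() {}

  /**
   * Decides whether a snapshot is due, based on how far the state machine
   * itself has applied the log since the last snapshot.
   */
  static boolean shouldTakeSnapshot(long stateMachineLastAppliedIndex,
                                    long lastSnapshotIndex,
                                    long autoSnapshotThreshold) {
    if (autoSnapshotThreshold <= 0) {
      // Assumption: a non-positive threshold means automatic snapshots are off.
      return false;
    }
    return stateMachineLastAppliedIndex - lastSnapshotIndex >= autoSnapshotThreshold;
  }
}
```

With a threshold of 50 and the last snapshot at index 100, this sketch would report a snapshot as due once the state machine has applied index 150.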
[jira] [Created] (RATIS-871) Update to latest Ratis Snapshot 0.6.0-490b689-SNAPSHOT
Shashikant Banerjee created RATIS-871: - Summary: Update to latest Ratis Snapshot 0.6.0-490b689-SNAPSHOT Key: RATIS-871 URL: https://issues.apache.org/jira/browse/RATIS-871 Project: Ratis Issue Type: Bug Components: build Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee Fix For: 0.6.0 Update ozone to latest ratis snapshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-870) Fix TestFileStoreWithGrpc#testFileStore
Shashikant Banerjee created RATIS-870: - Summary: Fix TestFileStoreWithGrpc#testFileStore Key: RATIS-870 URL: https://issues.apache.org/jira/browse/RATIS-870 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee Fix For: 0.6.0 [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.examples.filestore/TestFileStoreWithGrpc/testFileStore/] {code:java} Error Message test timed out after 100 seconds Stacktrace org.junit.runners.model.TestTimedOutException: test timed out after 100 seconds{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-869) Fix TestServerRestartWithGrpc#testRestartFollower
[ https://issues.apache.org/jira/browse/RATIS-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-869: -- Summary: Fix TestServerRestartWithGrpc#testRestartFollower (was: Fix ) > Fix TestServerRestartWithGrpc#testRestartFollower > - > > Key: RATIS-869 > URL: https://issues.apache.org/jira/browse/RATIS-869 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Priority: Major > > The issue was discovered in > [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.grpc/TestServerRestartWithGrpc/testRestartFollower/] > {code:java} > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.ratis.server.ServerRestartTests.runTestRestartFollower(ServerRestartTests.java:122) > at > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) > at > org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) > at > org.apache.ratis.server.ServerRestartTests.testRestartFollower(ServerRestartTests.java:91) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-869) Fix
Shashikant Banerjee created RATIS-869: - Summary: Fix Key: RATIS-869 URL: https://issues.apache.org/jira/browse/RATIS-869 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee The issue was discovered in [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.grpc/TestServerRestartWithGrpc/testRestartFollower/] {code:java} java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.ratis.server.ServerRestartTests.runTestRestartFollower(ServerRestartTests.java:122) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) at org.apache.ratis.server.ServerRestartTests.testRestartFollower(ServerRestartTests.java:91) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-868) Fix TestRaftWithSimulatedRpc#testWithLoad
Shashikant Banerjee created RATIS-868: - Summary: Fix TestRaftWithSimulatedRpc#testWithLoad Key: RATIS-868 URL: https://issues.apache.org/jira/browse/RATIS-868 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.server.simulation/TestRaftWithSimulatedRpc/testWithLoad/] {code:java} org.junit.runners.model.TestTimedOutException: test timed out after 100 seconds {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-867) TestMetaServer#testListLogs
Shashikant Banerjee created RATIS-867: - Summary: TestMetaServer#testListLogs Key: RATIS-867 URL: https://issues.apache.org/jira/browse/RATIS-867 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee The issue was observed here: [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.logservice.server/TestMetaServer/testListLogs/] {code:java} java.lang.AssertionError: expected:<19> but was:<20> at org.apache.ratis.logservice.server.TestMetaServer.testJMXCount(TestMetaServer.java:339) at org.apache.ratis.logservice.server.TestMetaServer.testListLogs(TestMetaServer.java:331) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-866) Fix TestServerRestartWithSimulatedRpc#testRestartWithCorruptedLogHeader
Shashikant Banerjee created RATIS-866: - Summary: Fix TestServerRestartWithSimulatedRpc#testRestartWithCorruptedLogHeader Key: RATIS-866 URL: https://issues.apache.org/jira/browse/RATIS-866 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee Fix For: 0.6.0 [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.server.simulation/TestServerRestartWithSimulatedRpc/testRestartWithCorruptedLogHeader/] {code:java} attempt #2/10: java.lang.AssertionError: expected:<1> but was:<0>, sleep 100ms and then retry. java.lang.AssertionError: expected:<1> but was:<0> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:834) at org.junit.Assert.assertEquals(Assert.java:645) at org.junit.Assert.assertEquals(Assert.java:631) at org.apache.ratis.server.ServerRestartTests.getOpenLogFile(ServerRestartTests.java:179) at org.apache.ratis.server.ServerRestartTests.lambda$runTestRestartWithCorruptedLogHeader$2(ServerRestartTests.java:191) at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:160) at org.apache.ratis.util.JavaUtils.attemptRepeatedly(JavaUtils.java:146) at org.apache.ratis.server.ServerRestartTests.runTestRestartWithCorruptedLogHeader(ServerRestartTests.java:191) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) at org.apache.ratis.server.ServerRestartTests.testRestartWithCorruptedLogHeader(ServerRestartTests.java:185) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-865) Fix TestRaftWithGrpc#testStateMachineMetrics
Shashikant Banerjee created RATIS-865: - Summary: Fix TestRaftWithGrpc#testStateMachineMetrics Key: RATIS-865 URL: https://issues.apache.org/jira/browse/RATIS-865 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee The failure was observed here: [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.grpc/TestRaftWithGrpc/testStateMachineMetrics/] {code:java} org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) java.lang.AssertionError at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.ratis.RaftBasicTests.checkFollowerCommitLagsLeader(RaftBasicTests.java:494) at org.apache.ratis.RaftBasicTests.testStateMachineMetrics(RaftBasicTests.java:469) at org.apache.ratis.grpc.TestRaftWithGrpc.lambda$testStateMachineMetrics$1(TestRaftWithGrpc.java:65) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) at org.apache.ratis.grpc.TestRaftWithGrpc.testStateMachineMetrics(TestRaftWithGrpc.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-864) Fix TestRaftStateMachineExceptionWithGrpc#testRetryOnExceptionDuringReplication
Shashikant Banerjee created RATIS-864: - Summary: Fix TestRaftStateMachineExceptionWithGrpc#testRetryOnExceptionDuringReplication Key: RATIS-864 URL: https://issues.apache.org/jira/browse/RATIS-864 Project: Ratis Issue Type: Sub-task Reporter: Shashikant Banerjee The test failure was observed here: [https://builds.apache.org/job/PreCommit-RATIS-Build/1299/testReport/org.apache.ratis.grpc/TestRaftStateMachineExceptionWithGrpc/testRetryOnExceptionDuringReplication/] {code:java} org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) java.lang.NullPointerException at java.util.Objects.requireNonNull(Objects.java:203) at org.apache.ratis.server.impl.RaftStateMachineExceptionTests.runTestRetryOnExceptionDuringReplication(RaftStateMachineExceptionTests.java:170) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:125) at org.apache.ratis.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:113) at org.apache.ratis.server.impl.RaftStateMachineExceptionTests.testRetryOnExceptionDuringReplication(RaftStateMachineExceptionTests.java:145) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-863) Fix Ratis Unit Test Failures
Shashikant Banerjee created RATIS-863: - Summary: Fix Ratis Unit Test Failures Key: RATIS-863 URL: https://issues.apache.org/jira/browse/RATIS-863 Project: Ratis Issue Type: Task Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee Fix For: 0.6.0 There have been multiple unit test failures in Ratis of late. The aim here is to list every failure and try to fix them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-861) Define all third party jar versions in properties in pom.xml
[ https://issues.apache.org/jira/browse/RATIS-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-861: -- Description: Currently, many third-party library dependencies are hardcoded in the dependency tag in the pom file. The idea is to define the jar version as a property in the pom file and reuse the defined version in the dependency tag as done in https://issues.apache.org/jira/browse/RATIS-860. (was: Currently, many third-party library dependencies are hardcoded in the dependency tag in the pom file. The idea is to define the jar version as a property in the pom file and reuse the defined version in the dependency tag.) > Define all third party jar versions in properties in pom.xml > > > Key: RATIS-861 > URL: https://issues.apache.org/jira/browse/RATIS-861 > Project: Ratis > Issue Type: Bug >Reporter: Shashikant Banerjee >Priority: Major > > Currently, many third-party library dependencies are hardcoded in the dependency tag > in the pom file. The idea is to define the jar version as a property in the pom > file and reuse the defined version in the dependency tag as done in > https://issues.apache.org/jira/browse/RATIS-860. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-862) Add third party jar versions as properties in pom.xml
Shashikant Banerjee created RATIS-862: - Summary: Add third party jar versions as properties in pom.xml Key: RATIS-862 URL: https://issues.apache.org/jira/browse/RATIS-862 Project: Ratis Issue Type: Bug Reporter: Shashikant Banerjee Currently, many third-party library dependencies are hardcoded in the dependency tag in the pom file. The idea is to re-organize the structure a bit by defining the jar version as a property in the pom file and reusing the defined version in the dependency tag. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-861) Define all third party jar versions in properties in pom.xml
Shashikant Banerjee created RATIS-861: - Summary: Define all third party jar versions in properties in pom.xml Key: RATIS-861 URL: https://issues.apache.org/jira/browse/RATIS-861 Project: Ratis Issue Type: Bug Reporter: Shashikant Banerjee Currently, many third-party library dependencies are hardcoded in the dependency tag in the pom file. The idea is to define the jar version as a property in the pom file and reuse the defined version in the dependency tag. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-860) Organize log4j dependency in pom.xml
[ https://issues.apache.org/jira/browse/RATIS-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-860: -- Attachment: RATIS-860.000.patch > Organize log4j dependency in pom.xml > > > Key: RATIS-860 > URL: https://issues.apache.org/jira/browse/RATIS-860 > Project: Ratis > Issue Type: Bug > Components: build >Reporter: Shashikant Banerjee >Assignee: Shashikant Banerjee >Priority: Major > Fix For: 0.6.0 > > Attachments: RATIS-860.000.patch > > > Currently, dependency of log4j in ozone is added as follows: > {code:java} > <dependency> > <groupId>log4j</groupId> > <artifactId>log4j</artifactId> > <version>1.2.17</version> > </dependency> > {code} > Idea here is to add log4j.version as a property in pom.xml and reuse the same > while defining the dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-860) Organize log4j dependency in pom.xml
Shashikant Banerjee created RATIS-860: - Summary: Organize log4j dependency in pom.xml Key: RATIS-860 URL: https://issues.apache.org/jira/browse/RATIS-860 Project: Ratis Issue Type: Bug Components: build Reporter: Shashikant Banerjee Assignee: Shashikant Banerjee Fix For: 0.6.0 Currently, dependency of log4j in ozone is added as follows: {code:java} <dependency> <groupId>log4j</groupId> <artifactId>log4j</artifactId> <version>1.2.17</version> </dependency> {code} Idea here is to add log4j.version as a property in pom.xml and reuse the same while defining the dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-859) Infinite leader election in ozone
[ https://issues.apache.org/jira/browse/RATIS-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087880#comment-17087880 ] Shashikant Banerjee commented on RATIS-859: --- I think it's something we might need to handle in Ozone rather than in Ratis. Thanks for reporting this. > Infinite leader election in ozone > - > > Key: RATIS-859 > URL: https://issues.apache.org/jira/browse/RATIS-859 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png > > > I also opened the same Jira in Ozone: > https://issues.apache.org/jira/browse/HDDS-3459. I think both Ozone and Ratis > should prevent this from happening. > *What's the problem ?* > There are 3 datanodes in a group: leader, follower1, follower2. The steps to > reproduce the problem are as follows: > 1. follower2 reports close pipeline > 2. SCM sends the close pipeline command > 3. leader and follower1 remove the group, but follower2 hits a socket timeout and does > not remove the group > 4. follower2 then begins an infinite LeaderElection lasting at least 6 hours; leader and > follower1 respond with group not found > You can see this in the following screenshots. > 1. follower2 reports close pipeline > !screenshot-1.png! > 2. SCM closes the pipeline: > !screenshot-2.png! > !screenshot-3.png! > 3. leader removes the group > !screenshot-4.png! >follower1 removes the group > !screenshot-5.png! > follower2 socket timeout > !screenshot-6.png! > 4. follower2 then begins an infinite LeaderElection lasting at least 6 hours > !screenshot-7.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-853) Unordered Client request should not sleep when NotLeaderException provides leader information
[ https://issues.apache.org/jira/browse/RATIS-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087864#comment-17087864 ] Shashikant Banerjee commented on RATIS-853: --- Thanks [~ljain] for working on this. The patch looks good. A few minor comments inline: 1) Can we move the getEffectiveRetryPolicy function from RetryPolicies.java to some utility class like clientImplUtils? 2) Going forward, I am hoping retryForeverForNoSleep will not be used for any exception in any case, right? Can we add a comment or TODO stating the same? > Unordered Client request should not sleep when NotLeaderException provides > leader information > - > > Key: RATIS-853 > URL: https://issues.apache.org/jira/browse/RATIS-853 > Project: Ratis > Issue Type: Bug > Components: client >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Attachments: RATIS-853.001.patch > > > When NotLeaderException provides leader information, the client request > should be retried immediately on the suggested leader. Currently, unordered > requests in the raft client use the default policy to determine sleep time and > thus may sleep even if NotLeaderException provides leader information. -- This message was sent by Atlassian Jira (v8.3.4#803005)
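The behaviour requested in RATIS-853 might be sketched as below. This is a minimal illustration, not the patch itself: NotLeaderInfo and RetrySleep are hypothetical stand-ins (they are not the Ratis classes), and the only rule shown is that a retry does not sleep when the failure carries a suggested leader.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical stand-in for the leader hint carried by a not-leader failure;
// this is not the Ratis NotLeaderException class.
final class NotLeaderInfo {
  private final String suggestedLeader; // null when the server gave no hint

  NotLeaderInfo(String suggestedLeader) {
    this.suggestedLeader = suggestedLeader;
  }

  Optional<String> getSuggestedLeader() {
    return Optional.ofNullable(suggestedLeader);
  }
}

// Hypothetical helper that picks the sleep time before the next attempt.
final class RetrySleep {
  private RetrySleep() {}

  static Duration sleepTime(NotLeaderInfo cause, Duration defaultSleep) {
    if (cause != null && cause.getSuggestedLeader().isPresent()) {
      // A leader was suggested, so retry on it immediately.
      return Duration.ZERO;
    }
    // No hint available: fall back to the default policy's sleep time.
    return defaultSleep;
  }
}
```

In this sketch the default sleep time is passed in by the caller, so the same helper could sit in front of any retry policy.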
[jira] [Commented] (RATIS-857) Thread unsafe HashMap in multi thread
[ https://issues.apache.org/jira/browse/RATIS-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087870#comment-17087870 ] Shashikant Banerjee commented on RATIS-857: --- The patch looks good. Waiting for Jenkins. > Thread unsafe HashMap in multi thread > - > > Key: RATIS-857 > URL: https://issues.apache.org/jira/browse/RATIS-857 > Project: Ratis > Issue Type: Bug >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-857.001.patch > > > *What's the problem ?* > The {color:#DE350B}static{color} variable > [RaftServerMetrics::metricsMap|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L71] > is of type HashMap, which is not thread safe. But entries are > [put|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerMetrics.java#L76] > into metricsMap by different threads when each RaftServerImpl > instance is created. -- This message was sent by Atlassian Jira (v8.3.4#803005)
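One common way to make such a static registry safe is to back it with a ConcurrentHashMap and create entries through computeIfAbsent. The sketch below assumes that approach; it is not necessarily what RATIS-857.001.patch does, and MetricsRegistry and getOrCreate are hypothetical names rather than the RaftServerMetrics API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry; the names are illustrative and are not the
// RaftServerMetrics API.
final class MetricsRegistry {
  // ConcurrentHashMap allows concurrent puts from several server threads.
  private static final Map<String, MetricsRegistry> METRICS_MAP = new ConcurrentHashMap<>();

  private final String serverId;

  private MetricsRegistry(String serverId) {
    this.serverId = serverId;
  }

  /** Returns the single instance for the given server id, creating it atomically if absent. */
  static MetricsRegistry getOrCreate(String serverId) {
    return METRICS_MAP.computeIfAbsent(serverId, MetricsRegistry::new);
  }

  String getServerId() {
    return serverId;
  }
}
```

Because computeIfAbsent is atomic on ConcurrentHashMap, two threads asking for the same server id receive the same instance.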
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-840: -- Priority: Critical (was: Major) > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Critical > Attachments: image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png > > > *What's the problem ?* > After running hadoop-ozone for 4 days, the datanode leaks memory. In a heap dump, I > found 460710 instances of GrpcLogAppender, but only 6 > instances of SenderList, and each SenderList contains 1-2 instances of > GrpcLogAppender. There are also a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. > {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > > So a lot of GrpcLogAppender instances did not stop their daemon thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *Why is > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > called so many times ?* > 1. As the image shows, when a group is removed, SegmentedRaftLog is closed, and > GrpcLogAppender throws an exception when it finds that the SegmentedRaftLog was closed. 
> The GrpcLogAppender is then > [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], > and the new GrpcLogAppender throws an exception again when it finds that the > SegmentedRaftLog was closed, so GrpcLogAppender is restarted again ... > . This results in an infinite restart of GrpcLogAppender. > 2. Actually, when a group is removed, GrpcLogAppender is stopped: > RaftServerImpl::shutdown -> > [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] > -> LeaderState::stop -> LogAppender::stopAppender, and then SegmentedRaftLog > is closed: RaftServerImpl::shutdown -> > [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] > ... . Although RoleInfo::shutdownLeaderState is called before ServerState:close, > the GrpcLogAppender is stopped asynchronously. So the infinite restart of > GrpcLogAppender happens when GrpcLogAppender stops after SegmentedRaftLog > is closed. > !screenshot-1.png! > *Why did GrpcLogAppender not stop the daemon thread when removed from senders > ?* > {color:#DE350B}Still working on this. The previous patch has some problems, and I will > submit it again.{color} > *Can the new GrpcLogAppender work normally ?* > 1. Even without the above problem, the newly created GrpcLogAppender > still cannot work normally. > 2. When a new GrpcLogAppender is created, a new FollowerInfo is also created: > LeaderState::addAndStartSenders -> > LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new > FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] > 3. When the newly created GrpcLogAppender appends an entry to the follower, the > follower responds SUCCESS. > 4. 
Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] > -> > [voterLists.get(0) | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. > {color:#DE350B}The error happens because voterLists.get(0) returns the > FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new > GrpcLogAppender. {color} > 5. Because the majority commit obtained from the FollowerInfo of the old > GrpcLogAppender never changes, the leader cannot update the commit even though the follower has appended the entry > successfully. So the newly created > GrpcLogAppender can never work normally. > 6. The reason the unit test runTestRestartLogAppender can pass is that it > does not stop the old GrpcLogAppender, and the old GrpcLogAppender append
[jira] [Commented] (RATIS-851) Raft Client should not change leader on ResourceUnavailableException
[ https://issues.apache.org/jira/browse/RATIS-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084759#comment-17084759 ] Shashikant Banerjee commented on RATIS-851: --- Thanks [~ljain] for the contribution. I have committed this. > Raft Client should not change leader on ResourceUnavailableException > > > Key: RATIS-851 > URL: https://issues.apache.org/jira/browse/RATIS-851 > Project: Ratis > Issue Type: Bug > Components: client >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Attachments: RATIS-851.001.patch > > > Currently raft client changes the leader on receiving > ResourceUnavailableException. It should not change the leader as the > exception only signifies load on the leader. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-851) Raft Client should not change leader on ResourceUnavailableException
[ https://issues.apache.org/jira/browse/RATIS-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084757#comment-17084757 ] Shashikant Banerjee commented on RATIS-851: --- Thanks [~ljain] for working on this. The changes look good to me. I am +1 on this. > Raft Client should not change leader on ResourceUnavailableException > > > Key: RATIS-851 > URL: https://issues.apache.org/jira/browse/RATIS-851 > Project: Ratis > Issue Type: Bug > Components: client >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Attachments: RATIS-851.001.patch > > > Currently raft client changes the leader on receiving > ResourceUnavailableException. It should not change the leader as the > exception only signifies load on the leader. -- This message was sent by Atlassian Jira (v8.3.4#803005)
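The rule described in RATIS-851 could be sketched as a small predicate. The classes below are hypothetical stand-ins defined only for this illustration (the exception here is not the Ratis class of the same name), and treating every other failure as a reason to change leader is an assumption of the sketch, not something the issue states.

```java
// Hypothetical stand-in for the exception a busy leader would send back;
// it is defined here only so the sketch is self-contained.
class ResourceUnavailableException extends Exception {
  ResourceUnavailableException(String message) {
    super(message);
  }
}

// Hypothetical helper deciding whether the client should try another server.
final class LeaderChangePolicy {
  private LeaderChangePolicy() {}

  static boolean shouldChangeLeader(Throwable cause) {
    // A busy leader is still the leader: keep the current leader and retry.
    // Assumption: any other failure is treated as a reason to try another peer.
    return !(cause instanceof ResourceUnavailableException);
  }
}
```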
[jira] [Commented] (RATIS-832) Add Metrics for retry cache count as well as size in bytes
[ https://issues.apache.org/jira/browse/RATIS-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083929#comment-17083929 ] Shashikant Banerjee commented on RATIS-832: --- Thanks [~yjxxtd] for the contribution. I have committed this. > Add Metrics for retry cache count as well as size in bytes > -- > > Key: RATIS-832 > URL: https://issues.apache.org/jira/browse/RATIS-832 > Project: Ratis > Issue Type: Sub-task > Components: server >Affects Versions: 0.6.0 >Reporter: Shashikant Banerjee >Assignee: runzhiwang >Priority: Major > Attachments: RATIS-832.001.patch, RATIS-832.002.patch > > Time Spent: 50m > Remaining Estimate: 0h
[jira] [Updated] (RATIS-834) Add metrics for stateMachine cache count and size in bytes
[ https://issues.apache.org/jira/browse/RATIS-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-834: -- Parent: (was: RATIS-646) Issue Type: Bug (was: Sub-task) > Add metrics for stateMachine cache count and size in bytes > -- > > Key: RATIS-834 > URL: https://issues.apache.org/jira/browse/RATIS-834 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Shashikant Banerjee >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0
[jira] [Commented] (RATIS-834) Add metrics for stateMachine cache count and size in bytes
[ https://issues.apache.org/jira/browse/RATIS-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083914#comment-17083914 ] Shashikant Banerjee commented on RATIS-834: --- Thanks [~yjxxtd]. The metric should be there in Ozone where the cache is maintained in ContainerStateMachine. > Add metrics for stateMachine cache count and size in bytes > -- > > Key: RATIS-834 > URL: https://issues.apache.org/jira/browse/RATIS-834 > Project: Ratis > Issue Type: Sub-task > Components: server >Reporter: Shashikant Banerjee >Assignee: runzhiwang >Priority: Major > Fix For: 0.6.0
[jira] [Commented] (RATIS-835) Include exception based attempt count in raft client request
[ https://issues.apache.org/jira/browse/RATIS-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082203#comment-17082203 ] Shashikant Banerjee commented on RATIS-835: --- [~ljain], thanks for reporting and working on this. Would it make more sense to maintain the exception based attempt count inside the exception dependent retry policy class itself instead of clientRetryEvent, as it is very specific to this policy? Every time the exception policy is queried for the retry policy of a specific exception, the attempt count can be incremented; alternatively, whenever shouldRetry() returns true, the attempt counter of that specific exception inside the exceptionDependentRetryPolicy map can be incremented. > Include exception based attempt count in raft client request > > > Key: RATIS-835 > URL: https://issues.apache.org/jira/browse/RATIS-835 > Project: Ratis > Issue Type: Bug > Components: client >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Attachments: RATIS-835.001.patch, RATIS-835.002.patch, > RATIS-835.003.patch > > > Client needs to maintain exception based attempt count for using Exception > Dependent retry policy. Exception dependent policy helps in specifying > individual policies for different exception types. > Currently the policy takes the number of attempts as an argument. Therefore the > individual policies require attempt counts for the particular exception while > handling a retry event. This is particularly important for using the > MultipleLinearRandomRetry policy, which increases the sleep interval based on the > number of attempts made by the client. Raft Client can therefore use this > policy for ResourceUnavailableException and increase the sleep interval for > subsequent retries of the request on the same exception.
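The suggestion in the comment above, keeping the per-exception attempt counter inside the exception dependent policy itself, could look roughly like the following. This is a minimal sketch under stated assumptions, not the Ratis RetryPolicy API: the class ExceptionAttemptCountingPolicy, the Rule interface, and the methods register, onFailure and attemptCount are all hypothetical names introduced for illustration. It also assumes one policy instance per client request, so that counts from different requests are not mixed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a policy that tracks attempt counts per exception type by itself. */
final class ExceptionAttemptCountingPolicy {
  /** Hypothetical rule: maps the attempt number for one exception type to a sleep time in ms. */
  interface Rule {
    long sleepMillis(int attempt);
  }

  private final Map<Class<? extends Throwable>, Rule> rules = new ConcurrentHashMap<>();
  private final Map<Class<? extends Throwable>, AtomicInteger> attempts = new ConcurrentHashMap<>();
  private final Rule defaultRule;

  ExceptionAttemptCountingPolicy(Rule defaultRule) {
    this.defaultRule = defaultRule;
  }

  /** Registers the rule used for one exact exception class. */
  void register(Class<? extends Throwable> type, Rule rule) {
    rules.put(type, rule);
  }

  /** Increments the counter of the failure's exception class and asks its rule for the sleep time. */
  long onFailure(Throwable cause) {
    Class<? extends Throwable> type = cause.getClass();
    int attempt = attempts.computeIfAbsent(type, k -> new AtomicInteger()).incrementAndGet();
    return rules.getOrDefault(type, defaultRule).sleepMillis(attempt);
  }

  /** Returns how many failures of this exact exception class have been seen so far. */
  int attemptCount(Class<? extends Throwable> type) {
    AtomicInteger n = attempts.get(type);
    return n == null ? 0 : n.get();
  }
}
```

With the counter held here, the caller does not need to carry per-exception counts in the retry event; a rule that grows its sleep interval with the attempt number only sees the attempts made for its own exception type.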
[jira] [Commented] (RATIS-842) GrpcClientProtocolClient uses NotLeaderException event for LeaderNotReadyException
[ https://issues.apache.org/jira/browse/RATIS-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082189#comment-17082189 ] Shashikant Banerjee commented on RATIS-842: --- Thanks [~ljain] for reporting and working on this. The patch looks good to me. I am +1 on this patch. > GrpcClientProtocolClient uses NotLeaderException event for > LeaderNotReadyException > -- > > Key: RATIS-842 > URL: https://issues.apache.org/jira/browse/RATIS-842 > Project: Ratis > Issue Type: Bug > Components: client, gRPC >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Minor > Attachments: RATIS-842.001.patch > > > GrpcClientProtocolClient uses NotLeaderException event for > LeaderNotReadyException. It should be changed to LeaderNotReadyException.
[jira] [Commented] (RATIS-832) Add Metrics for retry cache count as well as size in bytes
[ https://issues.apache.org/jira/browse/RATIS-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076157#comment-17076157 ] Shashikant Banerjee commented on RATIS-832: --- Thanks [~runzhiwang] for working on this. The changes look good to me. Some comments inline: The RetryCacheHit metric is already available in Ratis, so I think the new one needs to be removed. {code:java} public static final String RETRY_REQUEST_CACHE_HIT_COUNTER = "numRetryCacheHits"; {code} Also, RetryCache is maintained per RaftServerImpl instance. Can we maintain RetryCacheMetrics inside RaftServerMetrics itself? > Add Metrics for retry cache count as well as size in bytes > -- > > Key: RATIS-832 > URL: https://issues.apache.org/jira/browse/RATIS-832 > Project: Ratis > Issue Type: Sub-task > Components: server >Affects Versions: 0.6.0 >Reporter: Shashikant Banerjee >Priority: Major > Attachments: RATIS-832.001.patch > > Time Spent: 50m > Remaining Estimate: 0h
[jira] [Created] (RATIS-834) Add metrics for stateMachine cache count and size in bytes
Shashikant Banerjee created RATIS-834: - Summary: Add metrics for stateMachine cache count and size in bytes Key: RATIS-834 URL: https://issues.apache.org/jira/browse/RATIS-834 Project: Ratis Issue Type: Sub-task Components: server Reporter: Shashikant Banerjee Fix For: 0.6.0
[jira] [Created] (RATIS-833) Add metrics for raft log cache count and size in bytes
Shashikant Banerjee created RATIS-833: - Summary: Add metrics for raft log cache count and size in bytes Key: RATIS-833 URL: https://issues.apache.org/jira/browse/RATIS-833 Project: Ratis Issue Type: Sub-task Components: server Reporter: Shashikant Banerjee Fix For: 0.6.0
[jira] [Updated] (RATIS-831) Add a metric to track count of requests failing with ResourceUnavailable exception
[ https://issues.apache.org/jira/browse/RATIS-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-831: -- Parent: RATIS-646 Issue Type: Sub-task (was: Bug) > Add a metric to track count of requests failing with ResourceUnavailable > exception > -- > > Key: RATIS-831 > URL: https://issues.apache.org/jira/browse/RATIS-831 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Assignee: Nanda kumar >Priority: Major > > The idea is to determine the count of requests rejected by a server because of > server overload.
[jira] [Updated] (RATIS-830) Add a metric for tracking failed client requests on a server
[ https://issues.apache.org/jira/browse/RATIS-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-830: -- Parent: RATIS-646 Issue Type: Sub-task (was: Bug) > Add a metric for tracking failed client requests on a server > > > Key: RATIS-830 > URL: https://issues.apache.org/jira/browse/RATIS-830 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Priority: Major > > This metric will track the failed count for all types of Ratis requests: > WriteType, ReadType and WatchType.
[jira] [Updated] (RATIS-832) Add Metrics for retry cache count as well as size in bytes
[ https://issues.apache.org/jira/browse/RATIS-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashikant Banerjee updated RATIS-832: -- Parent: RATIS-646 Issue Type: Sub-task (was: Bug) > Add Metrics for retry cache count as well as size in bytes > -- > > Key: RATIS-832 > URL: https://issues.apache.org/jira/browse/RATIS-832 > Project: Ratis > Issue Type: Sub-task > Components: server >Affects Versions: 0.6.0 >Reporter: Shashikant Banerjee >Priority: Major