[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merging LeaderState::checkLeadership(), some test cases, such as GroupManagementBaseTest and TestMultiRaftGroup, have become hard to pass under GitHub CI. These cases perform node restarts or membership changes, which make the leader vulnerable, especially when resources are limited.

[https://github.com/apache/incubator-ratis/runs/927310008?check_suite_focus=true]
[https://github.com/apache/incubator-ratis/runs/926606136?check_suite_focus=true]

!image-2020-07-31-12-33-35-755.png!

!image-2020-07-31-12-34-08-384.png!

!image-2020-07-31-12-40-11-183.png!

The current workaround is to enlarge the election timeout a little, e.g., from [150ms, 300ms] to [300ms, 600ms], because a larger election timeout makes the leader more stable.

TODO:
1) Is there a better way than changing the election timeout?
2) Are other test cases affected by LeaderState::checkLeadership()?

  was:
After merging LeaderState::checkLeadership(), some test cases, such as GroupManagementBaseTest and TestMultiRaftGroup, have become hard to pass under GitHub CI. These cases perform node restarts or membership changes, which make the leader vulnerable, especially when resources are limited.

!image-2020-07-31-12-33-35-755.png!

!image-2020-07-31-12-34-08-384.png!

!image-2020-07-31-12-40-11-183.png!

The current workaround is to enlarge the election timeout a little, e.g., from [150ms, 300ms] to [300ms, 600ms], because a larger election timeout makes the leader more stable.

TODO:
1) Is there a better way than changing the election timeout?
2) Are other test cases affected by LeaderState::checkLeadership()?

> checkLeadership() may make some test case become flaky under GitHub CI.
> -----------------------------------------------------------------------
>
> Key: RATIS-1014
> URL: https://issues.apache.org/jira/browse/RATIS-1014
> Project: Ratis
> Issue Type: Test
> Affects Versions: 1.1.0
> Reporter: Glen Geng
> Assignee: Glen Geng
> Priority: Major
> Attachments: image-2020-07-31-12-33-35-755.png, image-2020-07-31-12-34-08-384.png, image-2020-07-31-12-40-11-183.png
>
> After merging LeaderState::checkLeadership(), some test cases, such as GroupManagementBaseTest and TestMultiRaftGroup, have become hard to pass under GitHub CI. These cases perform node restarts or membership changes, which make the leader vulnerable, especially when resources are limited.
>
> [https://github.com/apache/incubator-ratis/runs/927310008?check_suite_focus=true]
> [https://github.com/apache/incubator-ratis/runs/926606136?check_suite_focus=true]
>
> !image-2020-07-31-12-33-35-755.png!
>
> !image-2020-07-31-12-34-08-384.png!
>
> !image-2020-07-31-12-40-11-183.png!
>
> The current workaround is to enlarge the election timeout a little, e.g., from [150ms, 300ms] to [300ms, 600ms], because a larger election timeout makes the leader more stable.
>
> TODO:
> 1) Is there a better way than changing the election timeout?
> 2) Are other test cases affected by LeaderState::checkLeadership()?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
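To make the reasoning behind the workaround concrete, here is a minimal sketch in Java. It assumes the election timeout is drawn uniformly at random from the configured range, as in standard Raft; the class and method names (ElectionTimeoutSketch, randomTimeoutMs, alwaysToleratesStall) and the 250 ms stall are hypothetical and are not Ratis APIs or measured values. It shows that a pause shorter than the lower bound of the range can never exceed a drawn timeout, which is why raising the range from [150ms, 300ms] to [300ms, 600ms] tolerates longer stalls on a loaded CI runner.

```java
import java.util.Random;

/** Sketch only: class and method names are hypothetical, not Ratis APIs. */
final class ElectionTimeoutSketch {
  private ElectionTimeoutSketch() {}

  /** Draws a timeout uniformly from [minMs, maxMs], both ends inclusive. */
  static int randomTimeoutMs(int minMs, int maxMs, Random random) {
    if (minMs <= 0 || maxMs < minMs) {
      throw new IllegalArgumentException("invalid range: [" + minMs + ", " + maxMs + "]");
    }
    return minMs + random.nextInt(maxMs - minMs + 1);
  }

  /**
   * A stall is harmless for every possible draw only if it is shorter
   * than the smallest timeout the range can produce.
   */
  static boolean alwaysToleratesStall(long stallMs, int minMs) {
    return stallMs < minMs;
  }

  public static void main(String[] args) {
    long stallMs = 250; // assumed pause on a loaded CI runner, for illustration
    System.out.println("[150ms, 300ms] always tolerates the stall: "
        + alwaysToleratesStall(stallMs, 150));
    System.out.println("[300ms, 600ms] always tolerates the stall: "
        + alwaysToleratesStall(stallMs, 300));
    System.out.println("sample timeout: " + randomTimeoutMs(300, 600, new Random()) + "ms");
  }
}
```

The trade-off of the larger range is slower failure detection: a genuinely dead leader is also noticed later.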
[jira] [Resolved] (RATIS-981) Step-down stale leader in case of split-brain
[ https://issues.apache.org/jira/browse/RATIS-981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lokesh Jain resolved RATIS-981.
-------------------------------
    Fix Version/s: 1.1.0
       Resolution: Fixed

> Step-down stale leader in case of split-brain
> ---------------------------------------------
>
> Key: RATIS-981
> URL: https://issues.apache.org/jira/browse/RATIS-981
> Project: Ratis
> Issue Type: Improvement
> Reporter: Nanda kumar
> Assignee: Glen Geng
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.1.0
>
> Time Spent: 4h
> Remaining Estimate: 0h
>
> We should make sure that a stale leader steps down to the candidate state before the next leader election.
>
> Proposal:
> In the heartbeat thread on the leader node, check how long ago each follower last responded. If, for a majority of the followers, the time since the last response is less than the leader election timeout, the current leader is still the active leader: a majority of the followers are still heartbeating to it, so there cannot be a new leader.
> If, for a majority of the followers, the time since the last response is greater than the leader election timeout, the current leader should step down and become a candidate.
> With this check, we can be sure that, in case of a network partition, the current leader steps down and becomes a candidate before a new leader election starts.
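The proposed check can be sketched as follows. This is an illustration of the idea in the proposal, not the code that was merged into Ratis: the class LeadershipCheck and the method hasMajority are hypothetical names, and the sketch assumes that the leader counts itself toward the quorum of the whole group, as is usual in Raft, rather than requiring a strict majority among the followers alone.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: names are hypothetical and this is not the Ratis implementation. */
final class LeadershipCheck {
  private LeadershipCheck() {}

  /**
   * @param followerSilenceMs time elapsed since each follower's last response, in ms
   * @param electionTimeoutMs the leader election timeout, in ms
   * @return true if the leader plus the recently heard followers form a majority
   *         of the group, i.e. the leader may keep its role
   */
  static boolean hasMajority(List<Long> followerSilenceMs, long electionTimeoutMs) {
    long responsive = followerSilenceMs.stream()
        .filter(silence -> silence < electionTimeoutMs)
        .count();
    int groupSize = followerSilenceMs.size() + 1; // followers plus the leader itself
    // Assumption: the leader counts as one vote for itself.
    return responsive + 1 > groupSize / 2;
  }

  public static void main(String[] args) {
    long electionTimeoutMs = 300;
    // Three-node group: one follower answered 50 ms ago, the other 900 ms ago.
    List<Long> healthy = Arrays.asList(50L, 900L);
    // Partitioned leader: neither follower has answered within the timeout.
    List<Long> partitioned = Arrays.asList(400L, 900L);
    System.out.println("healthy leader keeps role: " + hasMajority(healthy, electionTimeoutMs));
    System.out.println("partitioned leader keeps role: " + hasMajority(partitioned, electionTimeoutMs));
  }
}
```

A heartbeat loop would call such a check periodically and step down when it returns false. A follower that is merely slow, for example on a busy CI machine, is indistinguishable from a partitioned one under this rule.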
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Attachment: image-2020-07-31-12-40-11-183.png
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.

!image-2020-07-31-12-33-35-755.png!

!image-2020-07-31-12-34-08-384.png!

!image-2020-07-31-12-40-11-183.png!

Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides changing election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.

!image-2020-07-31-12-33-35-755.png!

!image-2020-07-31-12-34-08-384.png!

Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides changing election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Attachment: image-2020-07-31-12-34-08-384.png
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.

!image-2020-07-31-12-33-35-755.png!

!image-2020-07-31-12-34-08-384.png!

Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides changing election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.

Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides changing election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Attachment: image-2020-07-31-12-33-35-755.png
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.

Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides changing election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.

Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.

Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable, especially resources is limited.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cased affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], since larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cased affected by LeaderState::checkLeadership() ?
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cases affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], because larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cased affected by LeaderState::checkLeadership() ?
[jira] [Assigned] (RATIS-624) RaftServer should support pause/ unpause in its LifeCycle state
[ https://issues.apache.org/jira/browse/RATIS-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal reassigned RATIS-624:
-----------------------------------
    Assignee: Rui Wang

> RaftServer should support pause/ unpause in its LifeCycle state
> --------------------------------------------------------------
>
> Key: RATIS-624
> URL: https://issues.apache.org/jira/browse/RATIS-624
> Project: Ratis
> Issue Type: Task
> Reporter: Hanisha Koneru
> Assignee: Rui Wang
> Priority: Major
> Labels: ozone
> Fix For: 1.1.0
>
> This Jira aims to add pause and unpause support to the RaftServer state. When paused, the RaftServer should not accept any incoming append log entries.
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
Current walk around is to enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], since larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cased affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
The walk around is enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], since larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cased affected by LeaderState::checkLeadership() ?
[jira] [Commented] (RATIS-624) RaftServer should support pause/ unpause in its LifeCycle state
[ https://issues.apache.org/jira/browse/RATIS-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168377#comment-17168377 ]

Arpit Agarwal commented on RATIS-624:
-------------------------------------

[~amaliujia] I have added you as a Ratis contributor. You will be able to assign issues to yourself now. Welcome aboard!
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated RATIS-1014:
-----------------------------
    Description:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup. Such case need do node restart operation or membership change operation, which will make leader vulnerable.
The walk around is enlarge election timeout a little bit, e.g., from [150ms,300ms] to [300ms, 600ms], since larger election timeout will make leader become more stable.

TODO:
1) do we have better way besides change election timeout ?
2) are there other test cased affected by LeaderState::checkLeadership() ?

  was:
After merge LeaderState::checkLeadership(), some test case become hard to pass under GitHub CI, such as GroupManagementBaseTest and TestMultiRaftGroup.
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glen Geng updated RATIS-1014: - Affects Version/s: 1.1.0 > checkLeadership() may make some test case become flaky under GitHub CI. > --- > > Key: RATIS-1014 > URL: https://issues.apache.org/jira/browse/RATIS-1014 > Project: Ratis > Issue Type: Test >Affects Versions: 1.1.0 >Reporter: Glen Geng >Assignee: Glen Geng >Priority: Major > > After LeaderState::checkLeadership() was merged, some test cases, such as > GroupManagementBaseTest and TestMultiRaftGroup, became hard to pass under > GitHub CI. These cases perform node restarts or membership changes, which > make the leader vulnerable. > The workaround is to enlarge the election timeout a little, e.g., from > [150ms, 300ms] to [300ms, 600ms], since a larger election timeout makes the > leader more stable. > > TODO: > 1) Is there a better way besides changing the election timeout? > 2) Are other test cases affected by LeaderState::checkLeadership()? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glen Geng updated RATIS-1014: - Description: After LeaderState::checkLeadership() was merged, some test cases, such as GroupManagementBaseTest and TestMultiRaftGroup, became hard to pass under GitHub CI. was: After merge > checkLeadership() may make some test case become flaky under GitHub CI. > --- > > Key: RATIS-1014 > URL: https://issues.apache.org/jira/browse/RATIS-1014 > Project: Ratis > Issue Type: Test >Reporter: Glen Geng >Assignee: Glen Geng >Priority: Major > > After LeaderState::checkLeadership() was merged, some test cases, such as > GroupManagementBaseTest and TestMultiRaftGroup, became hard to pass under > GitHub CI. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glen Geng updated RATIS-1014: - Description: After merge was: We should make sure that the stale leader steps down to the candidate state before the next leader election. Proposal: In the heartbeat thread in the Leader node, we should check if the last response time of the follower is less than the leader election timeout. If the majority of the follower’s last response time is less than the leader election timeout, the current leader is still the active leader. Majority of the followers are heartbeating to the current leader, so there can’t be a new leader. If the majority of follower’s last response time is greater than the leader election timeout, the current leader should step down and become a candidate. With this check, we can be sure that the current leader will step down and become a candidate before the new leader election starts in case of a network partition. > checkLeadership() may make some test case become flaky under GitHub CI. > --- > > Key: RATIS-1014 > URL: https://issues.apache.org/jira/browse/RATIS-1014 > Project: Ratis > Issue Type: Test >Reporter: Glen Geng >Assignee: Glen Geng >Priority: Major > > After merge > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
Glen Geng created RATIS-1014: Summary: checkLeadership() may make some test case become flaky under GitHub CI. Key: RATIS-1014 URL: https://issues.apache.org/jira/browse/RATIS-1014 Project: Ratis Issue Type: Improvement Reporter: Glen Geng Assignee: Glen Geng We should make sure that the stale leader steps down to the candidate state before the next leader election. Proposal: In the heartbeat thread in the Leader node, we should check if the last response time of the follower is less than the leader election timeout. If the majority of the follower’s last response time is less than the leader election timeout, the current leader is still the active leader. Majority of the followers are heartbeating to the current leader, so there can’t be a new leader. If the majority of follower’s last response time is greater than the leader election timeout, the current leader should step down and become a candidate. With this check, we can be sure that the current leader will step down and become a candidate before the new leader election starts in case of a network partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
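The proposal above can be sketched roughly as follows. This is a minimal illustration, not the actual LeaderState::checkLeadership() implementation: the class and method names are hypothetical, "last response time" is interpreted as the time elapsed since a follower's last response, and the leader is assumed to count itself toward the majority (the description does not say whether it does).

```java
import java.util.List;

/** Hypothetical sketch of the proposed leadership check; not the Ratis implementation. */
final class LeadershipCheckSketch {
  private LeadershipCheckSketch() {}

  /**
   * @param followerLastResponseMs time (ms) at which each follower last responded
   * @param nowMs                  current time (ms) on the same clock
   * @param electionTimeoutMs      leader election timeout (ms)
   * @return true if the leader should stay leader, false if it should step down
   */
  static boolean hasMajority(List<Long> followerLastResponseMs, long nowMs, long electionTimeoutMs) {
    final int clusterSize = followerLastResponseMs.size() + 1; // followers plus the leader
    int active = 1; // assumption: the leader counts itself as active
    for (long lastResponseMs : followerLastResponseMs) {
      // A follower is active if it responded within the election timeout.
      if (nowMs - lastResponseMs < electionTimeoutMs) {
        active++;
      }
    }
    // Strict majority of the whole cluster.
    return active > clusterSize / 2;
  }
}
```

In a three-node group with a 300 ms timeout, one follower that responded 100 ms ago is enough to keep leadership under these assumptions; if both followers have been silent for longer than the timeout, the check fails and the leader would step down.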
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glen Geng updated RATIS-1014: - Labels: (was: pull-request-available) > checkLeadership() may make some test case become flaky under GitHub CI. > --- > > Key: RATIS-1014 > URL: https://issues.apache.org/jira/browse/RATIS-1014 > Project: Ratis > Issue Type: Improvement >Reporter: Glen Geng >Assignee: Glen Geng >Priority: Major > > We should make sure that the stale leader steps down to the candidate state > before the next leader election. > Proposal: > In the heartbeat thread in the Leader node, we should check if the last > response time of the follower is less than the leader election timeout. If > the majority of the follower’s last response time is less than the leader > election timeout, the current leader is still the active leader. Majority of > the followers are heartbeating to the current leader, so there can’t be a new > leader. > If the majority of follower’s last response time is greater than the leader > election timeout, the current leader should step down and become a candidate. > With this check, we can be sure that the current leader will step down and > become a candidate before the new leader election starts in case of a network > partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-1014) checkLeadership() may make some test case become flaky under GitHub CI.
[ https://issues.apache.org/jira/browse/RATIS-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glen Geng updated RATIS-1014: - Issue Type: Test (was: Improvement) > checkLeadership() may make some test case become flaky under GitHub CI. > --- > > Key: RATIS-1014 > URL: https://issues.apache.org/jira/browse/RATIS-1014 > Project: Ratis > Issue Type: Test >Reporter: Glen Geng >Assignee: Glen Geng >Priority: Major > > We should make sure that the stale leader steps down to the candidate state > before the next leader election. > Proposal: > In the heartbeat thread in the Leader node, we should check if the last > response time of the follower is less than the leader election timeout. If > the majority of the follower’s last response time is less than the leader > election timeout, the current leader is still the active leader. Majority of > the followers are heartbeating to the current leader, so there can’t be a new > leader. > If the majority of follower’s last response time is greater than the leader > election timeout, the current leader should step down and become a candidate. > With this check, we can be sure that the current leader will step down and > become a candidate before the new leader election starts in case of a network > partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-1013) Fix failed UT caused by RATIS-757
[ https://issues.apache.org/jira/browse/RATIS-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168340#comment-17168340 ] runzhiwang commented on RATIS-1013: --- [~ljain] [~dineshchitlangia] Hi, it looks like https://github.com/apache/incubator-ratis/pull/103 made CI unstable: it has failed CI 6 times, which is abnormal. Could you have a look? > Fix failed UT caused by RATIS-757 > - > > Key: RATIS-1013 > URL: https://issues.apache.org/jira/browse/RATIS-1013 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-1013) Fix failed UT caused by RATIS-757
[ https://issues.apache.org/jira/browse/RATIS-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] runzhiwang updated RATIS-1013: -- Parent: RATIS-863 Issue Type: Sub-task (was: Bug) > Fix failed UT caused by RATIS-757 > - > > Key: RATIS-1013 > URL: https://issues.apache.org/jira/browse/RATIS-1013 > Project: Ratis > Issue Type: Sub-task >Reporter: runzhiwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-1013) Fix failed UT caused by RATIS-757
runzhiwang created RATIS-1013: - Summary: Fix failed UT caused by RATIS-757 Key: RATIS-1013 URL: https://issues.apache.org/jira/browse/RATIS-1013 Project: Ratis Issue Type: Bug Reporter: runzhiwang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-624) RaftServer should support pause/ unpause in its LifeCycle state
[ https://issues.apache.org/jira/browse/RATIS-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168296#comment-17168296 ] Hanisha Koneru commented on RATIS-624: -- The requirement comes from Ozone. In Ozone, if one of the OM Ratis servers is lagging behind and needs to install a snapshot to catch up, it has to stop its Ratis server so that no transactions are applied in the meantime. If Ratis provided an option to pause/unpause its state, the OM Ratis server would not have to be stopped during this process. > RaftServer should support pause/ unpause in its LifeCycle state > --- > > Key: RATIS-624 > URL: https://issues.apache.org/jira/browse/RATIS-624 > Project: Ratis > Issue Type: Task >Reporter: Hanisha Koneru >Priority: Major > Labels: ozone > Fix For: 1.1.0 > > > This Jira aims to add support to RaftServer to support pause and unpause to > its state. When paused, the RaftServer should not accept any incoming append > log entries. -- This message was sent by Atlassian Jira (v8.3.4#803005)
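The pause/unpause behaviour requested above could look roughly like the sketch below. It is only an illustration under stated assumptions: PausableServerSketch, its two states, and its method names are hypothetical and do not reflect the actual Ratis LifeCycle or RaftServer API. The one behaviour taken from the issue description is that a paused server does not accept incoming append log entries.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical two-state sketch of a pausable server; not the Ratis LifeCycle API. */
final class PausableServerSketch {
  enum State { RUNNING, PAUSED }

  private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

  /** Moves RUNNING -> PAUSED; returns false if the server was not running. */
  boolean pause() {
    return state.compareAndSet(State.RUNNING, State.PAUSED);
  }

  /** Moves PAUSED -> RUNNING; returns false if the server was not paused. */
  boolean unpause() {
    return state.compareAndSet(State.PAUSED, State.RUNNING);
  }

  /** Append log entries are accepted only while the server is running. */
  boolean acceptsAppendEntries() {
    return state.get() == State.RUNNING;
  }
}
```

Under this sketch, a caller such as the OM would pause the server, install the snapshot, and then unpause it, instead of stopping and restarting the server.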
[jira] [Commented] (RATIS-624) RaftServer should support pause/ unpause in its LifeCycle state
[ https://issues.apache.org/jira/browse/RATIS-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168119#comment-17168119 ] Rui Wang commented on RATIS-624: [~hanishakoneru] thanks! Can you share a bit of context on this JIRA? Are there any papers or systems that do this? Or is there a requirement from hadoop-ozone for Ratis to support this? > RaftServer should support pause/ unpause in its LifeCycle state > --- > > Key: RATIS-624 > URL: https://issues.apache.org/jira/browse/RATIS-624 > Project: Ratis > Issue Type: Task >Reporter: Hanisha Koneru >Priority: Major > Labels: ozone > Fix For: 1.1.0 > > > This Jira aims to add support to RaftServer to support pause and unpause to > its state. When paused, the RaftServer should not accept any incoming append > log entries. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-1011) Define internal streaming APIs
[ https://issues.apache.org/jira/browse/RATIS-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-1011: -- Description: Similar to ratis rpc, ratis streaming should define a set of internal APIs in order to support pluggable implementations. The APIs must be asynchronous and event-driven. was:Similar to ratis rpc, ratis streaming should define a set of internal APIs in order to support pluggable implementations. > Define internal streaming APIs > -- > > Key: RATIS-1011 > URL: https://issues.apache.org/jira/browse/RATIS-1011 > Project: Ratis > Issue Type: Sub-task > Components: Streaming >Reporter: Tsz-wo Sze >Assignee: Ansh Khanna >Priority: Major > > Similar to ratis rpc, ratis streaming should define a set of internal APIs in > order to support pluggable implementations. > The APIs must be asynchronous and event-driven. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-1012) Implement ratis streaming using netty
Tsz-wo Sze created RATIS-1012: - Summary: Implement ratis streaming using netty Key: RATIS-1012 URL: https://issues.apache.org/jira/browse/RATIS-1012 Project: Ratis Issue Type: Sub-task Components: Streaming Reporter: Tsz-wo Sze Assignee: Ansh Khanna Since we are getting good results from RATIS-1009, we will continue to work on the first ratis streaming implementation using netty with zero buffer copying. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-1011) Define internal streaming APIs
Tsz-wo Sze created RATIS-1011: - Summary: Define internal streaming APIs Key: RATIS-1011 URL: https://issues.apache.org/jira/browse/RATIS-1011 Project: Ratis Issue Type: Sub-task Components: Streaming Reporter: Tsz-wo Sze Assignee: Ansh Khanna Similar to ratis rpc, ratis streaming should define a set of internal APIs in order to support pluggable implementations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-1009) A simple benchmark achieving zero-copy semantics using Netty.
[ https://issues.apache.org/jira/browse/RATIS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved RATIS-1009. --- Fix Version/s: 1.1.0 Resolution: Fixed I have merged the pull request. Thanks, Ansh! > A simple benchmark achieving zero-copy semantics using Netty. > - > > Key: RATIS-1009 > URL: https://issues.apache.org/jira/browse/RATIS-1009 > Project: Ratis > Issue Type: Sub-task > Components: Streaming >Reporter: Ansh Khanna >Assignee: Ansh Khanna >Priority: Major > Fix For: 1.1.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Patch: [https://github.com/apache/incubator-ratis/pull/155] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-1009) A simple benchmark achieving zero-copy semantics using Netty.
[ https://issues.apache.org/jira/browse/RATIS-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated RATIS-1009: -- Component/s: Streaming > A simple benchmark achieving zero-copy semantics using Netty. > - > > Key: RATIS-1009 > URL: https://issues.apache.org/jira/browse/RATIS-1009 > Project: Ratis > Issue Type: Sub-task > Components: Streaming >Reporter: Ansh Khanna >Assignee: Ansh Khanna >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Patch: [https://github.com/apache/incubator-ratis/pull/155] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-624) RaftServer should support pause/ unpause in its LifeCycle state
[ https://issues.apache.org/jira/browse/RATIS-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168018#comment-17168018 ] Hanisha Koneru commented on RATIS-624: -- [~amaliujia], please go ahead. > RaftServer should support pause/ unpause in its LifeCycle state > --- > > Key: RATIS-624 > URL: https://issues.apache.org/jira/browse/RATIS-624 > Project: Ratis > Issue Type: Task >Reporter: Hanisha Koneru >Priority: Major > Labels: ozone > Fix For: 1.1.0 > > > This Jira aims to add support to RaftServer to support pause and unpause to > its state. When paused, the RaftServer should not accept any incoming append > log entries. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (RATIS-965) Add a metric for raftServer impl groups for a raft server
[ https://issues.apache.org/jira/browse/RATIS-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain reassigned RATIS-965: - Assignee: Cyrus Jackson (was: Ansh Khanna) > Add a metric for raftServer impl groups for a raft server > - > > Key: RATIS-965 > URL: https://issues.apache.org/jira/browse/RATIS-965 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Assignee: Cyrus Jackson >Priority: Major > Fix For: 1.1.0 > > Attachments: RATIS-965.001.patch > > > Currently, a single raft server instance can contain multiple raftServerImpl > belonging to different raft groups. The idea here is to track the number of > RaftGroups a raft server is part of. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-965) Add a metric for raftServer impl groups for a raft server
[ https://issues.apache.org/jira/browse/RATIS-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain resolved RATIS-965. --- Fix Version/s: 1.1.0 Resolution: Fixed > Add a metric for raftServer impl groups for a raft server > - > > Key: RATIS-965 > URL: https://issues.apache.org/jira/browse/RATIS-965 > Project: Ratis > Issue Type: Sub-task >Reporter: Shashikant Banerjee >Assignee: Ansh Khanna >Priority: Major > Fix For: 1.1.0 > > Attachments: RATIS-965.001.patch > > > Currently, a single raft server instance can contain multiple raftServerImpl > belonging to different raft groups. The idea here is to track the number of > RaftGroups a raft server is part of. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-624) RaftServer should support pause/ unpause in its LifeCycle state
[ https://issues.apache.org/jira/browse/RATIS-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167653#comment-17167653 ] Rui Wang commented on RATIS-624: [~hanishakoneru] Can I work on this JIRA? > RaftServer should support pause/ unpause in its LifeCycle state > --- > > Key: RATIS-624 > URL: https://issues.apache.org/jira/browse/RATIS-624 > Project: Ratis > Issue Type: Task >Reporter: Hanisha Koneru >Priority: Major > Labels: ozone > Fix For: 1.1.0 > > > This Jira aims to add support to RaftServer to support pause and unpause to > its state. When paused, the RaftServer should not accept any incoming append > log entries. -- This message was sent by Atlassian Jira (v8.3.4#803005)