[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282985#comment-15282985 ] Scott Blum commented on SOLR-8697: -- Probably so. I think we should close that one and assume it's fixed. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: 5.5.1, 6.0 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282958#comment-15282958 ] Jeff Wartes commented on SOLR-8697: --- Does this fix SOLR-6498? > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: 5.5.1, 6.0 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252752#comment-15252752 ] ASF subversion and git services commented on SOLR-8697: --- Commit f6fca6901665cab4bea078baa4350ddc8964a2cf in lucene-solr's branch refs/heads/branch_5_5 from [~shalinmangar] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6fca69 ] SOLR-8697: Fix precommit failure > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master, 6.0, 5.5.1 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252751#comment-15252751 ] ASF subversion and git services commented on SOLR-8697: --- Commit 78bed536984dbfd4ba2f802deb58b29979d59329 in lucene-solr's branch refs/heads/branch_5_5 from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=78bed53 ] SOLR-8697: Add synchronization around registering as leader and canceling. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master, 6.0, 5.5.1 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252750#comment-15252750 ] ASF subversion and git services commented on SOLR-8697: --- Commit f8dd98f93b75ff774fd81c3a44d551d204676b94 in lucene-solr's branch refs/heads/branch_5_5 from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8dd98f ] SOLR-8697: Scope ZK election nodes by session to prevent elections from interfering with each other and other small LeaderElector improvements. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master, 6.0, 5.5.1 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252650#comment-15252650 ] ASF subversion and git services commented on SOLR-8697: --- Commit 4b42c77ac6b045dce914d2244298cd0fea5300e4 in lucene-solr's branch refs/heads/branch_5x from [~shalinmangar] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4b42c77 ] SOLR-8697: Fix precommit failure > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master, 6.0, 5.5.1 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252648#comment-15252648 ] ASF subversion and git services commented on SOLR-8697: --- Commit 0dabbadb4dbdabbad4da06a49e2eebd908a4cd62 in lucene-solr's branch refs/heads/branch_5x from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0dabbad ] SOLR-8697: Scope ZK election nodes by session to prevent elections from interfering with each other and other small LeaderElector improvements. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master, 6.0, 5.5.1 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252649#comment-15252649 ] ASF subversion and git services commented on SOLR-8697: --- Commit d08169e6f6355d9023779cac1303ab48fe2c86e0 in lucene-solr's branch refs/heads/branch_5x from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d08169e ] SOLR-8697: Add synchronization around registering as leader and canceling. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master, 6.0, 5.5.1 > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170408#comment-15170408 ] ASF subversion and git services commented on SOLR-8697: --- Commit c2bc93822d73b66c74ed998ea6c57a3cce05af44 in lucene-solr's branch refs/heads/master from [~shalinmangar] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2bc938 ] SOLR-8697: Fix precommit failure > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169379#comment-15169379 ] ASF subversion and git services commented on SOLR-8697: --- Commit efb7bb171b22a3c6a00d30eefe935a0024df0c71 in lucene-solr's branch refs/heads/master from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=efb7bb1 ] SOLR-8697: Add synchronization around registering as leader and canceling. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163597#comment-15163597 ] Mark Miller commented on SOLR-8697: --- No, we only open a new issue if this one has gone out in a release, otherwise we can keep iterating as needed in the same issue. A bit busy, but I'll get to this soon. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163535#comment-15163535 ] Scott Blum commented on SOLR-8697: -- [~markrmil...@gmail.com] should I open a separate jira for the follow on patch? > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159632#comment-15159632 ] Scott Blum commented on SOLR-8697: -- Attached a patch with real synchronization. 99% sure this will fix the observed flake. I also included the MockZKController.close() election cancel change, I think it's a good cleanup but not related to the flake. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159610#comment-15159610 ] Scott Blum commented on SOLR-8697: -- Actually, I'm stupid. The flaky problem is that I *still* didn't fix the race regarding leaderZkNodeParentVersion. I just made it harder to repro. The smoking gun is this line: {code} log.info("No version found for ephemeral leader parent node, won't remove previous leader registration."); {code} You can repro this pretty easily with the following change: {code} -- a/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java ++ b/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java @@ -193,7 +193,7 @@ class ShardLeaderElectionContextBase extends ElectionContext { List results; results = zkClient.multi(ops, true); Thread.sleep(1); for (OpResult result : results) { if (result.getType() == ZooDefs.OpCode.setData) { SetDataResult dresult = (SetDataResult) result; {code} We need a harder synchronization around becoming leader vs. canceling. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159597#comment-15159597 ] Mark Miller commented on SOLR-8697: --- bq. , even running like 100 iterations. Unfortunately, that is common. The tests are run in a wicked diverse set of envs. We see stuff from the jenkins cluster no dev ever seems to hit. A lot of fails only pop when running with other tests to also bog down the system, and you won't see the issue with that test in isolation, and then you can configure different numbers of tests to run at the same time (I do 10 on my 6 core machine), and the different levels of hardware... Some things are pretty hard to replicate locally. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159589#comment-15159589 ] Scott Blum commented on SOLR-8697: -- Attached a followup patch that cancels previous elections when the MockZkController is closed. I haven't actually been able to get OverseerTest.testShardLeaderChange() to flake on my machine, with or without the change, even running like 100 iterations. But > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, > SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159562#comment-15159562 ] Mark Miller commented on SOLR-8697: --- bq. deleting the registration on cancel. +1. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: OverseerTestFail.log, SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159537#comment-15159537 ] Scott Blum commented on SOLR-8697: -- I'll take a look. I ran into this before on account of the test code doing a couple of questionable things. 1) Re-using the same ZK session for multiple election contexts on the same election. I fixed that one in my patch. 2) Not deleting the registration on cancel. I didn't bother fixing this one, since I figured eventually the session timeout would kill all the ephemeral nodes, making for a more thorough test. But explicitly deleting the registration would speed the test up and make it more reliable, probably. Thoughts? > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159525#comment-15159525 ] Mark Miller commented on SOLR-8697: --- Note: Have not seen any fails like this in this test in a long, long time so may be related to this change. Perhaps just a test issue, because this retries for like 60 seconds or something. {noformat} [junit4] ERROR 66.3s J4 | OverseerTest.testShardLeaderChange <<< [junit4]> Throwable #1: org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed [junit4]>at __randomizedtesting.SeedInfo.seed([C4618609907C7E14:1A3201FE8AE48BE5]:0) [junit4]>at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:212) [junit4]>at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:173) [junit4]>at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:138) [junit4]>at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:310) [junit4]>at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:219) [junit4]>at org.apache.solr.cloud.OverseerTest$MockZKController.publishState(OverseerTest.java:181) [junit4]>at org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:841) [junit4]>at java.lang.Thread.run(Thread.java:745) [junit4]> Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists [junit4]>at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) [junit4]>at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949) [junit4]>at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) [junit4]>at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:577) [junit4]>at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:574) [junit4]>at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60) [junit4]>at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:574) [junit4]>at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:195) [junit4]>at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:49) [junit4]>at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:42) [junit4]>at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:178) [junit4]>... 45 more {noformat} > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Fix For: master > > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155203#comment-15155203 ] Scott Blum commented on SOLR-8697: -- I think it's such a subtle race, that it would only generally show up with code changes-- but as little as changing logging could trigger it / not trigger it. So it might have been beaten into a state where it happened to work unless you breathed on it. Drove me nuts for the longest time debugging it until I stumbled on the race. :D > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154996#comment-15154996 ] Mark Miller commented on SOLR-8697: --- Okay, good, but then was it after your changes? I don't recall seeing that test fail in a long, long time on our Jenkins 'cluster', and that's a bunch of machines running the tests continuously. I also don't recall it locally. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154993#comment-15154993 ] Scott Blum commented on SOLR-8697: -- Actually OverseerTest.testShardLeaderChange() DID catch this race for me. But only rarely. Debugging that flake is how I uncovered the race. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154976#comment-15154976 ] Mark Miller commented on SOLR-8697: --- bq. cancelElection() and runLeaderProcess() can race with each other. If the local process is trying to cancel right as it becomes leader, cancelElection() won't see a leaderZkNodeParentVersion yet, so it won't try to delete the leader registration. Meanwhile, runLeaderProcess() still succeeds in creating the leader registration. The call to super.cancelElection() does remove us from the queue, but the dead leader registration is left there. Any thoughts on why the existing stress tests for leader election can't catch this? Can we beef something up? > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154965#comment-15154965 ] ASF subversion and git services commented on SOLR-8697: --- Commit 9418369b46586818467109e482b70ba41e90d4ed in lucene-solr's branch refs/heads/master from markrmiller [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9418369 ] SOLR-8697: Scope ZK election nodes by session to prevent elections from interfering with each other and other small LeaderElector improvements. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154950#comment-15154950 ] Mark Miller commented on SOLR-8697: --- FYI, we are in a special little place where we can break back compat and don't have to consider rolling upgrades because the next release is 6.0. We don't have much time before 6.0 branches though I think. Patch looks good to me. Others should take a look as well, but I'll commit to get Jenkins cranking on it. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum >Assignee: Mark Miller > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154939#comment-15154939 ] Mark Miller commented on SOLR-8697: --- bq. Bringing in Curator at some point would be something I'd only advocate for incrementally and in pieces, like replace our DQ with Curator's, etc. Yeah, I suppose if we had some consensus to push it forward over time, that's a viable option. bq. If an outside party forcibly deletes our node, we should put ourselves at the back of the line. Yeah, that sounds like an interesting improvement. Much nicer than making a bunch of distrib calls. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154919#comment-15154919 ] Mark Miller commented on SOLR-8697: --- You can leave the old patches by the way. We tend to leave the history and just pull the latest patch with the same file name. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154888#comment-15154888 ] Scott Blum commented on SOLR-8697: -- Yeah, totally agreed on refactoring and trying to fix core bugs! Bringing in Curator at some point would be something I'd only advocate for incrementally and in pieces, like replace our DQ with Curator's, etc. Moving everything over at in a short period of time would be a pipe dream anyway. Back on the topic of LeaderElector, I think this patch is in a pretty good state now. The only thing I want to consider doing in the short term (after this patch) is that, in addition to watching the node ahead of you, I think we should also be watching our own node, whether or not we're leader. If an outside party forcibly deletes our node, we should put ourselves at the back of the line. If you think about it, if we could trust that behavior, something like RebalanceLeaders wouldn't even need to be a distributed request; overseer could just delete the current leader elect node and trust the owner to do the right thing. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154853#comment-15154853 ] Mark Miller commented on SOLR-8697: --- bq. full disclosure: I'm a committer And you worked on GWT! Awesome. Have not used it in years, still madly in love with GWT. bq. using third party libs We may have started with curator but when we started (query side of SolrCloud was 2009 or 2010?) it was not around or we did not know about it back then if it was. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154833#comment-15154833 ] Mark Miller commented on SOLR-8697: --- Curator has come up before. Personally, I have not wanted to try and mimic what we have or go through a protracted hardening process again. This stuff is all very touchy, and our tests def do not catch anything, so a rip and replace at that low level would be both very difficult and sure to introduce a lot of issues. I think a lot of the problem is that devs like to favor just tossing crap on top of what exists, rather than trying to wholistically move the design forward or make it right for what they want to add (Examples: OverseerNodePrioritizer and RebalanceLeaders - which also made the election code much more dense). I feel a lot of "let's just make this work". I can't tell you how surprised I've been that some devs have come and built so much on some of the prototype code I initially laid out. I've always thought, how do you build so much on this without finding/fixing more core bugs and seeing other necessary improvments more things as you go? Not that it doesn't happen, but the scale has historically been way below what I think makes sense. Easy for me to say I guess. Anyway, it's great that you have already filed a bunch of issues :) I'd rather focus on some refactoring than bringing in curator though. The implications of that would be pretty large and we have plenty of other more pressing issues I think. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154817#comment-15154817 ] Scott Blum commented on SOLR-8697: -- I think part of the general problem with a lot of the ZK-interacting code is a lack of clean separation of concerns. The relationships between LeaderElector and the various ElectionContext subclasses are pretty gnarly and incestuous. DistributedQueue had a similar kind of design problem before I extracted the app specific gnarly parts into OverseerTaskQueue. Have we considered trying to migrate to, say, Apache Curator (full disclosure: I'm a committer)? There are a lot of advantages to using third party libs for some of this common patterns like distributed queue, leader election, or even observing changes in a tree. Those components tend to be reusable, better documented, with cleaner APIs, and have a natural resistance to spaghetti invasion. (Examples: OverseerNodePrioritizer and RebalanceLeaders are intricately tied to implementation details of LeaderElector.) A clean, reusable leader election component (with its own tests) that could simply be used in a few different contexts seems like a good place to be longer term. That said, I hope this patch can simply clean up some up the existing bugs without being too disruptive. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154740#comment-15154740 ] Mark Miller commented on SOLR-8697: --- bq. TBH, the code is pretty hard to follow in its existing form Yup. It was mildly hairy in its first form (copying the ZK recipe as described) and took a while to harden. Then some contributions came that just made it insane to follow. I've brought it up before, instead of trying to avoid thundering herd issues with what will be a reasonably low number of replicas trying to be leader, we probably should just have very simple leader elections. All of the original logic, and the logic that was added that made it really hard for me to follow, would be really simple if we gave up the cool elegant approach we used to avoid a mostly non existent thundering herd issue. That thicket is just a ripe breeding ground for random bugs our tests just don't easily expose. At this point, the effort to change reliably is probably really high though. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154618#comment-15154618 ] Scott Blum commented on SOLR-8697: -- Okay, all the tests are passing now, I think this is ready for review. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153418#comment-15153418 ] Scott Blum commented on SOLR-8697: -- Found another bug in ShardLeaderElectionContextBase. cancelElection() and runLeaderProcess() can race with each other. If the local process is trying to cancel right as it becomes leader, cancelElection() won't see a leaderZkNodeParentVersion yet, so it won't try to delete the leader registration. Meanwhile, runLeaderProcess() still succeeds in creating the leader registration. The call to super.cancelElection() does remove us from the queue, but the dead leader registration is left there. I think moving the call to super.cancelElection() to the start of ShardLeaderElectionContextBase.cancelElection() should resolve the race, because actually removing the election node will cause the multi to fail with a NoNode rather than a NodeExists, and it won't get stuck in a retry loop. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > Attachments: SOLR-8697.patch > > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153294#comment-15153294 ] Scott Blum commented on SOLR-8697: -- [~markrmil...@gmail.com] [~erickerickson] I think there is a potential problem with how OverseerTest is constructed, that perhaps caused us to write some code into LeaderElector in the past that doesn't make any sense for live code. I'm looking at the implementation of MockZkController.publishState() (it's kind of a beast) and I notice that when it creates an ElectionContext, it never actually adds it to the map, checks whether one already exists, etc. As a result, MockZkController does something the real ZkController never does -- it tries to register two different election contexts for the same core on the same ZK session. My question is, what's the right fix? I can either make MockZkController not setup a new electionContext on subsequent invocations, or I could make it simply cancel the previous election context before creating a new one. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153048#comment-15153048 ] Scott Blum commented on SOLR-8697: -- TBH, the code is pretty hard to follow in its existing form, and a lot of the theory of operation and design is difficult to distill. There is code that looks like it's trying to do what you're talking about, but it's buggy. Take this analysis with like 70% confidence: The existing code has a very hazy definition of "myself" and "previous registration", from what I can tell. The previous elect node from a different process qualifies as "myself" in some cases, so the new process thinks it's already registered when it's not. > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153011#comment-15153011 ] Mark Miller commented on SOLR-8697: --- bq. The old process's ZK leader elect node has not yet disappeared, and the new process fails to set appropriate watches. I thought we waited for a while for the old one to go away...is that missing in the Overseer election code and just in the shard election code? > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8697) Fix LeaderElector issues
[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152995#comment-15152995 ] Scott Blum commented on SOLR-8697: -- [~shalinmangar] [~erickerickson] [~markrmil...@gmail.com] > Fix LeaderElector issues > > > Key: SOLR-8697 > URL: https://issues.apache.org/jira/browse/SOLR-8697 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 5.4.1 >Reporter: Scott Blum > Labels: patch, reliability, solrcloud > > This patch is still somewhat WIP for a couple of reasons: > 1) Still debugging test failures. > 2) This will more scrutiny from knowledgable folks! > There are some subtle bugs with the current implementation of LeaderElector, > best demonstrated by the following test: > 1) Start up a small single-node solrcloud. it should be become Overseer. > 2) kill -9 the solrcloud process and immediately start a new one. > 3) The new process won't become overseer. The old process's ZK leader elect > node has not yet disappeared, and the new process fails to set appropriate > watches. > NOTE: this is only reproducible if the new node is able to start up and join > the election quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org