[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-05-13 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282985#comment-15282985
 ] 

Scott Blum commented on SOLR-8697:
--

Probably so.  I think we should close that one and assume it's fixed.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: 5.5.1, 6.0
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-05-13 Thread Jeff Wartes (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282958#comment-15282958
 ] 

Jeff Wartes commented on SOLR-8697:
---

Does this fix SOLR-6498?

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: 5.5.1, 6.0
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252752#comment-15252752
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit f6fca6901665cab4bea078baa4350ddc8964a2cf in lucene-solr's branch 
refs/heads/branch_5_5 from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6fca69 ]

SOLR-8697: Fix precommit failure


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master, 6.0, 5.5.1
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252751#comment-15252751
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit 78bed536984dbfd4ba2f802deb58b29979d59329 in lucene-solr's branch 
refs/heads/branch_5_5 from markrmiller
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=78bed53 ]

SOLR-8697: Add synchronization around registering as leader and canceling.


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master, 6.0, 5.5.1
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252750#comment-15252750
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit f8dd98f93b75ff774fd81c3a44d551d204676b94 in lucene-solr's branch 
refs/heads/branch_5_5 from markrmiller
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8dd98f ]

SOLR-8697: Scope ZK election nodes by session to prevent elections from 
interfering with each other and other small LeaderElector improvements.


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master, 6.0, 5.5.1
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252650#comment-15252650
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit 4b42c77ac6b045dce914d2244298cd0fea5300e4 in lucene-solr's branch 
refs/heads/branch_5x from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4b42c77 ]

SOLR-8697: Fix precommit failure


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master, 6.0, 5.5.1
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252648#comment-15252648
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit 0dabbadb4dbdabbad4da06a49e2eebd908a4cd62 in lucene-solr's branch 
refs/heads/branch_5x from markrmiller
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0dabbad ]

SOLR-8697: Scope ZK election nodes by session to prevent elections from 
interfering with each other and other small LeaderElector improvements.


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master, 6.0, 5.5.1
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252649#comment-15252649
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit d08169e6f6355d9023779cac1303ab48fe2c86e0 in lucene-solr's branch 
refs/heads/branch_5x from markrmiller
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d08169e ]

SOLR-8697: Add synchronization around registering as leader and canceling.


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master, 6.0, 5.5.1
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-26 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170408#comment-15170408
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit c2bc93822d73b66c74ed998ea6c57a3cce05af44 in lucene-solr's branch 
refs/heads/master from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2bc938 ]

SOLR-8697: Fix precommit failure


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-26 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169379#comment-15169379
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit efb7bb171b22a3c6a00d30eefe935a0024df0c71 in lucene-solr's branch 
refs/heads/master from markrmiller
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=efb7bb1 ]

SOLR-8697: Add synchronization around registering as leader and canceling.


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-24 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163597#comment-15163597
 ] 

Mark Miller commented on SOLR-8697:
---

No, we only open a new issue if this one has gone out in a release, otherwise 
we can keep iterating as needed in the same issue. A bit busy, but I'll get to 
this soon.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-24 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163535#comment-15163535
 ] 

Scott Blum commented on SOLR-8697:
--

[~markrmil...@gmail.com] should I open a separate jira for the follow on patch?

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-23 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159632#comment-15159632
 ] 

Scott Blum commented on SOLR-8697:
--

Attached a patch with real synchronization.  99% sure this will fix the 
observed flake.
I also included the MockZKController.close() election cancel change, I think 
it's a good cleanup but not related to the flake.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-23 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159610#comment-15159610
 ] 

Scott Blum commented on SOLR-8697:
--

Actually, I'm stupid.  The flaky problem is that I *still* didn't fix the race 
regarding leaderZkNodeParentVersion.  I just made it harder to repro.

The smoking gun is this line:

{code}
  log.info("No version found for ephemeral leader parent node, won't remove 
previous leader registration.");
{code}

You can repro this pretty easily with the following change:

{code}

 -- a/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
 ++ b/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java

@@ -193,7 +193,7 @@ class ShardLeaderElectionContextBase extends 
ElectionContext {
   List results;
   
   results = zkClient.multi(ops, true);
   
   Thread.sleep(1);
   for (OpResult result : results) {
 if (result.getType() == ZooDefs.OpCode.setData) {
   SetDataResult dresult = (SetDataResult) result;
{code}

We need a harder synchronization around becoming leader vs. canceling.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159597#comment-15159597
 ] 

Mark Miller commented on SOLR-8697:
---

bq. , even running like 100 iterations.

Unfortunately, that is common. The tests are run in a wicked diverse set of 
envs. We see stuff from the jenkins cluster no dev ever seems to hit. 

A lot of fails only pop when running with other tests to also bog down the 
system, and you won't see the issue with that test in isolation, and then you 
can configure different numbers of tests to run at the same time (I do 10 on my 
6 core machine), and the different levels of hardware...

Some things are pretty hard to replicate locally.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-23 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159589#comment-15159589
 ] 

Scott Blum commented on SOLR-8697:
--

Attached a followup patch that cancels previous elections when the 
MockZkController is closed.  I haven't actually been able to get 
OverseerTest.testShardLeaderChange() to flake on my machine, with or without 
the change, even running like 100 iterations.  But 

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159562#comment-15159562
 ] 

Mark Miller commented on SOLR-8697:
---

bq. deleting the registration on cancel.

+1.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: OverseerTestFail.log, SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-23 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159537#comment-15159537
 ] 

Scott Blum commented on SOLR-8697:
--

I'll take a look.  I ran into this before on account of the test code doing a 
couple of questionable things.

1) Re-using the same ZK session for multiple election contexts on the same 
election.  I fixed that one in my patch.

2) Not deleting the registration on cancel.  I didn't bother fixing this one, 
since I figured eventually the session timeout would kill all the ephemeral 
nodes, making for a more thorough test.  But explicitly deleting the 
registration would speed the test up and make it more reliable, probably.

Thoughts?


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159525#comment-15159525
 ] 

Mark Miller commented on SOLR-8697:
---

Note: Have not seen any fails like this in this test in a long, long time so 
may be related to this change. Perhaps just a test issue, because this retries 
for like 60 seconds or something.

{noformat}
   [junit4] ERROR   66.3s J4  | OverseerTest.testShardLeaderChange <<<
   [junit4]> Throwable #1: org.apache.solr.common.SolrException: Could not 
register as the leader because creating the ephemeral registration node in 
ZooKeeper failed
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([C4618609907C7E14:1A3201FE8AE48BE5]:0)
   [junit4]>at 
org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:212)
   [junit4]>at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:173)
   [junit4]>at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:138)
   [junit4]>at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:310)
   [junit4]>at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:219)
   [junit4]>at 
org.apache.solr.cloud.OverseerTest$MockZKController.publishState(OverseerTest.java:181)
   [junit4]>at 
org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:841)
   [junit4]>at java.lang.Thread.run(Thread.java:745)
   [junit4]> Caused by: 
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
   [junit4]>at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
   [junit4]>at 
org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
   [junit4]>at 
org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
   [junit4]>at 
org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:577)
   [junit4]>at 
org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:574)
   [junit4]>at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
   [junit4]>at 
org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:574)
   [junit4]>at 
org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:195)
   [junit4]>at 
org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:49)
   [junit4]>at 
org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:42)
   [junit4]>at 
org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:178)
   [junit4]>... 45 more
{noformat}

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Fix For: master
>
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15155203#comment-15155203
 ] 

Scott Blum commented on SOLR-8697:
--

I think it's such a subtle race, that it would only generally show up with code 
changes-- but as little as changing logging could trigger it / not trigger it.  
So it might have been beaten into a state where it happened to work unless you 
breathed on it.  Drove me nuts for the longest time debugging it until I 
stumbled on the race. :D

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154996#comment-15154996
 ] 

Mark Miller commented on SOLR-8697:
---

Okay, good, but then was it after your changes? I don't recall seeing that test 
fail in a long, long time on our Jenkins 'cluster', and that's a bunch of 
machines running the tests continuously. I also don't recall it locally.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154993#comment-15154993
 ] 

Scott Blum commented on SOLR-8697:
--

Actually OverseerTest.testShardLeaderChange() DID catch this race for me.  But 
only rarely.  Debugging that flake is how I uncovered the race.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154976#comment-15154976
 ] 

Mark Miller commented on SOLR-8697:
---

bq. cancelElection() and runLeaderProcess() can race with each other. If the 
local process is trying to cancel right as it becomes leader, cancelElection() 
won't see a leaderZkNodeParentVersion yet, so it won't try to delete the leader 
registration. Meanwhile, runLeaderProcess() still succeeds in creating the 
leader registration. The call to super.cancelElection() does remove us from the 
queue, but the dead leader registration is left there.

Any thoughts on why the existing stress tests for leader election can't catch 
this? Can we beef something up?

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154965#comment-15154965
 ] 

ASF subversion and git services commented on SOLR-8697:
---

Commit 9418369b46586818467109e482b70ba41e90d4ed in lucene-solr's branch 
refs/heads/master from markrmiller
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9418369 ]

SOLR-8697: Scope ZK election nodes by session to prevent elections from 
interfering with each other and other small LeaderElector improvements.


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154950#comment-15154950
 ] 

Mark Miller commented on SOLR-8697:
---

FYI, we are in a special little place where we can break back compat and don't 
have to consider rolling upgrades because the next release is 6.0. We don't 
have much time before 6.0 branches though I think.

Patch looks good to me. Others should take a look as well, but I'll commit to 
get Jenkins cranking on it.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>Assignee: Mark Miller
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154939#comment-15154939
 ] 

Mark Miller commented on SOLR-8697:
---

bq. Bringing in Curator at some point would be something I'd only advocate for 
incrementally and in pieces, like replace our DQ with Curator's, etc.

Yeah, I suppose if we had some consensus to push it forward over time, that's a 
viable option.

bq.  If an outside party forcibly deletes our node, we should put ourselves at 
the back of the line.

Yeah, that sounds like an interesting improvement. Much nicer than making a 
bunch of distrib calls.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154919#comment-15154919
 ] 

Mark Miller commented on SOLR-8697:
---

You can leave the old patches by the way. We tend to leave the history and just 
pull the latest patch with the same file name.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154888#comment-15154888
 ] 

Scott Blum commented on SOLR-8697:
--

Yeah, totally agreed on refactoring and trying to fix core bugs!  Bringing in 
Curator at some point would be something I'd only advocate for incrementally 
and in pieces, like replace our DQ with Curator's, etc.  Moving everything over 
at in a short period of time would be a pipe dream anyway.

Back on the topic of LeaderElector, I think this patch is in a pretty good 
state now.  The only thing I want to consider doing in the short term (after 
this patch) is that, in addition to watching the node ahead of you, I think we 
should also be watching our own node, whether or not we're leader.  If an 
outside party forcibly deletes our node, we should put ourselves at the back of 
the line.  If you think about it, if we could trust that behavior, something 
like RebalanceLeaders wouldn't even need to be a distributed request; overseer 
could just delete the current leader elect node and trust the owner to do the 
right thing.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154853#comment-15154853
 ] 

Mark Miller commented on SOLR-8697:
---

bq. full disclosure: I'm a committer

And you worked on GWT! Awesome. Have not used it in years, still madly in love 
with GWT.

bq. using third party libs 

We may have started with curator but when we started (query side of SolrCloud 
was 2009 or 2010?) it was not around or we did not know about it back then if 
it was.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154833#comment-15154833
 ] 

Mark Miller commented on SOLR-8697:
---

Curator has come up before. Personally, I have not wanted to try and mimic what 
we have or go through a protracted hardening process again. This stuff is all 
very touchy, and our tests def do not catch anything, so a rip and replace at 
that low level would be both very difficult and sure to introduce a lot of 
issues.

I think a lot of the problem is that devs like to favor just tossing crap on 
top of what exists, rather than trying to wholistically move the design forward 
or make it right for what they want to add (Examples: OverseerNodePrioritizer 
and RebalanceLeaders - which also made the election code much more dense). I 
feel a lot of "let's just make this work". I can't tell you how surprised I've 
been that some devs have come and built so much on some of the prototype code I 
initially laid out. I've always thought, how do you build so much on this 
without finding/fixing more core bugs and seeing other necessary improvments 
more things as you go? Not that it doesn't happen, but the scale has 
historically been way below what I think makes sense. Easy for me to say I 
guess. Anyway, it's great that you have already filed a bunch of issues :)

I'd rather focus on some refactoring than bringing in curator though. The 
implications of that would be pretty large and we have plenty of other more 
pressing issues I think.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154817#comment-15154817
 ] 

Scott Blum commented on SOLR-8697:
--

I think part of the general problem with a lot of the ZK-interacting code is a 
lack of clean separation of concerns.  The relationships between LeaderElector 
and the various ElectionContext subclasses are pretty gnarly and incestuous.  
DistributedQueue had a similar kind of design problem before I extracted the 
app specific gnarly parts into OverseerTaskQueue.

Have we considered trying to migrate to, say, Apache Curator (full disclosure: 
I'm a committer)?  There are a lot of advantages to using third party libs for 
some of this common patterns like distributed queue, leader election, or even 
observing changes in a tree.  Those components tend to be reusable, better 
documented, with cleaner APIs, and have a natural resistance to spaghetti 
invasion.  (Examples: OverseerNodePrioritizer and RebalanceLeaders are 
intricately tied to implementation details of LeaderElector.)

A clean, reusable leader election component (with its own tests) that could 
simply be used in a few different contexts seems like a good place to be longer 
term.

That said, I hope this patch can simply clean up some up the existing bugs 
without being too disruptive.


> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154740#comment-15154740
 ] 

Mark Miller commented on SOLR-8697:
---

bq. TBH, the code is pretty hard to follow in its existing form

Yup. It was mildly hairy in its first form (copying the ZK recipe as described) 
and took a while to harden. Then some contributions came that just made it 
insane to follow. I've brought it up before, instead of trying to avoid 
thundering herd issues with what will be a reasonably low number of replicas 
trying to be leader, we probably should just have very simple leader elections. 
All of the original logic, and the logic that was added that made it really 
hard for me to follow, would be really simple if we gave up the cool elegant 
approach we used to avoid a mostly non existent thundering herd issue. That 
thicket is just a ripe breeding ground for random bugs our tests just don't 
easily expose.

At this point, the effort to change reliably is probably really high though.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-19 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154618#comment-15154618
 ] 

Scott Blum commented on SOLR-8697:
--

Okay, all the tests are passing now, I think this is ready for review.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153418#comment-15153418
 ] 

Scott Blum commented on SOLR-8697:
--

Found another bug in ShardLeaderElectionContextBase.

cancelElection() and runLeaderProcess() can race with each other.  If the local 
process is trying to cancel right as it becomes leader, cancelElection() won't 
see a leaderZkNodeParentVersion yet, so it won't try to delete the leader 
registration.  Meanwhile, runLeaderProcess() still succeeds in creating the 
leader registration.  The call to super.cancelElection() does remove us from 
the queue, but the dead leader registration is left there.

I think moving the call to super.cancelElection() to the start of 
ShardLeaderElectionContextBase.cancelElection() should resolve the race, 
because actually removing the election node will cause the multi to fail with a 
NoNode rather than a NodeExists, and it won't get stuck in a retry loop.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153294#comment-15153294
 ] 

Scott Blum commented on SOLR-8697:
--

[~markrmil...@gmail.com] [~erickerickson]

I think there is a potential problem with how OverseerTest is constructed, that 
perhaps caused us to write some code into LeaderElector in the past that 
doesn't make any sense for live code.

I'm looking at the implementation of MockZkController.publishState() (it's kind 
of a beast) and I notice that when it creates an ElectionContext, it never 
actually adds it to the map, checks whether one already exists, etc.  As a 
result, MockZkController does something the real ZkController never does -- it 
tries to register two different election contexts for the same core on the same 
ZK session.

My question is, what's the right fix?  I can either make MockZkController not 
setup a new electionContext on subsequent invocations, or I could make it 
simply cancel the previous election context before creating a new one.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153048#comment-15153048
 ] 

Scott Blum commented on SOLR-8697:
--

TBH, the code is pretty hard to follow in its existing form, and a lot of the 
theory of operation and design is difficult to distill.  There is code that 
looks like it's trying to do what you're talking about, but it's buggy.

Take this analysis with like 70% confidence: The existing code has a very hazy 
definition of "myself" and "previous registration", from what I can tell.  The 
previous elect node from a different process qualifies as "myself" in some 
cases, so the new process thinks it's already registered when it's not.

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153011#comment-15153011
 ] 

Mark Miller commented on SOLR-8697:
---

bq. The old process's ZK leader elect node has not yet disappeared, and the new 
process fails to set appropriate watches.

I thought we waited for a while for the old one to go away...is that missing in 
the Overseer election code and just in the shard election code?

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8697) Fix LeaderElector issues

2016-02-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152995#comment-15152995
 ] 

Scott Blum commented on SOLR-8697:
--

[~shalinmangar] [~erickerickson] [~markrmil...@gmail.com]

> Fix LeaderElector issues
> 
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.4.1
>Reporter: Scott Blum
>  Labels: patch, reliability, solrcloud
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will more scrutiny from knowledgable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  it should be become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org