[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456851#comment-16456851 ]

Mark Miller commented on SOLR-11702:

This looks like a massive improvement for many long standing issues, great work [~caomanhdat]!

> Redesign current LIR implementation
> ---
>
> Key: SOLR-11702
> URL: https://issues.apache.org/jira/browse/SOLR-11702
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Assignee: Cao Manh Dat
> Priority: Major
> Fix For: 7.3, master (8.0)
>
> Attachments: SOLR-11702.patch, SOLR-11702.patch
>
>
> I recently looked into some problem related to racing between LIR and
> Recovering. I would like to propose a totally new approach to solve SOLR-5495
> problem because fixing current implementation by a bandage will lead us to
> other problems (we can not prove the correctness of the implementation).
> Feel free to give comments/thoughts about this new scheme.
> https://docs.google.com/document/d/1dM2GKMULsS45ZMuvtztVnM2m3fdUeRYNCyJorIIisEo/edit?usp=sharing

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400266#comment-16400266 ]

ASF subversion and git services commented on SOLR-11702:

Commit 1afe333844bf133538923a6ca1a3de0b2076d788 in lucene-solr's branch refs/heads/branch_7_3 from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1afe333 ]

SOLR-11702: Minor edits to log and exception messages

(cherry picked from commit dab739a)

(cherry picked from commit 4b52a19)
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400263#comment-16400263 ]

ASF subversion and git services commented on SOLR-11702:

Commit dab739ae4cc8c3ff4ece24992ad8c633f7a4b19c in lucene-solr's branch refs/heads/master from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dab739a ]

SOLR-11702: Minor edits to log and exception messages
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400264#comment-16400264 ]

ASF subversion and git services commented on SOLR-11702:

Commit 4b52a19f4adfed57c9265ebee85d4e03321f6dbb in lucene-solr's branch refs/heads/branch_7x from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4b52a19 ]

SOLR-11702: Minor edits to log and exception messages

(cherry picked from commit dab739a)
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394368#comment-16394368 ]

ASF subversion and git services commented on SOLR-11702:

Commit 1f994c97301fbe8926115925102c78a8a133e26b in lucene-solr's branch refs/heads/branch_7x from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1f994c9 ]

SOLR-11702: Remove noise of exception messages on failed to ping leader
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394367#comment-16394367 ]

ASF subversion and git services commented on SOLR-11702:

Commit e926f435d7e318b30b2d9ec38be87ad9ab7eed45 in lucene-solr's branch refs/heads/master from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e926f43 ]

SOLR-11702: Remove noise of exception messages on failed to ping leader
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392349#comment-16392349 ]

ASF subversion and git services commented on SOLR-11702:

Commit b992bbb2d7480d4cf2ff1d9302a7e20732c1100c in lucene-solr's branch refs/heads/branch_7x from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b992bbb ]

SOLR-11702: Fix precommit, only throw error to client if the replica is not in the same shard as leader
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392347#comment-16392347 ]

ASF subversion and git services commented on SOLR-11702:

Commit dae572819ba479bffd990ea7d8f0c4f7b76da5b0 in lucene-solr's branch refs/heads/master from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dae5728 ]

SOLR-11702: Fix precommit, only throw error to client if the replica is not in the same shard as leader
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383286#comment-16383286 ]

ASF subversion and git services commented on SOLR-11702:

Commit ce2386aaabc401bc89990597279eefeb67a914b0 in lucene-solr's branch refs/heads/branch_7x from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ce2386a ]

SOLR-11702: Remove old LIR call in SolrCmdDistributor and let DistributedUpdateProcessor handle it on finish()
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383285#comment-16383285 ]

ASF subversion and git services commented on SOLR-11702:

Commit f1ce5419eebfa361f572802eb4a8b637c2849bb5 in lucene-solr's branch refs/heads/master from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f1ce541 ]

SOLR-11702: Remove old LIR call in SolrCmdDistributor and let DistributedUpdateProcessor handle it on finish()
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343096#comment-16343096 ]

ASF subversion and git services commented on SOLR-11702:

Commit 8c8d78a4bb6c0f3322471af5765a01848247409c in lucene-solr's branch refs/heads/branch_7x from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8c8d78a ]

SOLR-11702: Redesign current LIR implementation
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343094#comment-16343094 ]

ASF subversion and git services commented on SOLR-11702:

Commit 27ef6530646a9af6f8fdf491afd80185bc4f7fee in lucene-solr's branch refs/heads/master from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=27ef653 ]

SOLR-11702: Redesign current LIR implementation
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335360#comment-16335360 ]

Cao Manh Dat commented on SOLR-11702:

bq. LIRRollingUpdatesTest.testNewReplicaOldLeader – why is the proxy closed for both leader and replica? Isn't closing for replica sufficient to force LIR?

Yeah, you're right, closing the leader's proxy is not necessary. That call is only for safety; I just want to simulate a real network partition between leader and replica.

bq. LIRRollingUpdatesTest calls TestInjection.reset() in tearDown but fault injection isn't used anywhere in the test so it can be removed.

+1

bq. Javadocs for ZkShardTerms.ensureTermIsHigher says "Ensure that leader's term is lower than some replica's terms" but shouldn't the leader have a higher term? This is also mentioned in the design document "The idea of term is only replicas (in the same shard) with highest term are considered healthy". The impl is doing the opposite i.e. it is increasing the replica's term to leaderTerm+1.

+1, the javadoc is mistyped.

bq. Can you add javadocs to the various methods in the ZkShardTerms.Terms class?

Sure
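The design-document sentence quoted above ("only replicas (in the same shard) with highest term are considered healthy") can be illustrated with a small sketch. This is not the ZkShardTerms code from the patch: the class name, the method name, and the plain `Map<String, Long>` representation of the terms are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of the "highest term is healthy" rule; not the real ZkShardTerms. */
final class HighestTermSketch {

    /** Returns the replicas whose term equals the highest term in the shard, sorted by name. */
    public static List<String> replicasWithHighestTerm(Map<String, Long> terms) {
        if (terms.isEmpty()) {
            return Collections.emptyList();
        }
        long highest = Collections.max(terms.values());
        List<String> healthy = new ArrayList<>();
        // Copy into a TreeMap so the result order is deterministic (sorted by replica name).
        for (Map.Entry<String, Long> e : new TreeMap<>(terms).entrySet()) {
            if (e.getValue() == highest) {
                healthy.add(e.getKey());
            }
        }
        return healthy;
    }

    public static void main(String[] args) {
        Map<String, Long> terms = new TreeMap<>();
        terms.put("core_node1", 2L); // assumed leader
        terms.put("core_node2", 2L); // in sync with the leader
        terms.put("core_node3", 1L); // missed an update, so its term is lower
        System.out.println(replicasWithHighestTerm(terms)); // prints [core_node1, core_node2]
    }
}
```

Under this reading, a leader that fails to reach a replica has to end up with a term strictly higher than that replica's, which is the point of the javadoc question.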
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335353#comment-16335353 ]

Shalin Shekhar Mangar commented on SOLR-11702:

Ok, thanks for clarifying Dat. A few more questions/comments:

# LIRRollingUpdatesTest.testNewReplicaOldLeader -- why is the proxy closed for both leader and replica? Isn't closing for replica sufficient to force LIR?
# LIRRollingUpdatesTest calls TestInjection.reset() in tearDown but fault injection isn't used anywhere in the test so it can be removed.
# Javadocs for ZkShardTerms.ensureTermIsHigher says "Ensure that leader's term is lower than some replica's terms" but shouldn't the leader have a higher term? This is also mentioned in the design document "The idea of _term_ is only replicas (in the same shard) with highest term are considered healthy". The impl is doing the opposite i.e. it is increasing the replica's term to leaderTerm+1.
# Can you add javadocs to the various methods in the ZkShardTerms.Terms class?
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326193#comment-16326193 ]

Cao Manh Dat commented on SOLR-11702:

{quote}
I'm not sure that ZkShardTerms.refreshTerms behaves correctly on ZK Reconnect. Say a watcher was set (numWatcher=1) but not fired and the zk client disconnects. Then on re-connect, the OnReconnectListener in ZkController fires which re-registers cores and calls refreshTerms again. Now watcher won't be initialized in this method (because numWatcher=1) and therefore won't be set on terms znode anymore. Can you please verify?
{quote}

The logic you described matches the code, but as far as I have observed, the watcher is always fired on reconnect, at least on the DISCONNECT event.
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326166#comment-16326166 ]

Shalin Shekhar Mangar commented on SOLR-11702:

I'm not sure that ZkShardTerms.refreshTerms behaves correctly on ZK Reconnect. Say a watcher was set (numWatcher=1) but not fired and the zk client disconnects. Then on re-connect, the OnReconnectListener in ZkController fires which re-registers cores and calls refreshTerms again. Now watcher won't be initialized in this method (because numWatcher=1) and therefore won't be set on terms znode anymore. Can you please verify?
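The scenario in this question can be modelled in a few lines. The sketch below is only a toy model of the counter guard as described in the comment: `numWatcher` and `refreshTerms` are names taken from the comment, everything else (the class, the boolean standing in for the ZooKeeper-side watch, the two event methods) is hypothetical and does not use the ZooKeeper API. It shows why the outcome depends on whether the watcher callback runs when the connection drops.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the numWatcher guard described above; not the real ZkShardTerms code. */
final class WatcherGuardSketch {
    private final AtomicInteger numWatcher = new AtomicInteger(0);
    // Stands in for "a watch is currently set on the terms znode".
    private boolean watchSet = false;

    /** Sets a watch only if the counter says none is outstanding. */
    public void refreshTerms() {
        if (numWatcher.compareAndSet(0, 1)) {
            watchSet = true;
        }
    }

    /** The watcher callback runs: the counter is released, so a later refresh can set a new watch. */
    public void watchFired() {
        if (watchSet) {
            watchSet = false;
            numWatcher.decrementAndGet();
        }
    }

    /** Assumed failure mode from the question: the watch is gone but the callback never ran. */
    public void watchLostWithoutFiring() {
        watchSet = false;
    }

    public boolean isWatching() {
        return watchSet;
    }

    public static void main(String[] args) {
        WatcherGuardSketch lost = new WatcherGuardSketch();
        lost.refreshTerms();
        lost.watchLostWithoutFiring();
        lost.refreshTerms(); // guard still sees numWatcher=1, so no new watch is set
        System.out.println("lost without firing -> watching: " + lost.isWatching()); // false

        WatcherGuardSketch fired = new WatcherGuardSketch();
        fired.refreshTerms();
        fired.watchFired();
        fired.refreshTerms(); // counter was released, so the watch is set again
        System.out.println("fired on disconnect -> watching: " + fired.isWatching()); // true
    }
}
```

In this model the guard is only safe if every lost watch also triggers the callback, which is the behaviour that would need to be verified.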
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325907#comment-16325907 ]

Cao Manh Dat commented on SOLR-11702:

Thanks [~shalinmangar]

{quote}
1. DUP.setupRequest skips replicas having terms. If I understand correctly, this will mean that updates are no longer forwarded to replicas until they publish themselves in recovery? Is that right?
{quote}

Right, if the term of a replica is less than the leader's term, the leader will stop sending updates to that replica.

{quote}
2. CreateCollectionCmd – throw InterruptedException directly from the method instead of trying to handle it here
{quote}

The code that deletes old term nodes in CreateCollectionCmd is handled exactly the same way as the code below it, so I do not understand the problem here.

{quote}
3. Mark LIR related classes/methods as deprecated – those are more likely to get attention right before 8.0 I think.
{quote}

Sure, this is a good idea.

{quote}
5. RecoveringCoreTermWatcher – Shouldn't lastTermDoRecovery be set after recovery completes? If not, how do we ensure that recoveries are stacked up?
{quote}

I do not see any problem in the current implementation: after we call {{doRecovery}}, the recovery process will start shortly.

{quote}
6. RecoveringCoreTermWatcher catches NullPointerException. Do a null check instead.
{quote}

Sure!

{quote}
7. RecoveryStrategy – why pingLeader? isn't it sufficient to use ZkStateReader.getLeaderRetry as we used to do earlier?
{quote}

Imagine this case, where there is a network partition between leader and replica:

* The leader increases the term of the replica.
* RecoveringCoreTermWatcher triggers the recovery process of the replica; the replica goes into recovery (hence increasing its term).
* The leader increases the term of the replica (because it failed to send an update to the replica, and the replica's term is now equal to the leader's term).
* RecoveringCoreTermWatcher triggers the recovery process of the replica; the replica goes into recovery (hence increasing its term).
* ...

This process will be repeated forever until the network is healed.

{quote}
8. ZkCollectionTerms – if getShard and remove methods need to be synchronized then seems like close can interfere. Perhaps better to synchronize on the terms map itself.
{quote}

This is a good idea.

{quote}
9. Can you explain the purpose of "new".equals(cd.getCoreProperty("lirVersion", "new"))) used in various places?
{quote}

That flag is mostly used for testing rolling updates and can be removed in SOLR-11812.
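The answer to point 1 above (the leader stops sending updates to a replica whose term is lower than its own) can be sketched as a simple filter. This is a minimal illustration and not the DistributedUpdateProcessor code: the class name, the method name, and the `Map<String, Long>` of terms are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of term-based forwarding; not the real DistributedUpdateProcessor. */
final class ForwardingSketch {

    /** Returns the non-leader replicas whose term is not lower than the leader's term. */
    public static List<String> replicasToForwardTo(String leader, Map<String, Long> terms) {
        Long leaderTerm = terms.get(leader);
        if (leaderTerm == null) {
            throw new IllegalArgumentException("no term registered for leader " + leader);
        }
        List<String> targets = new ArrayList<>();
        // Copy into a TreeMap so the result order is deterministic (sorted by replica name).
        for (Map.Entry<String, Long> e : new TreeMap<>(terms).entrySet()) {
            boolean isLeader = e.getKey().equals(leader);
            if (!isLeader && e.getValue() >= leaderTerm) {
                targets.add(e.getKey());
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, Long> terms = new TreeMap<>();
        terms.put("core_node1", 2L); // leader
        terms.put("core_node2", 2L); // same term: still receives updates
        terms.put("core_node3", 1L); // lower term: skipped until it recovers
        System.out.println(replicasToForwardTo("core_node1", terms)); // prints [core_node2]
    }
}
```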
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325893#comment-16325893 ]

Shalin Shekhar Mangar commented on SOLR-11702:

Thanks Dat. A few comments/questions:

# DUP.setupRequest skips replicas having terms. If I understand correctly, this will mean that updates are no longer forwarded to replicas until they publish themselves in recovery? Is that right?
# CreateCollectionCmd -- throw InterruptedException directly from the method instead of trying to handle it here
# Mark LIR related classes/methods as deprecated -- those are more likely to get attention right before 8.0 I think.
# ElectionContext -- Minor typo - "this replica is registered its term" -- s/is/has
# RecoveringCoreTermWatcher -- Shouldn't lastTermDoRecovery be set after recovery completes? If not, how do we ensure that recoveries are stacked up?
# RecoveringCoreTermWatcher catches NullPointerException. Do a null check instead.
# RecoveryStrategy -- why pingLeader? isn't it sufficient to use ZkStateReader.getLeaderRetry as we used to do earlier?
# ZkCollectionTerms -- if getShard and remove methods need to be synchronized then seems like close can interfere. Perhaps better to synchronize on the terms map itself.
# Can you explain the purpose of "new".equals(cd.getCoreProperty("lirVersion", "new"))) used in various places?

I'm still going through the rest of the changes. I'll add some more comments later.
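The suggestion about ZkCollectionTerms (synchronize on the terms map itself so that getShard, remove and close cannot interleave) could look roughly like the sketch below. It is a simplified, hypothetical stand-in: the class, the `ShardTermsStub` type, and the method signatures are assumptions, and the real methods in the patch may take different arguments and do more work.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of locking on the terms map; not the real ZkCollectionTerms. */
final class CollectionTermsSketch implements AutoCloseable {

    /** Stand-in for the per-shard terms object. */
    static final class ShardTermsStub {
        private boolean closed;

        void close() {
            closed = true;
        }

        public boolean isClosed() {
            return closed;
        }
    }

    private final Map<String, ShardTermsStub> terms = new HashMap<>();

    /** Returns the terms object for a shard, creating it on first use. */
    public ShardTermsStub getShard(String shardId) {
        synchronized (terms) {
            return terms.computeIfAbsent(shardId, id -> new ShardTermsStub());
        }
    }

    /** Removes and closes the terms object for a shard, if present. */
    public void remove(String shardId) {
        synchronized (terms) {
            ShardTermsStub removed = terms.remove(shardId);
            if (removed != null) { // explicit null check instead of catching NullPointerException
                removed.close();
            }
        }
    }

    public int size() {
        synchronized (terms) {
            return terms.size();
        }
    }

    /** Closes every shard's terms object under the same lock used by getShard and remove. */
    @Override
    public void close() {
        synchronized (terms) {
            terms.values().forEach(ShardTermsStub::close);
            terms.clear();
        }
    }

    public static void main(String[] args) {
        CollectionTermsSketch collection = new CollectionTermsSketch();
        ShardTermsStub shard1 = collection.getShard("shard1");
        collection.close();
        System.out.println("shard1 closed: " + shard1.isClosed()); // prints shard1 closed: true
    }
}
```

Because all three methods take the same monitor, close cannot run halfway through a getShard or remove call, which is the interference the comment is worried about.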
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318845#comment-16318845 ]

Erick Erickson commented on SOLR-11702:
---------------------------------------

OK, reporting back. My problem was totally unrelated unfortunately. Mine went away with upgrading Jetty.
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307487#comment-16307487 ]

Erick Erickson commented on SOLR-11702:
---------------------------------------

OK, I'm giving it a try. My test case is quite simple: set up a 1-shard, 4-replica collection and fire a bunch of updates at it. So far this happens on 6.3 (where the problem was first reported) _and_ on 7.x. I suspect it happens on master too, but I don't want to spend the time since it happens on 7x.

Anyway, the patch applied cleanly and I'm running the test now. Basic auth doesn't seem to be necessary. I'll report back later.
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307357#comment-16307357 ]

Cao Manh Dat commented on SOLR-11702:
-------------------------------------

The current logic is quite stable. If the test can help us find some bugs in the current implementation, that will be great!
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307273#comment-16307273 ]

Erick Erickson commented on SOLR-11702:
---------------------------------------

[~caomanhdat] As it happens, I'm working on understanding why a replica is going into LIR, and I have a test setup that lets me reproduce it reasonably reliably (although it may take a few hours). I'm determining whether having basic auth enabled is necessary or not. I _believe_ I've seen this on 7x and master.

The point is, when you think the patch is ready I'd be happy to give it a go in my test environment, although it may take me a week. Let me know.
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307221#comment-16307221 ]

Cao Manh Dat commented on SOLR-11702:
-------------------------------------

[~shalinmangar] [~markrmil...@gmail.com] I pushed all the changes to jira/solr-11702. Would you mind taking a look? Thanks!
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285422#comment-16285422 ]

Cao Manh Dat commented on SOLR-11702:
-------------------------------------

[~manokovacs] Yeah, that's the idea of _term_. But the current design does not say anything about DOWN replicas. I am postponing the fix for SOLR-7065 and SOLR-7034 to another issue, where we will introduce a new rule like this: "only return success if all DOWN replicas have a term less than the leader's term".
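The rule proposed for the follow-up issue can be read as a simple predicate over per-replica terms. A minimal sketch, assuming terms are held in a map from replica name to a long and that a replica with no recorded term counts as term 0; the class and method names are hypothetical and do not come from the patch:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of the SOLR-11702 patch.
final class DownReplicaRuleSketch {

    // Returns true only if every DOWN replica has a term strictly lower than
    // the leader's term. Replicas without a recorded term are treated as
    // term 0, which is an assumption made for this sketch.
    static boolean canReportSuccess(Map<String, Long> terms,
                                    String leader,
                                    Set<String> downReplicas) {
        long leaderTerm = terms.getOrDefault(leader, 0L);
        for (String replica : downReplicas) {
            long replicaTerm = terms.getOrDefault(replica, 0L);
            if (replicaTerm >= leaderTerm) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> terms = new HashMap<>();
        terms.put("leader", 3L);
        terms.put("replica1", 2L);
        System.out.println(canReportSuccess(terms, "leader", Set.of("replica1")));
    }
}
```

Under this reading, a DOWN replica whose term still equals the leader's blocks the success response, because it may come back believing it is in sync.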
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283455#comment-16283455 ]

Mano Kovacs commented on SOLR-11702:
------------------------------------

Really like this approach, [~caomanhdat]. It is not just cleaner and more robust; I believe it could also be an alternative solution for the problems that motivate SOLR-7065. Correct me if I am wrong, but a replica could become leader regardless of its previous state or the number of replicas participating, as its (and the others') term number would explicitly say whether it is in sync or behind. Is my assumption correct?
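The assumption in this comment, that the term number alone tells whether a replica is in sync, can be written down as a small check. This is a sketch of that reading only, with hypothetical names; it assumes the replica holding the highest recorded term is the one that has seen every update, which the comment asks about rather than states.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the question above; not code from the patch.
final class TermEligibilitySketch {

    // A replica is treated as eligible for leadership when its term equals the
    // highest term recorded for the shard. A replica without a term is never
    // eligible in this sketch.
    static boolean canBecomeLeader(Map<String, Long> terms, String replica) {
        Long term = terms.get(replica);
        if (term == null) {
            return false;
        }
        long highest = Collections.max(terms.values());
        return term == highest;
    }

    public static void main(String[] args) {
        Map<String, Long> terms = new HashMap<>();
        terms.put("replica1", 5L);
        terms.put("replica2", 4L);
        System.out.println("replica1 eligible: " + canBecomeLeader(terms, "replica1"));
        System.out.println("replica2 eligible: " + canBecomeLeader(terms, "replica2"));
    }
}
```

Note that the check ignores the replica's previous state and how many replicas take part, which is exactly the property the comment is asking about.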
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282999#comment-16282999 ]

Cao Manh Dat commented on SOLR-11702:
-------------------------------------

[~mdrob] That's right. I borrowed the idea of terms from Raft. Any replica can update its own term to equal the leader's term. Only the leader can increase the terms of other replicas.
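The two rules in this reply can be sketched as an in-memory model. This is one possible reading, not the patch itself: it assumes that when the leader increases terms it raises its own term and the terms of the replicas that kept up, leaving behind any replica that missed an update. The class name, the method names, and the starting term of 0 are all hypothetical, and the real implementation stores terms in ZooKeeper rather than in a local map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of shard terms; not code from the patch.
final class ShardTermsSketch {
    private final Map<String, Long> terms = new HashMap<>();

    // Assumption: a newly registered replica starts at term 0.
    void register(String replica) {
        terms.putIfAbsent(replica, 0L);
    }

    long termOf(String replica) {
        return terms.getOrDefault(replica, 0L);
    }

    // Leader-only operation: raise the term of the leader and of every replica
    // currently at the leader's term, except the replicas that missed an update.
    void leaderIncreaseTerms(String leader, Set<String> missedUpdate) {
        if (missedUpdate.contains(leader)) {
            throw new IllegalArgumentException("leader cannot be left behind");
        }
        long leaderTerm = termOf(leader);
        for (Map.Entry<String, Long> entry : terms.entrySet()) {
            if (missedUpdate.contains(entry.getKey())) {
                continue;
            }
            if (entry.getValue() == leaderTerm) {
                entry.setValue(leaderTerm + 1);
            }
        }
    }

    // Any replica may set its own term equal to the leader's term,
    // for example after it has finished recovering.
    void catchUpToLeader(String replica, String leader) {
        terms.put(replica, termOf(leader));
    }

    // A replica whose term is lower than the leader's is behind.
    boolean needsRecovery(String replica, String leader) {
        return termOf(replica) < termOf(leader);
    }

    public static void main(String[] args) {
        ShardTermsSketch shard = new ShardTermsSketch();
        shard.register("leader");
        shard.register("replica1");
        shard.leaderIncreaseTerms("leader", Set.of("replica1"));
        System.out.println("replica1 needs recovery: " + shard.needsRecovery("replica1", "leader"));
    }
}
```

In this model a replica can never overtake the leader on its own: the only way a term goes up is through the leader-only operation, and the only thing a replica can do by itself is copy the leader's current value.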
[jira] [Commented] (SOLR-11702) Redesign current LIR implementation
[ https://issues.apache.org/jira/browse/SOLR-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282992#comment-16282992 ]

Mike Drob commented on SOLR-11702:
----------------------------------

Ooooh, good approach. This is similar in concept to how Raft works, I think. One thing that is unclear from the design doc (I haven't looked at the code yet) is who updates the ZK terms when a replica joins recovery. Is that the result of the leader acknowledging the PrepRecoveryCmd?