[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393925#comment-16393925
 ] 

ASF subversion and git services commented on SOLR-12011:


Commit 40660ade9d296bacae4b7a2e23364da8aeae7b35 in lucene-solr's branch 
refs/heads/branch_7x from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=40660ad ]

SOLR-12011: Remove unused imports

(cherry picked from commit e47bf8b)


> Consistence problem when in-sync replicas are DOWN
> --
>
> Key: SOLR-12011
> URL: https://issues.apache.org/jira/browse/SOLR-12011
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Fix For: 7.3, master (8.0)
>
> Attachments: SOLR-12011.patch, SOLR-12011.patch, SOLR-12011.patch, 
> SOLR-12011.patch, SOLR-12011.patch
>
>
> Currently, we can run into a consistency problem when in-sync replicas are 
> DOWN. For example:
>  1. A collection has 1 shard, with 1 leader and 2 replicas
>  2. The nodes containing the 2 replicas go down
>  3. The leader receives an update A successfully
>  4. The node containing the leader goes down
>  5. The 2 replicas come back
>  6. One of them becomes leader --> but it shouldn't become leader, since it 
> missed update A
> A solution to this issue:
>  * The idea here is that the term value of each replica (SOLR-11702) is 
> enough to tell whether a replica has received the latest updates. Therefore 
> only replicas with the highest term can become the leader.
>  * There are a couple of things that need to be done on this issue:
>  ** When the leader receives its first update, its term should be changed 
> from 0 -> 1, so further replicas added to the same shard won't be able to 
> become leader (their term = 0) until they finish recovery
>  ** For DOWN replicas, the leader also needs to check (in DUP.finish()) 
> that those replicas have a term less than the leader's before returning 
> results to users
>  ** Just looking at the term value of a replica is not enough to tell 
> whether that replica is in sync with the leader, because the replica might 
> not have finished the recovery process. We need to introduce another flag 
> (stored on the shard term node in ZK) to tell whether a replica has finished 
> recovery. It will look like this:
>  *** {"core_node1" : 1, "core_node2" : 0} (when core_node2 starts recovery) 
> --->
>  *** {"core_node1" : 1, "core_node2" : 1, "core_node2_recovering" : 1} 
> (when core_node2 finishes recovery) --->
>  *** {"core_node1" : 1, "core_node2" : 1}
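
To make the eligibility rule above concrete, here is a minimal sketch in
plain Java (hypothetical names and a plain map; this is not Solr's actual
election code), combining the highest-term rule with the recovering flag:

{code:java}
import java.util.Map;

class LeaderEligibilitySketch {
  // A replica may become leader only if its term equals the highest term in
  // the shard AND it is not still flagged as recovering.
  static boolean canBecomeLeader(Map<String, Long> terms, String coreNodeName) {
    long highest = terms.entrySet().stream()
        .filter(e -> !e.getKey().endsWith("_recovering")) // skip recovery flags
        .mapToLong(Map.Entry::getValue)
        .max().orElse(0L);
    boolean stillRecovering = terms.containsKey(coreNodeName + "_recovering");
    return terms.getOrDefault(coreNodeName, 0L) == highest && !stillRecovering;
  }
}
{code}

With the transition shown above, core_node2 becomes eligible only in the
final state, once its term matches core_node1's and its recovering flag is
gone.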



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393893#comment-16393893
 ] 

ASF subversion and git services commented on SOLR-12011:


Commit e47bf8b63aa732c884091f48ffe5b467c94e590c in lucene-solr's branch 
refs/heads/master from [~shalinmangar]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e47bf8b ]

SOLR-12011: Remove unused imports





[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385854#comment-16385854
 ] 

ASF subversion and git services commented on SOLR-12011:


Commit d7824a3793f8ec697a6f9a4f12eeb052a68b4782 in lucene-solr's branch 
refs/heads/branch_7x from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d7824a3 ]

SOLR-12011: Remove FORCEPREPAREFORLEADERSHIP





[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385852#comment-16385852
 ] 

ASF subversion and git services commented on SOLR-12011:


Commit 27eb6ba0620abc096aeabf5817e54c9e27976074 in lucene-solr's branch 
refs/heads/master from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=27eb6ba ]

SOLR-12011: Remove FORCEPREPAREFORLEADERSHIP





[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-03 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384982#comment-16384982
 ] 

ASF subversion and git services commented on SOLR-12011:


Commit 59f67468b7f9f90f2377033d358521d451508f9a in lucene-solr's branch 
refs/heads/branch_7x from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=59f67468 ]

SOLR-12011: Consistence problem when in-sync replicas are DOWN





[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-03 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384981#comment-16384981
 ] 

ASF subversion and git services commented on SOLR-12011:


Commit 9de4225e9a54ba987c2c7c9d4510bea3e4f9de97 in lucene-solr's branch 
refs/heads/master from [~caomanhdat]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9de4225 ]

SOLR-12011: Consistence problem when in-sync replicas are DOWN





[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-02 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384312#comment-16384312
 ] 

Cao Manh Dat commented on SOLR-12011:
-

[~elyograg] Yeah, the shard just waits, with no leader at all, until the 
replica that got the update comes back, OR users use the FORCE_LEADER API (if 
it never comes back).
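
For reference, invoking that force-leader election from SolrJ looks roughly
like this (collection, shard, and ZK address are placeholders; this assumes
SolrJ's collection admin API, and should only be a last resort, since it can
make a stale replica the leader and silently drop update A):

{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class ForceLeaderExample {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("localhost:2181").build()) { // placeholder ZK address
      // Force a leader election on a shard that is stuck without a leader.
      CollectionAdminRequest.forceLeaderElection("myCollection", "shard1")
          .process(client);
    }
  }
}
{code}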




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383820#comment-16383820
 ] 

Shawn Heisey commented on SOLR-12011:
-

As I understand it, you're describing a situation where only the leader is up 
for an update, then that one replica (the leader) goes down, followed by the 
other replicas coming back up.

What are you suggesting happens in this situation?  Does the shard just wait, 
with no leader at all, until the replica that got the update comes back?  What 
if it never comes back, or when it comes back all its data is gone?





[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-02 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383405#comment-16383405
 ] 

Cao Manh Dat commented on SOLR-12011:
-

Thanks a lot for your hard work, [~shalinmangar].

The latest patch for this ticket will be committed soon.




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-02 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383394#comment-16383394
 ] 

Shalin Shekhar Mangar commented on SOLR-12011:
--

Thanks for explaining. +1 to commit.




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-01 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383261#comment-16383261
 ] 

Cao Manh Dat commented on SOLR-12011:
-

{quote}So I traced the code, and a live fetch on the leader works fine today, 
but only as a side-effect. We set the term equal to max for a recovering 
replica (using the ZkShardTerms.startRecovering() method) in 
ZkController.publish *before* we publish the replica state to the overseer 
queue. So if the leader (during prep recovery) sees the replica state as 
recovering, then ZooKeeper also guarantees that it will see the max term 
published before the recovering state was published. I think we should make 
this behavior clear via a code comment.
{quote}
Yeah, I will add a comment for clarification.
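
For illustration, that clarifying comment could read roughly as follows
(hypothetical wording and stand-in types, not the committed text):

{code:java}
class PublishOrderSketch {
  // NOTE: startRecovering() raises this replica's term to the shard's max
  // BEFORE the RECOVERING state is published to the overseer queue.
  // ZooKeeper's ordering guarantee therefore ensures that any observer (e.g.
  // the leader during prep recovery) that sees state == RECOVERING has
  // already seen the max term that was written first.
  void publishRecovering(ShardTerms terms, Overseer overseer, String coreNodeName) {
    terms.startRecovering(coreNodeName);               // 1. term := shard max
    overseer.publishState(coreNodeName, "recovering"); // 2. order matters
  }

  interface ShardTerms { void startRecovering(String coreNodeName); }
  interface Overseer   { void publishState(String coreNodeName, String state); }
}
{code}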
{quote}The changes in SolrCmdDistributor fix a different bug, no? Describe 
the problem here in this issue and how it is solved. Otherwise extract it to 
its own ticket.
{quote}
Hmm, you're right; maybe the changes in SolrCmdDistributor should go into 
SOLR-11702.
{quote}Latest patch added changes for RestoreCoreOp and SplitOp where an 
empty core is added new data
{quote}
The destination for {{RestoreCoreOp}} and {{SplitOp}} should be a slice with 
no more than 1 replica, and that is how some collection APIs use these admin 
APIs. If not, how could other replicas in the same shard acknowledge the 
changes and put themselves into recovery?




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-01 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383232#comment-16383232
 ] 

Shalin Shekhar Mangar commented on SOLR-12011:
--

bq. Should we? A replica only sends PrepRecoveryOp to the leader after it 
successfully updates its term. So I think a live fetch on the leader's side 
will be enough. And I'm afraid that looping on that call can cause an endless 
loop. (I'm not sure about this point.)

So I traced the code, and a live fetch on the leader works fine today, but 
only as a side-effect. We set the term equal to max for a recovering replica 
(using the ZkShardTerms.startRecovering() method) in ZkController.publish 
*before* we publish the replica state to the overseer queue. So if the leader 
(during prep recovery) sees the replica state as recovering, then ZooKeeper 
also guarantees that it will see the max term published before the recovering 
state was published. I think we should make this behavior clear via a code 
comment.

The changes in SolrCmdDistributor fix a different bug, no? Describe the 
problem here in this issue and how it is solved. Otherwise extract it to its 
own ticket.

bq. Latest patch added changes for RestoreCoreOp and SplitOp where an empty 
core is added new data

I don't understand this change. It seems to disallow restore or split if the 
slice has more than 1 replica?




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-03-01 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381688#comment-16381688
 ] 

Cao Manh Dat commented on SOLR-12011:
-

The latest patch adds changes for {{RestoreCoreOp}} and {{SplitOp}}, where an 
empty core has new data added to it. We should change the term of that 
replica from 0 -> 1, as in the sketch below.
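
A minimal sketch of that 0 -> 1 bump, operating on a plain terms map (the
real change lives in Solr's shard-terms handling; this is only an
illustration):

{code:java}
import java.util.Map;

class TermBumpSketch {
  // After restoring or splitting into an empty core, bump the terms of the
  // current members from 0 to 1 so that replicas added later (term = 0)
  // cannot become leader until they finish recovery.
  static void bumpZeroTerms(Map<String, Long> terms) {
    if (terms.values().stream().allMatch(t -> t == 0L)) {
      terms.replaceAll((replica, t) -> 1L); // persist with a ZK version check
    }
  }
}
{code}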




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-02-28 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381510#comment-16381510
 ] 

Cao Manh Dat commented on SOLR-12011:
-

Found a bug in the previous patch: DUP should skip replicas with state == 
DOWN as well (see the sketch below for the term condition that makes the 
skip safe).
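
Combining this with the term check from the issue description, a hedged
sketch of when skipping a DOWN replica is safe (plain types and hypothetical
names, not the actual DUP code):

{code:java}
import java.util.Map;

class SkipDownReplicaSketch {
  // A DOWN replica may be skipped only when its registered term is already
  // behind the leader's. A DOWN replica still at the leader's term must not
  // be silently dropped, or an acknowledged update could be missing from a
  // copy that is still considered in sync.
  static boolean maySkip(Map<String, Long> terms, String replica, String leader,
                         boolean replicaIsDown) {
    return replicaIsDown
        && terms.getOrDefault(replica, 0L) < terms.getOrDefault(leader, 0L);
  }
}
{code}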




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-02-28 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380116#comment-16380116
 ] 

Cao Manh Dat commented on SOLR-12011:
-

New patch based on [~shalinmangar]'s review. I will commit soon.




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-02-28 Thread Cao Manh Dat (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379977#comment-16379977
 ] 

Cao Manh Dat commented on SOLR-12011:
-

Thanks, [~shalinmangar].

1, 3, and 6 are correct.

2. Yeah, it seems that if {{isClosed}} we can simply return.

4. It can't happen: in {{ZkController.register()}} we register the replica's 
term first, then join the election after that (see the sketch below). 
Besides, the last published state is initialized to ACTIVE, so when a core is 
first loaded on startup of a node, the flag is useless.

5. Should we? A replica only sends PrepRecoveryOp to the leader after it 
successfully updates its term, so I think a live fetch on the leader's side 
will be enough. And I'm afraid that looping on that call can cause a race 
condition. (I'm not sure about this point.)
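
For point 4, a hypothetical outline of that ordering in
{{ZkController.register()}} (helper names are placeholders):

{code:java}
class RegisterOrderSketch {
  // The replica's term entry is created BEFORE joining the leader election,
  // so a replica can never compete for leadership without a term on record.
  void register(String coreNodeName) {
    registerTermInZk(coreNodeName); // 1. ensure a term entry exists (term = 0)
    joinElection(coreNodeName);     // 2. only then enter the election
  }

  void registerTermInZk(String coreNodeName) { /* hypothetical ZK write */ }
  void joinElection(String coreNodeName)     { /* hypothetical election join */ }
}
{code}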



[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-02-28 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379928#comment-16379928
 ] 

Shalin Shekhar Mangar commented on SOLR-12011:
--

Also, it'd be nice to describe in this issue the solution you have implemented 
for the sake of completeness.




[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

2018-02-28 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379927#comment-16379927
 ] 

Shalin Shekhar Mangar commented on SOLR-12011:
--

Thanks Dat. A few comments:
# The line {{log.info("skip url:{} cause its term is less than leader", 
replica.getCoreUrl());}} will be logged on every update request while the 
other replicas don't have the same term as the leader. Perhaps this should be 
at debug level.
# ElectionContext has {{if (weAreReplacement && isClosed)}}. Did you mean 
{{!isClosed}}?
# ElectionContext has {{getReplicas(EnumSet.of(Replica.Type.TLOG, 
Replica.Type.TLOG))}}. Perhaps you meant TLOG and NRT? (See the sketch after 
this list.)
# ElectionContext has a replaced shouldIBeLeader(), which had a check for the 
last published state being active. I'm curious whether there can be a 
condition where the term is not registered and the last published state is 
not active, and the replica therefore becomes a leader.
# PrepRecoveryOp refreshes terms if 
{{shardTerms.skipSendingUpdatesTo(coreNodeName)}} returns true. But should it 
not wait in a loop for the skip status to go away? The reason behind 
PrepRecovery is to ensure that when the call to prep recovery returns, the 
leader has already seen the {{waitForState}} state and is therefore already 
forwarding updates to the recovering replica. Now that the behavior has 
changed to forward updates only after the term is equal, rather than 
depending on seeing the 'recovering' state, we should change PrepRecovery as 
well.
# Add a comment before calling {{getShardTerms(collection, 
shardId).startRecovering(coreNodeName);}} and {{getShardTerms(collection, 
shardId).doneRecovering(coreNodeName);}} in ZkController.publish() describing 
why it is necessary and why only PULL replicas are excluded. I understand the 
reason, but it can be confusing to others reading this code.
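
For point 3, a tiny self-contained illustration of why the duplicated enum
constant is a bug (stand-in types, not Solr's):

{code:java}
import java.util.EnumSet;

public class ReplicaTypeFilterSketch {
  enum Type { NRT, TLOG, PULL } // stand-in for Replica.Type

  public static void main(String[] args) {
    // EnumSet.of(TLOG, TLOG) silently collapses to {TLOG}, so NRT replicas
    // would be invisible to the election check:
    System.out.println(EnumSet.of(Type.TLOG, Type.TLOG)); // [TLOG]
    // The presumable intent: the replica types that can become leader.
    System.out.println(EnumSet.of(Type.TLOG, Type.NRT));  // [NRT, TLOG]
  }
}
{code}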
