[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-11-09 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569-testfix.patch

There was a failure on Windows in one of the tests:
http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/5385/testReport/org.apache.solr.cloud/ForceLeaderTest/testReplicasInLIRNoLeader/

I'd like to get this change in, please (SOLR-7569-testfix.patch). Sorry for 
the trouble.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
>  Labels: difficulty-medium, impact-high
> Fix For: 5.4, Trunk
>
> Attachments: SOLR-7569-testfix.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569_lir_down_state_test.patch
>
>
> There are many reasons why Solr will not elect a leader for a shard, e.g. all 
> replicas' last published state was recovery, or bugs which cause a 
> leader to be marked as 'down'. While the best solution is that a shard never 
> gets into this state, we need a manual way to fix it when it does. 
> Right now we can do a dance that involves bouncing the node 
> (since recovery paths between bouncing and REQUESTRECOVERY are different), 
> but that is difficult when running a large cluster. Although such a manual 
> API may lead to some data loss, in some cases it is 
> the only possible option to restore availability.
> This issue proposes to build a new collection API which can be used to force 
> replicas into recovering a leader while avoiding data loss on a best effort 
> basis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-11-05 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

bq. This particular collection admin operation does not really have to go to 
overseer, it can be performed by the receiving node itself because the clearing 
of LIR node does not have to be done at overseer anyway

Here is a patch that adds the API command (FORCELEADER) to the 
CollectionsHandler instead of the OCMH. I couldn't find a way to do this 
asynchronously, which I could do in the OCMH. Did I miss something? Does this 
look fine? ([~noble.paul]?)

I somehow feel that doing it in CollectionsHandler is a bit misplaced, and would 
rather do it in the OCMH. But I am fine either way, so long as we do it; both 
patches are there.

Note: As with the previous patch (that puts the meat into the OCMH), this patch 
depends on prior application of patches in SOLR-8233 and SOLR-7989.
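
For readers who want to see what calling such a command could look like, here is a minimal sketch that only builds the request URL. The base URL, collection name and shard name are hypothetical, and the `collection` and `shard` parameter names are an assumption based on how other Collections API commands are addressed; the patch under discussion may differ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: builds a Collections API URL for the FORCELEADER command.
// The parameter names (collection, shard) are assumptions, not taken from the patch.
class ForceLeaderRequest {

    static String buildUrl(String solrBaseUrl, String collection, String shard) {
        return solrBaseUrl + "/admin/collections?action=FORCELEADER"
            + "&collection=" + URLEncoder.encode(collection, StandardCharsets.UTF_8)
            + "&shard=" + URLEncoder.encode(shard, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical host, collection and shard names.
        System.out.println(buildUrl("http://localhost:8983/solr", "collection1", "shard1"));
    }
}
```

Since the command can cost data, a client would presumably issue it only after confirming that the shard really has no leader.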

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-11-04 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

bq. Let's not keep the core admin command as OVERRIDELASTPUBLISHED. This means 
it can be a generic enough API which may be abused by others for other things. 
Let's not tell others what we are doing internally and keep the command name 
opaque

This patch uses FORCEPREPAREFORLEADERSHIP from SOLR-8233. Does this sound fine?

bq. This particular collection admin operation does not really have to go to 
overseer, it can be performed by the receiving node itself because the clearing 
of LIR node does not have to be done at overseer anyway

The reason why I wanted to keep it at Overseer was that most cluster management 
code is there. I can move this to CollectionsHandler instead of OCMH.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-11-04 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Updated patch. It now depends on SOLR-8233 and SOLR-7989.

After I removed the part that force marked down live replicas as active, 
recovery from a situation where all replicas were down (due to LIR) stopped 
working. The reason was that a down replica, when elected as leader, never 
gets marked as active. Fixed that in SOLR-7989.

[~markrmil...@gmail.com], [~shalinmangar] Please review this (as well as 
SOLR-8233 and SOLR-7989). Also, can you please edit this issue to depend on 
those two issues (I can't edit, since Shalin created the issue).

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-10-23 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Based on an offline conversation with Shalin (and the discussion above), I've 
removed that extra handling of the situation where:
# there is no LIR involved 
# all replicas are down
# there is no leader. 

This involved force marking the replica at the election queue head as a leader, 
which might have other unintended consequences. Hopefully, this situation never 
occurs in the real world. If it does, then we can tackle this in a separate 
issue.

The following situation is still taken care of:
# there is no LIR involved 
# all replicas are down

[~shalinmangar] please review the changes. Thanks.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-10-23 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Ah, I left the test out of my last patch. Here it is.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-10-22 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Thanks, Shalin, for looking into the patch and for your review.

bq. ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep, why? Isn't 
waitForRecoveriesToFinish() enough?
Fixed. This was left over from a previous patch. I think I wanted to add 
the waitForRecoveriesToFinish() call, but forgot to remove the 5 second sleep.

bq. Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady 
state to be reached? What is this steady state, is there a better way than 
waiting for an arbitrary amount of time? In general, Thread.sleep should be 
avoided as much as possible as a way to reach steady state.
In this case, waiting those 15 seconds results in one of the down replicas 
becoming the leader (but staying down). This is the situation I'm using 
FORCELEADER to recover from. Instead of waiting a fixed 15 seconds, I've added 
polling with a wait so that the test wakes up earlier if possible, while 
increasing the timeout from 15s to 25s.
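
The polling approach can be sketched roughly as follows. This is not the code from the patch; the class name, method name and parameters are hypothetical, and the condition stands in for whatever cluster state check the test performs.

```java
import java.util.function.BooleanSupplier;

// Sketch only: poll a condition until it holds or a timeout expires,
// instead of sleeping for a fixed amount of time.
class PollUntil {

    // Returns true as soon as the condition holds, false on timeout or interrupt.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(intervalMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Hypothetical condition: becomes true after roughly 200 ms.
        boolean reached = waitFor(() -> System.currentTimeMillis() - start >= 200, 25_000, 50);
        System.out.println("steady state reached: " + reached);
    }
}
```

The larger timeout only costs time when the condition is never met, which is why raising it from 15s to 25s does not slow down the passing case.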


bq. Can you please add some javadocs on the various test methods describing the 
scenario that they are testing?
Sure, added.

bq. minor nit - can you use assertEquals when testing equality of state etc 
instead of assertTrue. The advantage with assertEquals is that it logs the 
mismatched values in the exception messages.
Used assertEquals() now.

bq. In OverseerCollectionMessageHandler, lirPath can never be null. The lir 
path should probably be logged in debug rather than INFO.
Thanks for the pointer, I've removed the null check. I feel this should be INFO 
instead of DEBUG, so that if a user reports that FORCELEADER was issued but 
nothing worked, the logs would help us understand whether there was any LIR 
state which was cleared out. But please feel free to remove it if this doesn't 
make sense.

bq. minor nit - you can compare enums directly using == instead of .equals
Fixed.
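
As a small illustration of the point about enums: the sketch below uses a hypothetical `State` enum, not Solr's own replica state type. Each enum constant is a singleton, so `==` and `.equals` agree for non-null values, but `==` also tolerates a null left-hand side.

```java
// Sketch only: comparing enum values with == versus .equals.
class EnumCompare {

    // Hypothetical enum standing in for a replica state.
    enum State { ACTIVE, DOWN, RECOVERING }

    // Null-safe: a null state is simply not ACTIVE.
    static boolean isActive(State state) {
        return state == State.ACTIVE;
    }

    // Throws NullPointerException when state is null.
    static boolean isActiveViaEquals(State state) {
        return state.equals(State.ACTIVE);
    }

    public static void main(String[] args) {
        System.out.println(isActive(State.ACTIVE));
        System.out.println(isActive(null));
    }
}
```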

bq. Referring to the following, what is the thinking behind it? when can this 
happen? is there a test which specifically exercises this scenario? seems like 
this can interfere with the leader election if the leader election was taking 
some time? 

I modified the comment text to make it clearer. This is for the situation 
when all replicas are (somehow, due to a bug maybe?) down/recovering (but not in 
LIR), and there is no leader, even though many replicas are on live nodes; I 
don't know if this ever happens (the LIR case happens, I know). The 
testAllReplicasDownNoLeader test exercises this scenario. This is more or less 
the scenario that you described (with one difference that there is no leader as 
well): {{Leader is not live: Replicas are live but 'down' or 'recovering' -> 
mark them 'active'}}.

As you point out, I think it can indeed interfere with any ongoing leader 
election; my thought was that this FORCELEADER call is issued only because the 
leader election isn't achieving a stable leader, so force marking the queue 
head replica as leader is okay. But I defer to your judgement on whether this 
is fine, and I can remove (or please feel free to remove) that code path from 
the patch if you feel it is not right.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569_lir_down_state_test.patch


[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-09-21 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

bq. When you use this, it will be because the system is blocking a leader from 
taking over. By running this API command, you remove the blocks, thus 'forcing' 
a leader the system would not normally pick - or at least attempting to force a 
leader the system would not really pick.

That makes sense! :-) I've renamed this to FORCELEADER now. 
I also managed to bring the killed-off leaders back to life in the tests.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-09-19 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

[~thelabdude] Thanks for your review. I've added an extra wait before killing 
the leader. I tried to bring back a killed leader, but it does not seem to work. 
I tried many things, but was stuck with an "address already in use" exception, 
perhaps due to the socket proxy holding on to the port.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-09-08 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

* The async parameter is now passed through.
* Tests now randomly make async requests for the recover shard API call.

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-09-02 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

> Create an API to force a leader election between nodes
> --
>
> Key: SOLR-7569
> URL: https://issues.apache.org/jira/browse/SOLR-7569
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Shalin Shekhar Mangar
>  Labels: difficulty-medium, impact-high
> Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569_lir_down_state_test.patch



[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-08-27 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

 Create an API to force a leader election between nodes
 --

 Key: SOLR-7569
 URL: https://issues.apache.org/jira/browse/SOLR-7569
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
  Labels: difficulty-medium, impact-high
 Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
 SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
 SOLR-7569_lir_down_state_test.patch





[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-08-27 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Added more logging to the tests, plus tests asserting that indexing fails 
during the down state and works again after the recover operation.

 Create an API to force a leader election between nodes
 --

 Key: SOLR-7569
 URL: https://issues.apache.org/jira/browse/SOLR-7569
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
  Labels: difficulty-medium, impact-high
 Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
 SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch





[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-08-26 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Adding a wait for recoveries to finish after the recovery operation in the test.

 Create an API to force a leader election between nodes
 --

 Key: SOLR-7569
 URL: https://issues.apache.org/jira/browse/SOLR-7569
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
  Labels: difficulty-medium, impact-high
 Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
 SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch








[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-08-25 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

bq. This should just get you out of the state that is preventing a leader from 
being elected.
Updating the patch that attempts to do just that. :-)
* Clear LIR znodes
* Mark all nodes as active
* Wait for leader election so that normal state is restored 

(I also tried marking just one of the replicas as active instead of all of 
them, hoping it would become leader and the others would recover from it. 
However, this resulted in one of the other down replicas becoming leader while 
still staying down. Looking into why that could be happening; bug?)
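
The three steps listed above can be sketched roughly as follows. This is only an in-memory illustration, not the code in the patch: the function name force_recover_shard, the set of LIR markers, the replica state dictionary, and the get_leader callback are all hypothetical stand-ins for what would really be ZooKeeper state and cluster-state lookups.

```python
import time
from typing import Callable, Dict, Optional, Set


def force_recover_shard(lir_flags: Set[str],
                        replica_states: Dict[str, str],
                        get_leader: Callable[[], Optional[str]],
                        timeout_s: float = 5.0,
                        poll_s: float = 0.05) -> str:
    """Hypothetical model of the three steps; returns the elected leader."""
    # Step 1: clear the leader-initiated recovery (LIR) markers for the shard.
    lir_flags.clear()

    # Step 2: mark every replica as active.
    for replica in replica_states:
        replica_states[replica] = "active"

    # Step 3: poll until a leader shows up, or give up after timeout_s.
    deadline = time.monotonic() + timeout_s
    while True:
        leader = get_leader()
        if leader is not None:
            return leader
        if time.monotonic() >= deadline:
            raise TimeoutError("no leader elected within %.1fs" % timeout_s)
        time.sleep(poll_s)
```

The timeout is an assumption of this sketch; without one, a shard that never elects a leader would block the caller indefinitely.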

 Create an API to force a leader election between nodes
 --

 Key: SOLR-7569
 URL: https://issues.apache.org/jira/browse/SOLR-7569
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
  Labels: difficulty-medium, impact-high
 Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
 SOLR-7569_lir_down_state_test.patch








[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-08-24 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569_lir_down_state_test.patch

A patch containing the test that simulates the state described above.

 Create an API to force a leader election between nodes
 --

 Key: SOLR-7569
 URL: https://issues.apache.org/jira/browse/SOLR-7569
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
  Labels: difficulty-medium, impact-high
 Attachments: SOLR-7569.patch, SOLR-7569.patch, 
 SOLR-7569_lir_down_state_test.patch








[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-08-20 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Updated patch with some cleanup. Still some TODOs, nocommits to go.

 Create an API to force a leader election between nodes
 --

 Key: SOLR-7569
 URL: https://issues.apache.org/jira/browse/SOLR-7569
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
  Labels: difficulty-medium, impact-high
 Attachments: SOLR-7569.patch, SOLR-7569.patch








[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

2015-08-18 Thread Ishan Chattopadhyaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya updated SOLR-7569:
---
Attachment: SOLR-7569.patch

Trying to tackle the situation where all replicas (including the leader) are 
somehow marked down (maybe due to bugs?) and there is no leader in the shard, 
so the entire shard is down. Adding a new collection API, RECOVERSHARD.

In this patch, I am evaluating the following approach:

* Remove all leader-initiated recovery flags for this shard.
* Pick the next leader: if the leader election queue is not empty and the first 
replica in the queue is on a live node, choose that replica as the next leader. 
Otherwise, pick a random replica to become the next leader (TODO: we can have 
the user specify which replica he/she wants as the next leader).
* If the chosen leader is not at the head of the leader election queue, 
have it join the election at the head (similar to what REBALANCELEADERS tries 
to do). [TODO]
* Mark the next leader as active. Mark the rest of the replicas (which are on 
live nodes) as recovering.
* Issue the core admin REQUESTRECOVERY command to all the replicas except the 
next leader.
* Wait till recovery completes. [TODO]

Does the above approach sound reasonable? Does the patch seem reasonable?
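
To make the leader-selection and state-marking steps above concrete, here is a minimal sketch. It is not taken from the patch: pick_next_leader, plan_states, and the plain dictionaries and sets are hypothetical, and restricting the random fallback to replicas on live nodes is an assumption of this sketch rather than something stated above.

```python
import random
from typing import Dict, List, Optional, Set


def pick_next_leader(election_queue: List[str],
                     replica_nodes: Dict[str, str],
                     live_nodes: Set[str],
                     rng: Optional[random.Random] = None) -> str:
    """Choose the replica to promote: queue head if live, else a random one."""
    rng = rng or random.Random()
    if election_queue:
        head = election_queue[0]
        # Prefer the head of the election queue when its node is live.
        if replica_nodes.get(head) in live_nodes:
            return head
    # Assumption: the random fallback only considers replicas on live nodes.
    candidates = sorted(r for r, n in replica_nodes.items() if n in live_nodes)
    if not candidates:
        raise ValueError("no replica is hosted on a live node")
    return rng.choice(candidates)


def plan_states(leader: str,
                replica_nodes: Dict[str, str],
                live_nodes: Set[str]) -> Dict[str, str]:
    """Leader becomes active; other replicas on live nodes become recovering."""
    states = {leader: "active"}
    for replica, node in replica_nodes.items():
        if replica != leader and node in live_nodes:
            states[replica] = "recovering"
    return states
```

Replicas on nodes that are not live are left out of the plan, matching the "(which are on live nodes)" qualifier in the list; the replicas marked recovering are the ones that would then receive REQUESTRECOVERY.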

 Create an API to force a leader election between nodes
 --

 Key: SOLR-7569
 URL: https://issues.apache.org/jira/browse/SOLR-7569
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Shalin Shekhar Mangar
  Labels: difficulty-medium, impact-high
 Attachments: SOLR-7569.patch




