Problem: "pull" replica commits to leader?
Hi everyone,

We are upgrading from Solr 7.5.0 to 8.6.2 and use a mix of TLOG and PULL replicas. Configuration: the PULL node is started after the TLOG node has been filled with data.

Problem: the data is never synchronized to the PULL replica. Looking at the log, it appears the PULL replica is trying to issue a commit to the leader in the doReplicateOnlyRecovery function. Is this a bug, or is our setup wrong?

The log looks like this:

2020-09-14 09:34:45.576 ERROR (recoveryExecutor-11-thread-1-processing-n:172.20.17.40:8983_solr x:mycollection_shard1_replica_p3 c:mycollection s:shard1 r:core_node4) [c:mycollection s:shard1 r:core_node4 x:mycollection_shard1_replica_p3] o.a.s.c.RecoveryStrategy Error while trying to recover: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://172.20.16.100:8983/solr/mycollection_shard1_replica_t1: Thou shall not issue a commit!
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:214)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:231)
        at org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:298)
        at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:230)
        at org.apache.solr.cloud.RecoveryStrategy.doReplicateOnlyRecovery(RecoveryStrategy.java:394)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:336)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:317)
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:212)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

I need a solution. Thank you.
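[For reference: a PULL replica only pulls index segments from the shard leader and should never need to commit there. A minimal sketch of how such a mixed-type collection is typically created with the Collections API (available since Solr 7); the host and collection name are taken from the log above, and the replica counts are illustrative:]

```shell
# Build the Collections API CREATE call for a collection with one TLOG
# replica (can become leader) and one PULL replica (replicates only).
SOLR="http://172.20.16.100:8983/solr"
CREATE_URL="$SOLR/admin/collections?action=CREATE&name=mycollection&numShards=1&tlogReplicas=1&pullReplicas=1"
echo "curl '$CREATE_URL'"
```

[If the collection was created this way, the PULL replica should recover via index replication alone, which is why the commit attempt in the stack trace looks suspicious.]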
Weak Leader & Weak Replica VS Strong Leader
Hi all,

Maybe a slightly tricky question, but I need to ask. Let's say I have infinite RAM and infinite SSDs, but limited CPU (say, 4 CPUs per shard). So, my question is: which is more preferable?

1. One leader with 4 CPUs, OR
2. One leader with 2 CPUs and one replica with 2 CPUs, OR
3. One leader with 1 CPU and 3 replicas with 1 CPU each?

I understand that the options with replicas are preferable for fault tolerance, BUT what about PERFORMANCE, theoretically?

-- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: replica never takes leader role
Yes, after 45 seconds a replica should take over as leader. The logs of the replica that should be taking over will likely explain why this is not happening.

- Mar

On Wed Jan 28 2015 at 2:52:32 PM Joshi, Shital shital.jo...@gs.com wrote:

When the leader reaches 99% physical memory on the box and starts swapping (and stops replicating), we forcefully bring down the leader (first kill -15, then kill -9 if kill -15 doesn't work). This is when we expect the replica to assume the leader's role, and it never happens. The Zookeeper timeout is 45 seconds. We can increase it up to 2 minutes and test.

<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}" zkClientTimeout="${zkClientTimeout:45000}">

As per the definition of zkClientTimeout: after the leader is brought down and hasn't talked to Zookeeper for 45 seconds, shouldn't ZK promote a replica to leader? I am not sure how increasing the ZK timeout will help.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, January 28, 2015 11:42 AM
To: solr-user@lucene.apache.org
Subject: Re: replica never takes leader role

This is not the desired behavior at all. I know there have been improvements in this area since 4.8, but I can't seem to locate the JIRAs. I'm curious _why_ the nodes are going down, though: is it happening at random, or are you taking them down? One problem has been that the Zookeeper timeout used to default to 15 seconds, and occasionally a node would be unresponsive (sometimes due to GC pauses) and exceed the timeout. So upping the ZK timeout has helped some people avoid this...

FWIW,
Erick

On Wed, Jan 28, 2015 at 7:11 AM, Joshi, Shital shital.jo...@gs.com wrote:

We're using Solr 4.8.0

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, January 27, 2015 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: replica never takes leader role

What version of Solr? This is an ongoing area of improvement, and several fixes are very recent. Try searching the Solr JIRA for details.

Best,
Erick

On Tue, Jan 27, 2015 at 1:51 PM, Joshi, Shital shital.jo...@gs.com wrote:

Hello,

We have a SolrCloud cluster (5 shards, 2 replicas) on 10 boxes with three Zookeeper instances. We have noticed that when a leader node goes down, the replica never takes over as leader, the cloud becomes unusable, and we have to bounce the entire cloud for a replica to assume the leader role. Is this the default behavior? How can we change this?

Thanks.
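[For what it's worth, the ${zkClientTimeout:45000} placeholder in the quoted solr.xml means the value falls back to 45000 ms unless a zkClientTimeout system property is supplied at startup. A sketch of overriding it to 2 minutes; the Jetty start command shown is illustrative for a 4.x deployment:]

```shell
# Override the solr.xml default of 45000 ms via the system property that
# the ${zkClientTimeout:45000} placeholder reads. 120000 ms = 2 minutes.
ZK_TIMEOUT_MS=120000
START_CMD="java -DzkClientTimeout=${ZK_TIMEOUT_MS} -jar start.jar"
echo "$START_CMD"
```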
RE: replica never takes leader role
We're using Solr 4.8.0

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, January 27, 2015 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: replica never takes leader role

What version of Solr? This is an ongoing area of improvement, and several fixes are very recent. Try searching the Solr JIRA for details.

Best,
Erick

On Tue, Jan 27, 2015 at 1:51 PM, Joshi, Shital shital.jo...@gs.com wrote:

Hello,

We have a SolrCloud cluster (5 shards, 2 replicas) on 10 boxes with three Zookeeper instances. We have noticed that when a leader node goes down, the replica never takes over as leader, the cloud becomes unusable, and we have to bounce the entire cloud for a replica to assume the leader role. Is this the default behavior? How can we change this?

Thanks.
Re: replica never takes leader role
This is not the desired behavior at all. I know there have been improvements in this area since 4.8, but I can't seem to locate the JIRAs. I'm curious _why_ the nodes are going down, though: is it happening at random, or are you taking them down? One problem has been that the Zookeeper timeout used to default to 15 seconds, and occasionally a node would be unresponsive (sometimes due to GC pauses) and exceed the timeout. So upping the ZK timeout has helped some people avoid this...

FWIW,
Erick

On Wed, Jan 28, 2015 at 7:11 AM, Joshi, Shital shital.jo...@gs.com wrote:

We're using Solr 4.8.0

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, January 27, 2015 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: replica never takes leader role

What version of Solr? This is an ongoing area of improvement, and several fixes are very recent. Try searching the Solr JIRA for details.

Best,
Erick

On Tue, Jan 27, 2015 at 1:51 PM, Joshi, Shital shital.jo...@gs.com wrote:

Hello,

We have a SolrCloud cluster (5 shards, 2 replicas) on 10 boxes with three Zookeeper instances. We have noticed that when a leader node goes down, the replica never takes over as leader, the cloud becomes unusable, and we have to bounce the entire cloud for a replica to assume the leader role. Is this the default behavior? How can we change this?

Thanks.
RE: replica never takes leader role
When the leader reaches 99% physical memory on the box and starts swapping (and stops replicating), we forcefully bring down the leader (first kill -15, then kill -9 if kill -15 doesn't work). This is when we expect the replica to assume the leader's role, and it never happens. The Zookeeper timeout is 45 seconds. We can increase it up to 2 minutes and test.

<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}" zkClientTimeout="${zkClientTimeout:45000}">

As per the definition of zkClientTimeout: after the leader is brought down and hasn't talked to Zookeeper for 45 seconds, shouldn't ZK promote a replica to leader? I am not sure how increasing the ZK timeout will help.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, January 28, 2015 11:42 AM
To: solr-user@lucene.apache.org
Subject: Re: replica never takes leader role

This is not the desired behavior at all. I know there have been improvements in this area since 4.8, but I can't seem to locate the JIRAs. I'm curious _why_ the nodes are going down, though: is it happening at random, or are you taking them down? One problem has been that the Zookeeper timeout used to default to 15 seconds, and occasionally a node would be unresponsive (sometimes due to GC pauses) and exceed the timeout. So upping the ZK timeout has helped some people avoid this...

FWIW,
Erick

On Wed, Jan 28, 2015 at 7:11 AM, Joshi, Shital shital.jo...@gs.com wrote:

We're using Solr 4.8.0

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, January 27, 2015 7:47 PM
To: solr-user@lucene.apache.org
Subject: Re: replica never takes leader role

What version of Solr? This is an ongoing area of improvement, and several fixes are very recent. Try searching the Solr JIRA for details.

Best,
Erick

On Tue, Jan 27, 2015 at 1:51 PM, Joshi, Shital shital.jo...@gs.com wrote:

Hello,

We have a SolrCloud cluster (5 shards, 2 replicas) on 10 boxes with three Zookeeper instances. We have noticed that when a leader node goes down, the replica never takes over as leader, the cloud becomes unusable, and we have to bounce the entire cloud for a replica to assume the leader role. Is this the default behavior? How can we change this?

Thanks.
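[The shutdown sequence described above (kill -15 first, kill -9 only if the process survives) can be sketched as below. The demo uses a stand-in process; for Solr, PID would be the JVM's pid, and the grace period would be longer:]

```shell
sleep 300 &            # stand-in for the Solr JVM
PID=$!
kill -15 "$PID"        # polite shutdown (SIGTERM)
sleep 2                # grace period; use a longer one for a real JVM
if kill -0 "$PID" 2>/dev/null; then
    kill -9 "$PID"     # escalate only if still alive (SIGKILL)
fi
wait "$PID" 2>/dev/null
kill -0 "$PID" 2>/dev/null || echo "process stopped"
```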
replica never takes leader role
Hello,

We have a SolrCloud cluster (5 shards, 2 replicas) on 10 boxes with three Zookeeper instances. We have noticed that when a leader node goes down, the replica never takes over as leader, the cloud becomes unusable, and we have to bounce the entire cloud for a replica to assume the leader role. Is this the default behavior? How can we change this?

Thanks.
Re: replica never takes leader role
What version of Solr? This is an ongoing area of improvement, and several fixes are very recent. Try searching the Solr JIRA for details.

Best,
Erick

On Tue, Jan 27, 2015 at 1:51 PM, Joshi, Shital shital.jo...@gs.com wrote:

Hello,

We have a SolrCloud cluster (5 shards, 2 replicas) on 10 boxes with three Zookeeper instances. We have noticed that when a leader node goes down, the replica never takes over as leader, the cloud becomes unusable, and we have to bounce the entire cloud for a replica to assume the leader role. Is this the default behavior? How can we change this?

Thanks.
Re: Replica as a leader
bq: Is there a way that Solr can recover without losing docs in this scenario?

Not that I know of currently. SolrCloud is designed _not_ to lose documents as long as all leaders are present, and when a leader goes down, assuming there's a replica handy, docs shouldn't be lost either. But taking down the leader, then starting an out-of-date replica and hoping that Solr has somehow magically cached all the intervening updates, is not a supported scenario. Perhaps SOLR-5468 will help here; I'm not entirely sure. This scenario seems out-of-band, though.

Best,
Erick

On Sun, May 18, 2014 at 3:12 AM, Anshum Gupta ans...@anshumgupta.net wrote:

SOLR-5468 (https://issues.apache.org/jira/browse/SOLR-5468) might be useful for you.

On Sun, May 18, 2014 at 1:54 AM, adfel70 adfe...@gmail.com wrote:

*One of the most important requirements in my system is not to lose docs and not to retrieve only part of the data at query time.* I expect the replica to wait until the real leader starts, or at least to sync the real leader with the docs indexed in the replica after starting and to sync the replica with the docs that were indexed to the leader. Is there a way that Solr can recover without losing docs in this scenario?

Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135614p4136729.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Anshum Gupta
http://www.anshumgupta.net
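[For context on SOLR-5468: as far as I know it added the min_rf update parameter, which makes Solr report the achieved replication factor for an update so the client can react. A sketch; the host, collection, and document are illustrative, and note that Solr only reports the factor, it does not reject the update:]

```shell
# Build an update request that asks Solr to report how many replicas
# acknowledged the update (the "rf" field in the JSON response).
UPDATE_URL="http://localhost:8983/solr/collection1/update?min_rf=2&commit=true"
echo "curl '$UPDATE_URL' -H 'Content-Type: application/json' -d '[{\"id\":\"doc1\"}]'"
# If the returned rf is less than min_rf, the client should retry or
# flag the document itself -- Solr does not roll the update back.
```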
Re: Replica as a leader
*One of the most important requirements in my system is not to lose docs and not to retrieve only part of the data at query time.* I expect the replica to wait until the real leader starts, or at least to sync the real leader with the docs indexed in the replica after starting and to sync the replica with the docs that were indexed to the leader. Is there a way that Solr can recover without losing docs in this scenario?

Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135614p4136729.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Replica as a leader
SOLR-5468 (https://issues.apache.org/jira/browse/SOLR-5468) might be useful for you.

On Sun, May 18, 2014 at 1:54 AM, adfel70 adfe...@gmail.com wrote:

*One of the most important requirements in my system is not to lose docs and not to retrieve only part of the data at query time.* I expect the replica to wait until the real leader starts, or at least to sync the real leader with the docs indexed in the replica after starting and to sync the replica with the docs that were indexed to the leader. Is there a way that Solr can recover without losing docs in this scenario?

Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135614p4136729.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Anshum Gupta
http://www.anshumgupta.net
Re: Replica as a leader
1. Indexing 100-200 docs per second.
2. Doing pkill -9 java on 2 replicas (not the leader) in shard 3 (while indexing).
3. Indexing for 10-20 minutes and doing a hard commit.
4. Doing pkill -9 java on the leader and then starting one replica in shard 3 (while indexing).

I think you're in uncharted territory. By only having the leader running, indexing docs to it, then killing it, there's no way for one of the restarted followers to know what docs were indexed. Eventually the follower will become the leader, and the docs are just lost. Updates are NOT stored on ZK, for instance. Why do you expect the machines to stay in "down" status? SolrCloud is doing the best it can. How do you expect this scenario to recover?

FWIW,
Erick

On Thu, May 8, 2014 at 8:00 AM, adfel70 adfe...@gmail.com wrote:

Solr Collection Info: Solr 4.8, 4 shards, 3 replicas per shard, 30-40 million docs per shard.

Process:
1. Indexing 100-200 docs per second.
2. Doing pkill -9 java on 2 replicas (not the leader) in shard 3 (while indexing).
3. Indexing for 10-20 minutes and doing a hard commit.
4. Doing pkill -9 java on the leader and then starting one replica in shard 3 (while indexing).
5. After 20 minutes, starting another replica in shard 3, while indexing (not the leader from step 1).

Results:
2. Only the leader is active in shard 3.
3. Thousands of docs were added to the leader in shard 3.
4. After starting the replica, its state was "down", and after 10 minutes it became the leader in the cluster state (and still "down"). No servers hosting shards for index and search requests.
5. After starting another replica, its state was "recovering" for 2-3 minutes and then it became "active" (not leader in the cluster state).
6. Index, commit and search requests are handled by the other replica (*active status, not leader!!!*).

Expected:
5. To stay in "down" status.
*6. Not to handle index, commit and search requests - no servers hosting shards!*

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135077.html
Sent from the Solr - User mailing list archive at Nabble.com.
Replica as a leader
/Solr Collection Info:/ Solr 4.8, 4 shards, 3 replicas per shard, 30-40 million docs per shard.

/Process:/
1. Indexing 100-200 docs per second.
2. Doing pkill -9 java on 2 replicas (not the leader) in shard 3 (while indexing).
3. Indexing for 10-20 minutes and doing a hard commit.
4. Doing pkill -9 java on the leader and then starting one replica in shard 3 (while indexing).
5. After 20 minutes, starting another replica in shard 3, while indexing (not the leader from step 1).
6. After 10 minutes, starting the replica that was the leader in step 1.

/Results:/
2. Only the leader is active in shard 3.
3. Thousands of docs were added to the leader in shard 3.
4. After starting the replica, its state was "down", and after 10 minutes it became the leader in the cluster state (and still "down"). No servers hosting shards for index and search requests.
*5. After starting another replica, its state was "recovering" for 2-3 minutes and then it became "active" (not leader in the cluster state). Index, commit and search requests are handled by the other replica (active status, not leader!!!). The search results do not include the docs that were indexed to the leader in step 3.*
6. Syncing with the active replica.

/Expected:/
*5. To stay in "down" status. Not to handle index, commit and search requests - no servers hosting shards!*
6. Become the leader.

Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135078.html
Sent from the Solr - User mailing list archive at Nabble.com.
Replica as a leader
Solr Collection Info: Solr 4.8, 4 shards, 3 replicas per shard, 30-40 million docs per shard.

Process:
1. Indexing 100-200 docs per second.
2. Doing pkill -9 java on 2 replicas (not the leader) in shard 3 (while indexing).
3. Indexing for 10-20 minutes and doing a hard commit.
4. Doing pkill -9 java on the leader and then starting one replica in shard 3 (while indexing).
5. After 20 minutes, starting another replica in shard 3, while indexing (not the leader from step 1).

Results:
2. Only the leader is active in shard 3.
3. Thousands of docs were added to the leader in shard 3.
4. After starting the replica, its state was "down", and after 10 minutes it became the leader in the cluster state (and still "down"). No servers hosting shards for index and search requests.
5. After starting another replica, its state was "recovering" for 2-3 minutes and then it became "active" (not leader in the cluster state).
6. Index, commit and search requests are handled by the other replica (*active status, not leader!!!*).

Expected:
5. To stay in "down" status.
*6. Not to handle index, commit and search requests - no servers hosting shards!*

Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Replica-as-a-leader-tp4135077.html
Sent from the Solr - User mailing list archive at Nabble.com.