Re: Solr 7.2.1 DELETEREPLICA automatically NRT replica appears
I'll check the logs when I'm back at my computer. Mostly errors about failing to find the core spamming the logs, if I recall correctly. The node never becomes active; it just spams the logs. The only way to remove it is to stop Solr on the node and delete the replica via the API on another node.
Re: Solr 7.2.1 DELETEREPLICA automatically NRT replica appears
This shouldn’t be happening. Did you see anything related in the logs? Does the new NRT replica ever become active? Is there a new core created, or do you just see the replica in the clusterstate?

Tomas
Solr 7.2.1 DELETEREPLICA automatically NRT replica appears
Hi,

I am running a cluster of TLOG and PULL replicas. When I call the DELETEREPLICA API to remove a replica, the replica is removed; however, a new NRT replica pops up in a down state in the cluster.

Any ideas why?

Greg
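For reference, the Collections API call in question looks like this (the host, collection, shard, and replica names below are placeholders, not taken from the thread):

```shell
# Delete a single replica via the Solr Collections API.
# "mycollection", "shard1", and "core_node3" are example names;
# "replica" takes the core_node name as shown in the clusterstate.
curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node3"
```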
Re: 700k entries in overseer q cannot addreplica or deletereplica
- stop all Solr nodes
- start ZK with the new jute.maxbuffer setting
- start a ZK client, like zkCli, with the changed jute.maxbuffer setting and check that you can read out the overseer queue
- clear the queue
- restart ZK with the normal settings
- slowly start Solr
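A rough sketch of that procedure as shell commands (the ensemble address and the buffer value are assumptions, and the exact start/stop commands depend on the install; on ZooKeeper 3.5+ `rmr` is replaced by `deleteall`):

```shell
# 1. Stop all Solr nodes first.

# 2. Restart ZooKeeper with an increased jute.maxbuffer on the server side.
#    The value below (~50 MB) is only an example; keep doubling it until
#    the queue becomes readable. SERVER_JVMFLAGS is picked up by zkEnv.sh.
export SERVER_JVMFLAGS="-Djute.maxbuffer=52428800"
zkServer.sh restart

# 3. Start a ZooKeeper client with the same increased buffer, check that
#    the queue can be read, then clear it.
export CLIENT_JVMFLAGS="-Djute.maxbuffer=52428800"
zkCli.sh -server zk1:2181 <<'EOF'
ls /overseer/queue
rmr /overseer/queue
EOF

# 4. Restart ZooKeeper with the normal settings, then start the Solr
#    nodes slowly, e.g. one every 5 minutes.
```

Note that `rmr` removes the /overseer/queue node itself along with its children; Solr should recreate the path when the Overseer comes back up.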
Re: 700k entries in overseer q cannot addreplica or deletereplica
I set jute.maxbuffer on the ZK hosts; should this be done on the Solr side as well? Mine is happening in a severely memory-constrained env as well.

Jeff Courtade
M: 240.507.6116
Re: 700k entries in overseer q cannot addreplica or deletereplica
We have Solr and ZK running in Docker containers. There is no more than one Solr/ZK node per host, but a Solr node and a ZK node can run on the same host. So Solr and ZK are spread out separately.

I have not seen this problem during normal processing, just when we recycle nodes or when we have nodes fail, which is pretty much always caused by being out of memory, which again is unfortunately a bit complex in Docker. When nodes come up they add quite a few tasks to the overseer queue, I assume one task for every core. We have about 2000 cores on each node. If nodes come up too fast the queue might grow to a few thousand entries. At maybe 1 entries it usually reaches the point of no return and Solr just adds more tasks than it is able to process. So it's best to pull the plug at that point, as you will not have to play with jute.maxbuffer to get Solr up again.

We are using Solr 6.3. There are some improvements in 6.6:
https://issues.apache.org/jira/browse/SOLR-10524
https://issues.apache.org/jira/browse/SOLR-10619
Re: 700k entries in overseer q cannot addreplica or deletereplica
Thanks very much. I will follow up when we try this.

I'm curious: in the env where this is happening to you, are the ZooKeeper servers residing on Solr nodes? Are the Solr nodes underpowered on RAM and/or CPU?

Jeff Courtade
M: 240.507.6116
Re: 700k entries in overseer q cannot addreplica or deletereplica
I'm always using a small Java program to delete the nodes directly. I assume you can also delete the whole node, but that is nothing I have tried myself.
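Lacking such a program, per-entry deletion can be sketched with zkCli.sh alone (the ensemble address and the output parsing are assumptions; zkCli's `ls` output format varies between versions, and with hundreds of thousands of entries this is very slow because every delete opens a fresh session, which is one reason a small program holding a single connection is preferable):

```shell
ZK=zk1:2181   # placeholder ensemble address
# Assumes `ls` prints the children as "[qn-001, qn-002, ...]" on its last
# line; strip the brackets and commas, then delete each child individually.
for n in $(zkCli.sh -server "$ZK" ls /overseer/queue 2>/dev/null | tail -n1 | tr -d '[],'); do
  zkCli.sh -server "$ZK" delete "/overseer/queue/$n"
done
```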
Re: 700k entries in overseer q cannot addreplica or deletereplica
So ...

Using zkCli.sh I have jute.maxbuffer set up, so I can list it now.

Can I

rmr /overseer/queue

or do I need to delete individual entries? Will

rmr /overseer/queue/*

work?

Jeff Courtade
M: 240.507.6116
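For what it's worth, zkCli's `rmr` (renamed `deleteall` in ZooKeeper 3.5+) deletes a znode and all of its children recursively, so the first form covers the whole queue in one command, while the wildcard form is not expanded by zkCli:

```shell
rmr /overseer/queue      # removes the queue node and every entry under it
rmr /overseer/queue/*    # "*" is not a glob in zkCli; this does not match entries
```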
Re: 700k entries in overseer q cannot addreplica or deletereplica
When Solr is stopped it has not caused a problem so far.

I also cleared the queue a few times while Solr was still running. That also didn't result in a real problem, but some replicas might not come up again. In those cases it helps to either restart the node with the replicas that are in state "down" or to remove the failed replica and then recreate it. But as said, clearing it while Solr is stopped has worked fine so far.
Re: 700k entries in overseer q cannot addreplica or deletereplica
How does the cluster react to the overseer queue entries disappearing?

Jeff Courtade
M: 240.507.6116
Re: 700k entries in overseer q cannot addreplica or deletereplica
Hi Jeff, we ran into that a few times already. We have lots of collections, and when nodes get started too fast the overseer queue grows faster than Solr can process it. At some point Solr tries to redo things like leader votes and adds new tasks to the list, which then gets longer and longer. Once it is too long you cannot read out the data anymore, but Solr is still adding tasks. If you have already reached that point you have to start ZooKeeper and the ZooKeeper client with an increased "jute.maxbuffer" value. I usually double it until I can read out the queue again. After that I delete all entries in the queue and then start the Solr nodes one by one, like every 5 minutes. regards, Hendrik On 22.08.2017 13:42, Jeff Courtade wrote: Hi, I have an issue with what seems to be a blocked-up /overseer/queue. There are 700k+ entries. SolrCloud 6.x. You cannot addreplica or deletereplica; the commands time out. A full stop and start of Solr and ZooKeeper does not clear it. Is it safe to use the ZooKeeper-supplied zkCli.sh to simply rmr the /overseer/queue? Jeff Courtade M: 240.507.6116
700k entries in overseer q cannot addreplica or deletereplica
Hi, I have an issue with what seems to be a blocked-up /overseer/queue. There are 700k+ entries. SolrCloud 6.x. You cannot addreplica or deletereplica; the commands time out. A full stop and start of Solr and ZooKeeper does not clear it. Is it safe to use the ZooKeeper-supplied zkCli.sh to simply rmr the /overseer/queue? Jeff Courtade M: 240.507.6116
Re: DELETEREPLICA command shouldn't delete the last replica of a shard
I raised a JIRA for this: SOLR-8257. - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/DELETEREPLICA-command-shouldn-t-delete-de-last-replica-of-a-shard-tp4239054p4239139.html Sent from the Solr - User mailing list archive at Nabble.com.
DELETEREPLICA command shouldn't delete the last replica of a shard
I don't know if this behaviour makes sense, but IMHO the last replica of a shard should be removed only by the DELETESHARD command. If you do not notice this behaviour, the result is the deletion of the whole shard and, as a consequence, the deletion of the shard in the clusterstate.json ... with all the pain associated. - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/DELETEREPLICA-command-shouldn-t-delete-de-last-replica-of-a-shard-tp4239054.html Sent from the Solr - User mailing list archive at Nabble.com.
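For reference, the two collection-API calls being contrasted in this thread look roughly like this. The host, collection, shard, and replica names below are made-up examples, not values from the thread.

```shell
# Placeholder Solr base URL, collection, shard and replica names.
SOLR="http://localhost:8983/solr"
# Removing a single replica -- the call under discussion, which silently
# removes the shard when it happens to be the last replica:
DEL_REPLICA="${SOLR}/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node2"
# Removing the shard explicitly, which is what the poster argues should
# be required for the last replica:
DEL_SHARD="${SOLR}/admin/collections?action=DELETESHARD&collection=mycoll&shard=shard1"
echo "$DEL_REPLICA"
echo "$DEL_SHARD"
# Issue either against a live cluster with:  curl "$DEL_REPLICA"
```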
Re: DELETEREPLICA command shouldn't delete the last replica of a shard
On Sun, Nov 8, 2015 at 11:20 PM, Erick Erickson <erickerick...@gmail.com> wrote: > I see your point. In the custom routing case where it _is_ reasonable > to delete all replicas in a shard, DELETESHARD does the trick. > > It's reasonable to raise a JIRA, I think, so we have a record of the > discussion/decision. > There is a related, though maybe not identical, JIRA: https://issues.apache.org/jira/browse/SOLR-5209 > > Erick > > On Sun, Nov 8, 2015 at 9:13 AM, Yago Riveiro <yago.rive...@gmail.com> > wrote: > > I don't know if this behaviour makes sense, but IMHO the last > replica of > > a shard should be removed only by the DELETESHARD command. > > > > If you do not notice this behaviour, the result is the deletion > of > > the whole shard and, as a consequence, the deletion of the shard in the > > clusterstate.json ... with all the pain associated. > > > > > > > > ----- > > Best regards > > -- > > View this message in context: > http://lucene.472066.n3.nabble.com/DELETEREPLICA-command-shouldn-t-delete-de-last-replica-of-a-shard-tp4239054.html > > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: DELETEREPLICA command shouldn't delete the last replica of a shard
I see your point. In the custom routing case where it _is_ reasonable to delete all replicas in a shard, DELETESHARD does the trick. It's reasonable to raise a JIRA, I think, so we have a record of the discussion/decision. Erick On Sun, Nov 8, 2015 at 9:13 AM, Yago Riveiro <yago.rive...@gmail.com> wrote: > I don't know if this behaviour makes sense, but IMHO the last replica of > a shard should be removed only by the DELETESHARD command. > > If you do not notice this behaviour, the result is the deletion of > the whole shard and, as a consequence, the deletion of the shard in the > clusterstate.json ... with all the pain associated. > > > > - > Best regards > -- > View this message in context: > http://lucene.472066.n3.nabble.com/DELETEREPLICA-command-shouldn-t-delete-de-last-replica-of-a-shard-tp4239054.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: DELETEREPLICA
Yeah, this all feeds back into the "ZK as Truth" mode that we've been talking about. The documentation is misleading because yes, if a node comes back up, then it will add itself to the cluster state. There are plans to change that behaviour once we've thought through more use cases and are ready to break back-compat. There is a new cluster-level property called legacyCloud which defaults to true, but if it is set to false then the behaviour documented in the reference guide will be used. On Wed, Sep 3, 2014 at 6:01 PM, Erick Erickson erickerick...@gmail.com wrote: I'm confused, wondering if it's a mismatch between the docs and the intent, or just a bug, or whether I'm just not understanding the point. The DELETEREPLICA docs say: Delete a replica from a given collection and shard. If the corresponding core is up and running the core is unloaded and the entry is removed from the clusterstate. If the node/core is down, the entry is taken off the clusterstate and if the core comes up later it is automatically unregistered. However, if I do the following: 1 create a follower on nodeX 2 shut down nodeX (at this point, the clusterstate indicates the follower is down) 3 issue a DELETEREPLICA for the follower (clusterstate entry for this follower is removed) 4 restart nodeX (clusterstate shows this node is back; it's visible in cloud view, gets sync'd etc.). Based on the docs, I didn't expect to see the node present in step 4; what am I missing? The core has docs (i.e. it's synched from the leader) etc. So this bit of the documentation is confusing me: If the node/core is down, the entry is taken off the clusterstate and if the core comes up later it is automatically unregistered. That doesn't square with what I'm seeing, so either the docs are wrong or I'm misunderstanding the intent. If the node _is_ up, then it's removed from the node and clusterstate and stays gone.
Personally, I don't particularly like the idea of queueing up the DELETEREPLICAs for later execution; it seems overly complex. Having the clusterstate info removed if the node is down seems very useful though. Thanks, Erick -- Regards, Shalin Shekhar Mangar.
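The legacyCloud property mentioned above is set through the CLUSTERPROP collections API. A minimal sketch follows; the host is a placeholder, not from the thread.

```shell
# Placeholder Solr base URL; adjust for your cluster.
SOLR="http://localhost:8983/solr"
# Build the CLUSTERPROP call that flips legacyCloud off:
CLUSTERPROP="${SOLR}/admin/collections?action=CLUSTERPROP&name=legacyCloud&val=false"
echo "$CLUSTERPROP"
# Apply it against a live cluster with:  curl "$CLUSTERPROP"
```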
Re: DELETEREPLICA
Thanks. But since the docs are incorrect as far as current behavior is concerned, I'll change the info on the CWiki to reflect the current state of affairs. I just did a brief test, creating a collection with legacyCloud set to false, and saw no change. Erick On Thu, Sep 4, 2014 at 12:06 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Yeah, this all feeds back into the "ZK as Truth" mode that we've been talking about. The documentation is misleading because yes, if a node comes back up, then it will add itself to the cluster state. There are plans to change that behaviour once we've thought through more use cases and are ready to break back-compat. There is a new cluster-level property called legacyCloud which defaults to true, but if it is set to false then the behaviour documented in the reference guide will be used. On Wed, Sep 3, 2014 at 6:01 PM, Erick Erickson erickerick...@gmail.com wrote: I'm confused, wondering if it's a mismatch between the docs and the intent, or just a bug, or whether I'm just not understanding the point. The DELETEREPLICA docs say: Delete a replica from a given collection and shard. If the corresponding core is up and running the core is unloaded and the entry is removed from the clusterstate. If the node/core is down, the entry is taken off the clusterstate and if the core comes up later it is automatically unregistered. However, if I do the following: 1 create a follower on nodeX 2 shut down nodeX (at this point, the clusterstate indicates the follower is down) 3 issue a DELETEREPLICA for the follower (clusterstate entry for this follower is removed) 4 restart nodeX (clusterstate shows this node is back; it's visible in cloud view, gets sync'd etc.). Based on the docs, I didn't expect to see the node present in step 4; what am I missing? The core has docs (i.e. it's synched from the leader) etc.
So this bit of the documentation is confusing me: If the node/core is down, the entry is taken off the clusterstate and if the core comes up later it is automatically unregistered. That doesn't square with what I'm seeing, so either the docs are wrong or I'm misunderstanding the intent. If the node _is_ up, then it's removed from the node and clusterstate and stays gone. Personally, I don't particularly like the idea of queueing up the DELETEREPLICAs for later execution; it seems overly complex. Having the clusterstate info removed if the node is down seems very useful though. Thanks, Erick -- Regards, Shalin Shekhar Mangar.
DELETEREPLICA
I'm confused, wondering if it's a mismatch between the docs and the intent, or just a bug, or whether I'm just not understanding the point. The DELETEREPLICA docs say: Delete a replica from a given collection and shard. If the corresponding core is up and running the core is unloaded and the entry is removed from the clusterstate. If the node/core is down, the entry is taken off the clusterstate and if the core comes up later it is automatically unregistered. However, if I do the following: 1 create a follower on nodeX 2 shut down nodeX (at this point, the clusterstate indicates the follower is down) 3 issue a DELETEREPLICA for the follower (clusterstate entry for this follower is removed) 4 restart nodeX (clusterstate shows this node is back; it's visible in cloud view, gets sync'd etc.). Based on the docs, I didn't expect to see the node present in step 4; what am I missing? The core has docs (i.e. it's synched from the leader) etc. So this bit of the documentation is confusing me: If the node/core is down, the entry is taken off the clusterstate and if the core comes up later it is automatically unregistered. That doesn't square with what I'm seeing, so either the docs are wrong or I'm misunderstanding the intent. If the node _is_ up, then it's removed from the node and clusterstate and stays gone. Personally, I don't particularly like the idea of queueing up the DELETEREPLICAs for later execution; it seems overly complex. Having the clusterstate info removed if the node is down seems very useful though. Thanks, Erick
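The four reproduction steps above could be scripted roughly as follows. Every name here (host, collection, node, replica) is a placeholder for illustration, not taken from the thread.

```shell
# Placeholder names for sketching the reproduction.
SOLR="http://localhost:8983/solr"
COLL="test"; SHARD="shard1"
NODE="nodeX:8983_solr"; REPLICA="core_node3"
# 1. create a follower on nodeX
ADD_URL="${SOLR}/admin/collections?action=ADDREPLICA&collection=${COLL}&shard=${SHARD}&node=${NODE}"
echo "$ADD_URL"
# 2. shut down nodeX (the clusterstate now shows the follower as down)
# 3. delete the down replica
DEL_URL="${SOLR}/admin/collections?action=DELETEREPLICA&collection=${COLL}&shard=${SHARD}&replica=${REPLICA}"
echo "$DEL_URL"
# 4. restart nodeX -- per the docs the core should unregister itself,
#    but the observed behaviour is that it rejoins the clusterstate.
# (Prefix the echoed URLs with: curl "<url>"  to actually run them.)
```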