Re: Solr 7.2.1 DELETEREPLICA automatically NRT replica appears

2018-03-07 Thread Greg Roodt
I'll check the logs when I'm back at my computer. Mostly errors about
failing to find the core spamming the logs, if I recall correctly.

The node never becomes active; it just spams the logs. The only way to
remove it is to stop Solr on that node and delete the replica via the API
on another node.
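
(For illustration, that workaround in SolrJ form; a rough sketch assuming
a 7.x client, with the collection, shard, replica and ZooKeeper names
invented:)

    // Hedged sketch: with Solr stopped on the bad node, issue
    // DELETEREPLICA against the cluster from anywhere. Names are examples.
    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class DeleteStuckReplica {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          // "core_node5" stands for whatever replica name the clusterstate shows
          CollectionAdminRequest.deleteReplica("myCollection", "shard1", "core_node5")
              .process(client);
        }
      }
    }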


Re: Solr 7.2.1 DELETEREPLICA automatically NRT replica appears

2018-03-07 Thread Tomas Fernandez Lobbe
This shouldn’t be happening. Did you see anything related in the logs? Does the
new NRT replica ever become active? Is there a new core created or do you just
see the replica in the clusterstate?

Tomas 

Sent from my iPhone


Solr 7.2.1 DELETEREPLICA automatically NRT replica appears

2018-03-07 Thread Greg Roodt
Hi

I am running a cluster of TLOG and PULL replicas. When I call the
DELETEREPLICA API to remove a replica, the replica is removed; however, a
new NRT replica pops up in a down state in the cluster.

Any ideas why?

Greg


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Hendrik Haddorp

- stop all solr nodes
- start zk with the new jute.maxbuffer setting
- start a zk client, like zkCli, with the changed jute.maxbuffer setting
  and check that you can read out the overseer queue
- clear the queue
- restart zk with the normal settings
- slowly start solr
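
For illustration, a rough sketch of the read-out check from step 3 in
plain ZooKeeper client code (the connect string, timeout and buffer size
are invented; the same value can also be passed as -Djute.maxbuffer on
the JVM command line):

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class ReadOverseerQueue {
      public static void main(String[] args) throws Exception {
        // jute.maxbuffer is read when the client classes initialize, so it
        // must be set before the first ZooKeeper call; 8 MB is an example.
        System.setProperty("jute.maxbuffer", String.valueOf(8 * 1024 * 1024));

        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
        try {
          // If this succeeds, the queue can be cleared with the same client.
          List<String> entries = zk.getChildren("/overseer/queue", false);
          System.out.println("overseer queue entries: " + entries.size());
        } finally {
          zk.close();
        }
      }
    }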


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Jeff Courtade
I set jute.maxbuffer on the ZooKeeper hosts; should this be done on the
Solr side as well?

Mine is happening in a severely memory-constrained environment as well.

Jeff Courtade
M: 240.507.6116


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Hendrik Haddorp
We have Solr and ZK running in Docker containers. There is no more than
one Solr/ZK node per host, but a Solr node and a ZK node can run on the
same host. So Solr and ZK are spread out separately.

I have not seen this problem during normal processing, only when we
recycle nodes or when we have nodes fail, which is pretty much always
caused by being out of memory, which again is unfortunately a bit complex
in Docker. When nodes come up they add quite a few tasks to the overseer
queue. I assume one task for every core. We have about 2000 cores on each
node. If nodes come up too fast the queue might grow to a few thousand
entries. At some point it usually reaches the point of no return and Solr
just adds more tasks than it is able to process. So it's best to pull the
plug at that point, as you will not have to play with jute.maxbuffer to
get Solr up again.

We are using Solr 6.3. There are some improvements in 6.6:
https://issues.apache.org/jira/browse/SOLR-10524
https://issues.apache.org/jira/browse/SOLR-10619


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Jeff Courtade
Thanks very much.

I will follow up when we try this.

I'm curious: in the env where this is happening to you, are the ZooKeeper
servers residing on Solr nodes? Are the Solr nodes underpowered in RAM
and/or CPU?

Jeff Courtade
M: 240.507.6116


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Hendrik Haddorp
I'm always using a small Java program to delete the nodes directly. I
assume you can also delete the whole node, but that is nothing I have
tried myself.
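
A minimal sketch of what such a program might look like (assumptions:
ZooKeeper's plain Java client, an invented connect string and timeout).
It deletes the queue entries one by one rather than the /overseer/queue
node itself; removing the whole node would need a recursive delete
instead:

    import org.apache.zookeeper.ZooKeeper;

    public class DeleteOverseerQueueEntries {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
        try {
          for (String child : zk.getChildren("/overseer/queue", false)) {
            // -1 means delete regardless of the znode's version
            zk.delete("/overseer/queue/" + child, -1);
          }
        } finally {
          zk.close();
        }
      }
    }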


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Jeff Courtade
So ...

Using zkCli.sh with the jute.maxbuffer setup, I can list it now.

Can I

  rmr /overseer/queue

or do I need to delete individual entries? Will

  rmr /overseer/queue/*

work?




Jeff Courtade
M: 240.507.6116


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Hendrik Haddorp

When Solr is stopped it did not cause a problem so far.

I also cleared the queue a few times while Solr was still running. That
also didn't result in a real problem, but some replicas might not come up
again. In those cases it helps to either restart the node with the
replicas that are in state "down" or to remove the failed replica and
then recreate it. But as said, clearing it when Solr is stopped has worked
fine so far.
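
For the "remove and recreate" option, a rough SolrJ sketch (client
construction varies by SolrJ version; the collection, shard, replica and
node names are invented):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class RecreateFailedReplica {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181").build()) {
          // drop the replica that is stuck in state "down" ...
          CollectionAdminRequest.deleteReplica("myCollection", "shard1", "core_node7")
              .process(client);
          // ... and recreate it on the same (or another) node
          CollectionAdminRequest.addReplicaToShard("myCollection", "shard1")
              .setNode("host1:8983_solr")
              .process(client);
        }
      }
    }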


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Jeff Courtade
How does the cluster react to the overseer q entries disappearing?



Jeff Courtade
M: 240.507.6116


Re: 700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Hendrik Haddorp

Hi Jeff,

we ran into that a few times already. We have lots of collections, and
when nodes get started too fast the overseer queue grows faster than
Solr can process it. At some point Solr tries to redo things like leader
votes and adds new tasks to the list, which then gets longer and longer.
Once it is too long you cannot read out the data anymore, but Solr is
still adding tasks. In case you have already reached that point, you have
to start ZooKeeper and the ZooKeeper client with an increased
"jute.maxbuffer" value. I usually double it until I can read out the
queue again. After that I delete all entries in the queue and then start
the Solr nodes one by one, like every 5 minutes.


regards,
Hendrik


700k entries in overseer q cannot addreplica or deletereplica

2017-08-22 Thread Jeff Courtade
Hi,

I have an issue with what seems to be a blocked-up /overseer/queue

There are 700k+ entries.

Solr cloud 6.x

You cannot ADDREPLICA or DELETEREPLICA; the commands time out.

A full stop and start of Solr and ZooKeeper does not clear it.

Is it safe to use the ZooKeeper-supplied zkCli.sh to simply rmr the
/overseer/queue ?


Jeff Courtade
M: 240.507.6116


Re: DELETEREPLICA command shouldn't delete the last replica of a shard

2015-11-09 Thread Yago Riveiro
I raised a JIRA with this, SOLR-8257


DELETEREPLICA command shouldn't delete the last replica of a shard

2015-11-08 Thread Yago Riveiro
I don't know if this behaviour makes sense, but IMHO the last replica of
a shard should be removed only by the DELETESHARD command ...

If you don't notice this behaviour, the result is the deletion of the
whole shard and, as a consequence, the removal of the shard from
clusterstate.json ... with all the pain associated.


Re: DELETEREPLICA command shouldn't delete the last replica of a shard

2015-11-08 Thread Ishan Chattopadhyaya
On Sun, Nov 8, 2015 at 11:20 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> I see your point. In the custom routing case where it _is_ reasonable
> to delete all replicas in a shard, DELETESHARD does the trick.
>
> It's reasonable to raise a JIRA I think so we have a record of the
> discussion/decision.
>

There is a related, but maybe not exactly the same, JIRA:
https://issues.apache.org/jira/browse/SOLR-5209


Re: DELETEREPLICA command shouldn't delete the last replica of a shard

2015-11-08 Thread Erick Erickson
I see your point. In the custom routing case where it _is_ reasonable
to delete all replicas in a shard, DELETESHARD does the trick.

It's reasonable to raise a JIRA I think so we have a record of the
discussion/decision.

Erick
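
For illustration, dropping a whole shard via SolrJ would look roughly
like this (a hedged sketch; client construction varies by version and all
names are invented; note DELETESHARD only applies to inactive shards or
collections using the implicit router):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class DropWholeShard {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181").build()) {
          // removes the shard and every replica in it in one call
          CollectionAdminRequest.deleteShard("myCollection", "shard1")
              .process(client);
        }
      }
    }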


Re: DELETEREPLICA

2014-09-04 Thread Shalin Shekhar Mangar
Yeah, this all feeds back into the "ZK as Truth" mode that we've been
talking about. The documentation is misleading because yes, if a node
comes back up, then it will add itself to the cluster state. There are
plans to change that behaviour once we've thought through more use-cases
and are ready to break back-compat. There is a new cluster-level property
called legacyCloud, which defaults to true; if it is set to false, then
the behaviour documented in the reference guide will be used.
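
Something like this should toggle it with a later SolrJ client (a hedged
sketch; the 2014-era API differed, and the ZooKeeper address is
invented):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class SetLegacyCloudFalse {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181").build()) {
          // CLUSTERPROP call: legacyCloud defaults to "true"; "false"
          // switches to the behaviour documented in the reference guide
          CollectionAdminRequest.setClusterProperty("legacyCloud", "false")
              .process(client);
        }
      }
    }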


-- 
Regards,
Shalin Shekhar Mangar.


Re: DELETEREPLICA

2014-09-04 Thread Erick Erickson
Thanks.

But since the docs are incorrect as far as current behavior is concerned,
I'll change the info on the CWiki to reflect the current state of affairs.

I just did a brief test, creating a collection with legacyCloud set to
false, and saw no change.

Erick


DELETEREPLICA

2014-09-03 Thread Erick Erickson
I'm confused, wondering if it's a mismatch between the docs and the
intent or just a bug or whether I'm just not understanding the point:

The DELETEREPLICA docs say:

Delete a replica from a given collection and shard. If the
corresponding core is up and running the core is unloaded and the
entry is removed from the clusterstate. If the node/core is down, the
entry is taken off the clusterstate and if the core comes up later it
is automatically unregistered.

However, if I do the following:
1. create a follower on nodeX
2. shut down nodeX (at this point, the clusterstate indicates the
follower is down)
3. issue a DELETEREPLICA for the follower (the clusterstate entry for this
follower is removed)
4. restart nodeX (the clusterstate shows this node is back; it's visible
in the cloud view, gets sync'd, etc.).

Based on the docs, I didn't expect to see the node present in step 4;
what am I missing?

The core has docs (i.e. it's synched from the leader) etc. So this bit
of the documentation is confusing me: If the node/core is down, the
entry is taken off the clusterstate and if the core comes up later it
is automatically unregistered.

That doesn't square with what I'm seeing so either the docs are wrong
or I'm misunderstanding the intent.

If the node _is_ up, then it's removed from the node and clusterstate
and stays gone.

Personally, I don't particularly like the idea of queueing up
DELETEREPLICAs for later execution; it seems overly complex. Having the
clusterstate info removed if the node is down seems very useful, though.

Thanks,
Erick