Re: Replicates not recovering after rolling restart

2017-09-22 Thread Erick Erickson
Gah! Don't you hate it when you spend days on something like this?

Slight clarification. _version_ is used for optimistic locking, not
replication. Let's say you have two clients updating the same document
and sending it to Solr at the same time. The _version_ field is filled
out automagically and one of the updates will be rejected. Otherwise
there'd be no good way to fail a document due to this kind of thing.
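
A minimal sketch of what that looks like from the client side, assuming Solr's documented rule that a positive _version_ sent with an update must match the stored value (a mismatch is rejected with an HTTP 409 conflict). The collection name journals_stage comes from this thread; the host localhost, the document id and the helper name build_versioned_doc are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: emit a one-document JSON update that pins _version_.
# A real update would also carry the document's other fields.
build_versioned_doc() {
    printf '[{"id":"%s","_version_":%s}]\n' "$1" "$2"
}

# Dry run: only print the payload.
build_versioned_doc "doc-1" 1578578283947098112

# To actually send it (assumed host and port; needs a running Solr):
# build_versioned_doc "doc-1" 1578578283947098112 |
#   curl -s -H 'Content-Type: application/json' --data-binary @- \
#     "http://localhost:8983/solr/journals_stage/update?commit=true"
```

If two clients both read the same version and both send an update like this, the second one to arrive no longer matches the stored _version_ and is the one that gets rejected.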

Thanks for letting us know what the problem really was.

Best,
Erick


Re: Replicates not recovering after rolling restart

2017-09-22 Thread Bill Oconnor

Thanks everyone for the responses.


I believe I have found the problem.


The type of _version_ is incorrect in our schema. This is a required field 
that is primarily used by Solr.


Our schema has typed it as type=int instead of type=long.


I believe that this number is used by the replication process to figure out 
what needs to be synced on an individual replicate. In our case Solr puts the 
value in during indexing. It appears that Solr has chosen a number that cannot 
be represented by "int". As the replicates query the leader to determine if a 
sync is necessary, the leader throws an error as it tries to format the 
response with the large _version_. This process continues until the 
replicates give up.


I finally verified this by doing a simple query, _version_:*, which throws the 
same error but gives more helpful info: "re-index your documents".
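
A quick way to check for this kind of mismatch before restarting anything is to look at how the schema declares the field. The sketch below is only an illustration: the helper name version_field_type is hypothetical, the sample field line is made up, and the real file would be the collection's schema.xml or managed-schema.

```shell
#!/bin/sh
# Hypothetical helper: print the type attribute of the _version_ field
# declared in a schema file.
version_field_type() {
    grep '<field[^>]*name="_version_"' "$1" |
        sed -n 's/.*type="\([^"]*\)".*/\1/p'
}

# Throwaway file standing in for the real schema (made-up field line).
schema=$(mktemp)
echo '<field name="_version_" type="int" indexed="true" stored="true"/>' > "$schema"

t=$(version_field_type "$schema")
if [ "$t" = "long" ]; then
    echo "_version_ is typed long"
else
    echo "_version_ is typed '$t'; expected long"
fi
rm -f "$schema"
```

Changing the type in the schema does not rewrite documents that are already indexed, which is presumably why the error message asks for a re-index.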


Thanks.







Re: Replicates not recovering after rolling restart

2017-09-22 Thread Rick Leir

Wunder, Erick

$ dc
16o
1578578283947098112p
15E83C95E8D00000

That is an interesting number. Is it, as a guess, machine instructions 
or an address pointer? It does not look like UTF-8 or ASCII. Machine 
code looks promising:



Disassembly:

0:  15 e8 3c 95 e8    adc    eax,0xe8953ce8
5:  d0 00             rol    BYTE PTR [rax],1


/ADC/ dest,src: Modifies flags AF CF OF SF PF ZF. Sums two binary operands, 
placing the result in the destination.


*ROL - Rotate Left*

Registers: the /64-bit/ extension of /eax/ is called /rax/.

Is that code possibly in the JVM executable? Or a random memory page.
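
Another reading worth checking, offered as an assumption rather than something anyone in this thread states: Solr is understood to build new _version_ values from the current time in milliseconds shifted left by 20 bits, with the low bits left for a counter. The sketch below just splits the number along that boundary; the helper names are hypothetical.

```shell
#!/bin/sh
# Assumes a shell with 64-bit arithmetic (bash, dash, ksh on 64-bit systems).

# Upper bits: the value shifted right by 20.
version_millis() {
    echo $(( $1 >> 20 ))
}

# Lower 20 bits (1048575 is 2^20 - 1).
version_counter() {
    echo $(( $1 & 1048575 ))
}

value=1578578283947098112
echo "millis:  $(version_millis "$value")"
echo "counter: $(version_counter "$value")"
```

For the value in the stack trace this prints 1505449565837 and 0. Read as Unix milliseconds, that falls on 15 September 2017 (UTC), a few days before this thread started, which fits a value assigned at indexing time better than it fits machine code.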

cheers -- Rick


Re: Replicates not recovering after rolling restart

2017-09-21 Thread Bill Oconnor

  1.  We are moving from 4.X to 6.6.
  2.  Changed the schema - adding the version etc., nothing major.
  3.  Full re-index of documents into the cluster - so this is not a migration.
  4.  Changed the JVM memory parameter from 12GB to 16GB and did a restart.
  5.  Replicates go into recovery which fails to complete after many hours. 
They still respond to queries but the /update POST from the replicates fails 
with the 500 server error and a stack trace because of the number format 
failure.


My other cluster  does not reuse any nodes. The restart went as expected with 
the JVM change. Al




Re: Replicates not recovering after rolling restart

2017-09-21 Thread Erick Erickson
Hmmm, I didn't ask what version you're upgrading _from_. 5 years ago
would be Solr 4. Are you replacing Solr 5 or 4? I'm guessing 5, but
want to check unlikely possibilities.

Next question: I'm assuming all your nodes have been upgraded to Solr 6, right?

Best,
Erick



Re: Replicates not recovering after rolling restart

2017-09-20 Thread Bill Oconnor
I have no clue where that number comes from. It does not seem to be in the 
actual post to the leader as seen in my tcpdump. It is a mystery.





Re: Replicates not recovering after rolling restart

2017-09-20 Thread Walter Underwood

> On Sep 20, 2017, at 6:15 PM, Bill Oconnor  wrote:
> 
> I restart using the standard "sudo service solr start/stop"

You might look into what that actually does.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Replicates not recovering after rolling restart

2017-09-20 Thread Bill Oconnor
Thanks everyone for the response.


I do not think we changed anything other than the JVM memory size.


I did leave out one piece of info: one of the hosts is a replicate in another 
shard.


collection1 -> shard1 -> *h1, h2, h3, h4   (where the star marks the leader)

collection2 -> shard1 -> *h5, h3


When I restart, *h1 works fine; h2, h3 and h4 go into recovery but still 
respond to requests. *h1 starts getting the posts from the recovering servers 
and responds with the 500 Server Error until the servers quit.


Collection2 with h3 is active and fine even though h3 is recovering in 
collection1.


This happened before and I resolved it by deleting and then creating a new 
collection.


I restart using the standard "sudo service solr start/stop"


I have to say I am not comfortable with having multiple shards shared on 
the same host. The production servers will not be configured this way, but 
these servers are for development.



Re: Replicates not recovering after rolling restart

2017-09-20 Thread Walter Underwood
1578578283947098112 needs 61 bits. Is it being parsed into a 32 bit target? 

That doesn’t explain where it came from, of course.
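
The bit count is easy to confirm. A minimal sketch, assuming a shell with 64-bit arithmetic; the helper name bit_length is made up for this example.

```shell
#!/bin/sh
# Count the bits needed to represent a non-negative integer.
bit_length() {
    n=$1
    bits=0
    while [ "$n" -gt 0 ]; do
        n=$((n >> 1))
        bits=$((bits + 1))
    done
    echo "$bits"
}

value=1578578283947098112
int_max=2147483647   # largest value of a signed 32-bit int

echo "bits needed: $(bit_length "$value")"
if [ "$value" -gt "$int_max" ]; then
    echo "does not fit in a signed 32-bit int"
fi
```

The value lies between 2^60 and 2^61, so the loop reports 61 bits, far beyond what Integer.parseInt in the stack trace can accept.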

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Replicates not recovering after rolling restart

2017-09-20 Thread Erick Erickson
The numberformatexception is...odd. Clearly that's too big a number
for an integer, did anything in the underlying schema change?

Best,
Erick

>> On Sep 20, 2017, at 1:42 PM, Bill Oconnor  wrote:
>>
>> Hello,
>>
>>
>> Background:
>>
>>
>> We have been successfully using Solr for over 5 years and we recently made 
>> the decision to move into SolrCloud. For the most part that has been easy, 
>> but we have repeated problems with our rolling restart where servers remain 
>> functional but stay in Recovery until they stop trying. We restarted because 
>> we increased the memory from 12GB to 16GB on the JVM.
>>
>>
>> Does anyone have any insight as to what is going on here?
>>
>> Is there a special procedure I should use for starting and stopping a host?
>>
>> Is it ok to do a rolling restart on all the nodes in a shard?
>>
>>
>> Any insight would be appreciated.
>>
>>
>> Configuration:
>>
>>
>> We have a group of servers with multiple collections. Each collection 
>> consists of one shard and multiple replicates. We are running the latest 
>> stable version of SolrCloud 6.6 on Ubuntu LTS and Oracle Corporation Java 
>> HotSpot(TM) 64-Bit Server VM 1.8.0_66 25.66-b17
>>
>>
>> (collection)  (shard)  (replicates)
>>
>> journals_stage   ->  shard1  ->  solr-220 (leader) , solr-223, solr-221, 
>> solr-222 (replicates)
>>
>>
>> Problem:
>>
>>
>> Restarting the system puts the replicates in a recovery state they never 
>> exit from. They eventually give up after 500 tries.  If I go to the 
>> individual replicates and execute a query the data is still available.
>>
>>
>> Using tcpdump I find the replicates sending this request to the leader (the 
>> leader appears to be active).
>>
>>
>> The exchange goes like this:
>>
>>
>> solr-220 is the leader.
>>
>> Solr-221 to Solr-220
>>
>>
>> 10:18:42.426823 IP solr-221:54341 > solr-220:8983:
>>
>>
>> POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
>> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
>> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
>> Content-Length: 108
>> Host: solr-220:8983
>> Connection: Keep-Alive
>>
>>
>> commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2
>>
>>
>> Solr-220 back to Solr-221
>>
>>
>> IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235, 
>> options [nop,nop,TS val 85813 ecr 858107069], length 5151
>> ..HTTP/1.1 500 Server Error
>> Content-Type: application/octet-stream
>> Content-Length: 5060
>>
>>
>> .responseHeader..%QTimeC.%error..#msg?.For input string: "1578578283947098112".%trace?.: For input string: "1578578283947098112"
>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> at java.lang.Integer.parseInt(Integer.java:583)
>> at java.lang.Integer.parseInt(Integer.java:615)
>> at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
>> at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
>> at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
>> at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
>> at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
>> at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
>> at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
>> at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
>> at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
>>
>>
>


Re: Replicates not recovering after rolling restart

2017-09-20 Thread Walter Underwood
Rolling restarts work fine for us. I often include installing new configs with 
that. Here is our script. Pass it any hostname in the cluster. I use the load 
balancer name. You’ll need to change the domain and the install directory of 
course.

#!/bin/bash

cluster=$1

# Ask any node for the cluster status and list the live nodes.
hosts=$(curl -s "http://${cluster}:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" | jq -r '.cluster.live_nodes[]' | sort)

for host in $hosts
do
    host="${host}.cloud.cheggnet.com"
    echo restarting Solr on $host
    # The backquoted hostname is expanded on the remote host.
    ssh $host 'cd /apps/solr6 ; sudo -u bin bin/solr stop; sudo -u bin bin/solr start -cloud -h `hostname`'
done
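
One possible refinement, sketched here as an assumption and not part of the script above: pause between nodes until every replica reports "active" again, so a node is never restarted while others are still recovering. The helper names count_inactive_replicas and wait_for_active are hypothetical, as is the 10-second poll interval; the JSON path follows the layout of the CLUSTERSTATUS response.

```shell
#!/bin/bash
# Hypothetical helper: read CLUSTERSTATUS JSON on stdin and print how many
# replicas are in any state other than "active".
count_inactive_replicas() {
    jq '[.cluster.collections[].shards[].replicas[]
         | select(.state != "active")] | length'
}

# Hypothetical helper: poll the cluster until every replica is active.
# Note: it keeps polling if the status request itself fails.
wait_for_active() {
    local cluster=$1 inactive
    while true; do
        inactive=$(curl -s "http://${cluster}:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json" |
            count_inactive_replicas)
        [ "$inactive" = "0" ] && break
        echo "waiting: $inactive replica(s) not active"
        sleep 10
    done
}
```

Calling wait_for_active "$cluster" at the end of each loop iteration would be the natural place to use it.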


Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 20, 2017, at 1:42 PM, Bill Oconnor  wrote:
> 
> Hello,
> 
> 
> Background:
> 
> 
> We have been successfully using Solr for over 5 years and we recently made
> the decision to move into SolrCloud. For the most part that has been easy, but
> we have repeated problems with our rolling restarts, where servers remain
> functional but stay in Recovery until they stop trying. We restarted because
> we increased the JVM memory from 12GB to 16GB.
> 
> 
> Does anyone have any insight as to what is going on here?
> 
> Is there a special procedure I should use for starting and stopping a host?
> 
> Is it ok to do a rolling restart on all the nodes in a shard?
> 
> 
> Any insight would be appreciated.
> 
> 
> Configuration:
> 
> 
> We have a group of servers with multiple collections. Each collection consists
> of one shard and multiple replicates. We are running the latest stable
> version of SolrCloud 6.6 on Ubuntu LTS and Oracle Corporation Java
> HotSpot(TM) 64-Bit Server VM 1.8.0_66 25.66-b17
> 
> 
> (collection)  (shard)  (replicates)
> 
> journals_stage   ->  shard1  ->  solr-220 (leader) , solr-223, solr-221, 
> solr-222 (replicates)
> 
> 
> Problem:
> 
> 
> Restarting the system puts the replicates in a recovery state they never exit 
> from. They eventually give up after 500 tries.  If I go to the individual 
> replicates and execute a query the data is still available.
> 
> 
> Using tcpdump I find the replicates sending this request to the leader (the 
> leader appears to be active).
> 
> 
> The exchange goes like this:
> 
> 
> solr-220 is the leader.
> 
> Solr-221 to Solr-220
> 
> 
> 10:18:42.426823 IP solr-221:54341 > solr-220:8983:
> 
> 
> POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
> Content-Type: application/x-www-form-urlencoded; charset=UTF-8
> User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
> Content-Length: 108
> Host: solr-220:8983
> Connection: Keep-Alive
> 
> 
> commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2
> 
> 
> Solr-220 back to Solr-221
> 
> 
> IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235, options [nop,nop,TS val 85813 ecr 858107069], length 5151
> ..HTTP/1.1 500 Server Error
> Content-Type: application/octet-stream
> Content-Length: 5060
> 
> 
> .responseHeader..%QTimeC.%error..#msg?.For input string: "1578578283947098112".%trace?.: For input string: "1578578283947098112"
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:583)
> at java.lang.Integer.parseInt(Integer.java:615)
> at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
> at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
> at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
> at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
> at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
> at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
> at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
> at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
> at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
> 
> 



Replicates not recovering after rolling restart

2017-09-20 Thread Bill Oconnor
Hello,


Background:


We have been successfully using Solr for over 5 years and we recently made the
decision to move into SolrCloud. For the most part that has been easy, but we
have repeated problems with our rolling restarts, where servers remain functional
but stay in Recovery until they stop trying. We restarted because we increased
the JVM memory from 12GB to 16GB.


Does anyone have any insight as to what is going on here?

Is there a special procedure I should use for starting and stopping a host?

Is it ok to do a rolling restart on all the nodes in a shard?


Any insight would be appreciated.


Configuration:


We have a group of servers with multiple collections. Each collection consists
of one shard and multiple replicates. We are running the latest stable version
of SolrCloud 6.6 on Ubuntu LTS and Oracle Corporation Java HotSpot(TM) 64-Bit
Server VM 1.8.0_66 25.66-b17


(collection)  (shard)  (replicates)

journals_stage   ->  shard1  ->  solr-220 (leader) , solr-223, solr-221, 
solr-222 (replicates)


Problem:


Restarting the system puts the replicates in a recovery state they never exit 
from. They eventually give up after 500 tries.  If I go to the individual 
replicates and execute a query the data is still available.


Using tcpdump I find the replicates sending this request to the leader (the 
leader appears to be active).


The exchange goes like this:


solr-220 is the leader.

Solr-221 to Solr-220


10:18:42.426823 IP solr-221:54341 > solr-220:8983:


POST /solr/journals_stage_shard1_replica1/update HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 108
Host: solr-220:8983
Connection: Keep-Alive


commit_end_point=true&openSearcher=false&commit=true&softCommit=false&waitSearcher=true&wt=javabin&version=2


Solr-220 back to Solr-221


IP solr-220:8983 > solr-221:54341: Flags [P.], seq 1:5152, ack 385, win 235, options [nop,nop,TS val 85813 ecr 858107069], length 5151
..HTTP/1.1 500 Server Error
Content-Type: application/octet-stream
Content-Length: 5060


.responseHeader..%QTimeC.%error..#msg?.For input string: "1578578283947098112".%trace?.: For input string: "1578578283947098112"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:583)
at java.lang.Integer.parseInt(Integer.java:615)
at org.apache.lucene.queries.function.docvalues.IntDocValues.getRangeScorer(IntDocValues.java:89)
at org.apache.solr.search.function.ValueSourceRangeFilter$1.iterator(ValueSourceRangeFilter.java:83)
at org.apache.solr.search.SolrConstantScoreQuery$ConstantWeight.scorer(SolrConstantScoreQuery.java:100)
at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:126)
at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:400)
at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:381)
at org.apache.solr.update.DeleteByQueryWrapper$1.scorer(DeleteByQueryWrapper.java:90)
at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:709)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:267)
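

For what it is worth, the string in the error is far outside the range that Integer.parseInt accepts (a Java int tops out at 2147483647), which would explain why the parse fails. A minimal bash sketch of that range check; fits_in_int32 is a hypothetical helper name used only for illustration, not anything from Solr:

```shell
#!/bin/bash

# Range of a signed 32-bit Java int.
INT_MIN=-2147483648
INT_MAX=2147483647

# Succeeds (exit status 0) when the decimal value fits in a signed 32-bit int.
# Assumes the value itself fits in bash's 64-bit signed arithmetic.
fits_in_int32() {
    local value=$1
    (( value >= INT_MIN && value <= INT_MAX ))
}

# The value from the 500 response above.
if fits_in_int32 1578578283947098112; then
    echo "fits in int"
else
    echo "too large for int"
fi
```

The value does fit in a 64-bit long, so whatever is handing this number to an int parser is expecting a much smaller value than the one it receives.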