Re: PeerSync Recovery fails, starting Replication Recovery

2014-01-09 Thread Anca Kopetz

Hi,

We tried to understand why we get a Connection reset exception on the leader 
when it tries to foward the documents to one of its replica. We analyzed the GC logs and 
we did not see any long GC pauses around the time the exception was thrown.

For 24 hours of gc logs, the max full gc pause = 3 seconds, but it is not in the same 
time as the Connection reset exception.

My question again :

Why does the leader not retry to forward the documents to the replica when it gets an 
IOException in SolrCmdDistributor ? Instead, it sents a recovery request to 
the replica.

Solr version is 4.5.1 (we tried 4.6.0, but we had some problems with this version, 
detailed in the mail : SolrCloud 4.6.0 - leader election issue).

Related to Connection reset exception, is there any other tests that we can 
do in order to the cause of it ?

Thank you,
Anca

On 12/20/2013 11:07 AM, Anca Kopetz wrote:

Hi,

We used to have many Client session timeout messages in solr logs.

INFO org.apache.zookeeper.ClientCnxn:run:1083 - Client session timed out, have 
not heard from server in 18461ms for sessionid 0x242047fc6d77804, closing 
socket connection and attempting reconnect

Then we set the zkClientTimeout to 30 seconds. Therefore there are no more 
messages of this type in the logs.

But we get some other messages :

Leader logs (solr-39):

2013-12-18 09:45:26,052 [http-8083-74] ERROR 
org.apache.solr.update.SolrCmdDistributor:log:119  - shard update error 
StdNode: 
http://solr-40/searchsolrnodees/es_blue/:org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at: 
http://solr-40/searchsolrnodees/es_blue
...
Caused by: java.net.SocketException: Connection reset

2013-12-18 09:45:26,060 [http-8083-74] INFO  
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating new http 
client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false
2013-12-18 09:45:26,140 [http-8083-49] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleWaitForStateAction:819  - 
Going to wait for coreNodeName: solr-40_searchsolrnodees_es_blue, state: 
recovering, checkLive: true, onlyIfLeader: true

Replica logs (solr-40) :

2013-12-18 09:45:26,083 [http-8083-65] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-18 09:45:26,091 [http-8083-65] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERYcore=es_bluewt=javabinversion=2} status=0 
QTime=8
...
2013-12-18 09:45:29,190 [RecoveryThread] WARN  
org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=es_blue 
url=http://dc2-s6-prod-solr-40.prod.dc2.kelkoo.net:8083/searchsolrnodees too 
many updates received since start - startingUpdates no longer overlaps with our 
currentUpdates
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery was 
not successful - trying replication. core=es_blue
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
Recovery. core=es_blue

Therefore, if I understand it right, the leader does not manage to foward the documents 
to the replica due to a Connection reset problem, and it asks the replica to 
recover. The replica tries to recover, it fails, and it starts a replication recovery.

Why does the leader not retry to forward the documents to the replica when it 
gets an IOException in SolrCmdDistributor ?

Solr version is 4.5.1

For the Garbage Collector, we use the settings defined here 
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Thank you,
Anca

On 12/19/2013 04:50 PM, Mark Miller wrote:

Sounds like you need to raise your ZooKeeper connection timeout.

Also, make sure you are using a concurrent garbage collector as a side note - 
stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz 
anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.com
 wrote:



Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to 
time into recovering state, and it takes more than 10 minutes to finish to 
recover.

In logs, we see that PeerSync Recovery fails with the message :

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates 
received since start - startingUpdates no longer overlaps with our 
currentUpdates

Then Replication Recovery starts.

Is there something we can do to avoid the failure of Peer Recovery so that 
the recovery process is more rapid (less than 10 minutes) ?

The full trace log is here :

2013-12-05 13:51:53,740 [http-8080-46] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO  

Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-20 Thread Anca Kopetz

Hi,

We used to have many Client session timeout messages in solr logs.

INFO org.apache.zookeeper.ClientCnxn:run:1083 - Client session timed out, have 
not heard from server in 18461ms for sessionid 0x242047fc6d77804, closing 
socket connection and attempting reconnect

Then we set the zkClientTimeout to 30 seconds. Therefore there are no more 
messages of this type in the logs.

But we get some other messages :

Leader logs (solr-39):

2013-12-18 09:45:26,052 [http-8083-74] ERROR 
org.apache.solr.update.SolrCmdDistributor:log:119  - shard update error 
StdNode: 
http://solr-40/searchsolrnodees/es_blue/:org.apache.solr.client.solrj.SolrServerException:
 IOException occured when talking to server at: 
http://solr-40/searchsolrnodees/es_blue
...
Caused by: java.net.SocketException: Connection reset

2013-12-18 09:45:26,060 [http-8083-74] INFO  
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating new http 
client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false
2013-12-18 09:45:26,140 [http-8083-49] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleWaitForStateAction:819  - 
Going to wait for coreNodeName: solr-40_searchsolrnodees_es_blue, state: 
recovering, checkLive: true, onlyIfLeader: true

Replica logs (solr-40) :

2013-12-18 09:45:26,083 [http-8083-65] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-18 09:45:26,091 [http-8083-65] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERYcore=es_bluewt=javabinversion=2} status=0 
QTime=8
...
2013-12-18 09:45:29,190 [RecoveryThread] WARN  
org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=es_blue 
url=http://dc2-s6-prod-solr-40.prod.dc2.kelkoo.net:8083/searchsolrnodees too 
many updates received since start - startingUpdates no longer overlaps with our 
currentUpdates
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery was 
not successful - trying replication. core=es_blue
2013-12-18 09:45:29,191 [RecoveryThread] INFO  
org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
Recovery. core=es_blue

Therefore, if I understand it right, the leader does not manage to foward the documents 
to the replica due to a Connection reset problem, and it asks the replica to 
recover. The replica tries to recover, it fails, and it starts a replication recovery.

Why does the leader not retry to forward the documents to the replica when it 
gets an IOException in SolrCmdDistributor ?

Solr version is 4.5.1

For the Garbage Collector, we use the settings defined here 
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Thank you,
Anca

On 12/19/2013 04:50 PM, Mark Miller wrote:

Sounds like you need to raise your ZooKeeper connection timeout.

Also, make sure you are using a concurrent garbage collector as a side note - 
stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz 
anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.com wrote:



Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to 
time into recovering state, and it takes more than 10 minutes to finish to 
recover.

In logs, we see that PeerSync Recovery fails with the message :

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates 
received since start - startingUpdates no longer overlaps with our 
currentUpdates

Then Replication Recovery starts.

Is there something we can do to avoid the failure of Peer Recovery so that 
the recovery process is more rapid (less than 10 minutes) ?

The full trace log is here :

2013-12-05 13:51:53,740 [http-8080-46] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO  
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705  
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 
QTime=0
2013-12-05 13:51:53,740 [Thread-1544] INFO  
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
state=recovering
2013-12-05 13:51:53,741 [http-8080-46] INFO  
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] webapp=null 
path=/admin/cores 
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 
QTime=1
2013-12-05 13:51:53,740 [Thread-1543] INFO  
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
state=recovering
2013-12-05 13:51:53,743 [Thread-1544] INFO  
org.apache.solr.cloud.ZkController:publish:1021 

Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-20 Thread Anca Kopetz

Hi,

We do not use a NRT solution. We do a hardCommit (autocommit with
openSearcher=false) every 15 minutes, and updates with commitWithin
every 30 minutes. The updates are done one by one by many feeders, we do
not send batches of documents.

Solr version : 4.5.1

Yes, the problem is before the recovery, as I explained in the reply to
Mark's mail.

Best regards,
Anca Kopetz

On 12/19/2013 06:39 PM, Daniel Collins wrote:

Are you using a NRT solution, how often do you commit?  We see similar
issues with PeerSync, but then we have a very active NRT system and we
soft-commit sub-second, so since PeerSync has a limit of 100 versions
before it decides its too much to do, if we try and PeerSync whilst
indexing is running, we inevitably have to fallback to a full-sync as this
does.

What Solr version are you using?  There were issues with early 4.X (up to
4.3 wasn't it, I can't find the ticket now?) whhereby if PeerSync failed,
it did a fullCopy when what it should do is an incremental copy (i.e. only
the segments that are missing) which again might hurt you.

But surely your real problem isn't that the recovery taking a long time,
your problem is why did the system enter recovery in the first place (which
is the bit *just before* the trace you gave us!)
The first line is It has been requested that we recover (aren't Solr
trace message polite), which I know from recent experience means either the
leader thinks this replica is out of date or you just had a leadership
transfer.  As mark says, the root cause of that is probably a ZK timeout
issue



On 19 December 2013 15:50, Mark Millermarkrmil...@gmail.com  wrote:


Sounds like you need to raise your ZooKeeper connection timeout.

Also, make sure you are using a concurrent garbage collector as a side
note - stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetzanca.kop...@kelkoo.com  wrote:


Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from

time to time into recovering state, and it takes more than 10 minutes to
finish to recover.

In logs, we see that PeerSync Recovery fails with the message :

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many

updates received since start - startingUpdates no longer overlaps with our
currentUpdates

Then Replication Recovery starts.

Is there something we can do to avoid the failure of Peer Recovery so

that the recovery process is more rapid (less than 10 minutes) ?

The full trace log is here :

2013-12-05 13:51:53,740 [http-8080-46] INFO

  org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
  - It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO

  org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
  - It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO

  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
QTime=0

2013-12-05 13:51:53,740 [Thread-1544] INFO

  org.apache.solr.cloud.ZkController:publish:1017  - publishing
core=fr_green state=recovering

2013-12-05 13:51:53,741 [http-8080-46] INFO

  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
QTime=1

2013-12-05 13:51:53,740 [Thread-1543] INFO

  org.apache.solr.cloud.ZkController:publish:1017  - publishing
core=fr_green state=recovering

2013-12-05 13:51:53,743 [Thread-1544] INFO

  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,746 [Thread-1543] INFO

  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,755 [Thread-1543] WARN

  org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green

2013-12-05 13:51:53,756 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,762 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
process. core=fr_green

2013-12-05 13:51:53,762 [RecoveryThread] INFO

  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,765 [RecoveryThread] INFO

  org.apache.solr.cloud.ZkController:publish:1017  - publishing
core=fr_green state=recovering

2013-12-05 13:51:53,765 [RecoveryThread] INFO

  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,767 [RecoveryThread] INFO

  

Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-19 Thread Anca Kopetz

Hi,

Thank you for the link, it does not seem to be the same problem ...

Best regards,
Anca

On 12/18/2013 11:41 PM, Furkan KAMACI wrote:

Hi Anca;

Could you check the conversation at here:
http://lucene.472066.n3.nabble.com/ColrCloud-IOException-occured-when-talking-to-server-at-td4061831.html

Thanks;
Furkan KAMACI


18 Aralık 2013 Çarşamba tarihinde Anca Kopetzanca.kop...@kelkoo.com  adlı
kullanıcı şöyle yazdı:

Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from

time to time into recovering state, and it takes more than 10 minutes to
finish to recover.

In logs, we see that PeerSync Recovery fails with the message :

PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many

updates received since start - startingUpdates no longer overlaps with our
currentUpdates

Then Replication Recovery starts.

Is there something we can do to avoid the failure of Peer Recovery so

that the recovery process is more rapid (less than 10 minutes) ?

The full trace log is here :

2013-12-05 13:51:53,740 [http-8080-46] INFO

org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
- It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO

org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
- It has been requested that we recover

2013-12-05 13:51:53,740 [http-8080-112] INFO

org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
QTime=0

2013-12-05 13:51:53,740 [Thread-1544] INFO

org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green
state=recovering

2013-12-05 13:51:53,741 [http-8080-46] INFO

org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
QTime=1

2013-12-05 13:51:53,740 [Thread-1543] INFO

org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green
state=recovering

2013-12-05 13:51:53,743 [Thread-1544] INFO

org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,746 [Thread-1543] INFO

org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,755 [Thread-1543] WARN

org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green

2013-12-05 13:51:53,756 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,762 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
process. core=fr_green

2013-12-05 13:51:53,762 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false

2013-12-05 13:51:53,765 [RecoveryThread] INFO

org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green
state=recovering

2013-12-05 13:51:53,765 [RecoveryThread] INFO

org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property

2013-12-05 13:51:53,767 [RecoveryThread] INFO

org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
Creating new http client,
config:maxConnections=128maxConnectionsPerHost=32followRedirects=false

2013-12-05 13:51:54,777 [main-EventThread] INFO

org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)

2013-12-05 13:51:56,804 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to
PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green -
recoveringAfterStartup=false

2013-12-05 13:51:56,806 [RecoveryThread] WARN

org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green url=
http://solr-08/searchsolrnodefr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates

2013-12-05 13:51:56,806 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery
was not successful - trying replication. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting
Replication Recovery. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering
updates. core=fr_green

2013-12-05 13:51:56,806 [RecoveryThread] INFO

org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to
replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green

Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-19 Thread Mark Miller
Sounds like you need to raise your ZooKeeper connection timeout. 

Also, make sure you are using a concurrent garbage collector as a side note - 
stop the world pauses should be avoided. Just good advice :)

- Mark

On Dec 18, 2013, at 5:48 AM, Anca Kopetz anca.kop...@kelkoo.com wrote:

 Hi,
 
 In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to 
 time into recovering state, and it takes more than 10 minutes to finish to 
 recover.
 
 In logs, we see that PeerSync Recovery fails with the message :
 
 PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates 
 received since start - startingUpdates no longer overlaps with our 
 currentUpdates
 
 Then Replication Recovery starts. 
 
 Is there something we can do to avoid the failure of Peer Recovery so that 
 the recovery process is more rapid (less than 10 minutes) ?
 
 The full trace log is here : 
 
 2013-12-05 13:51:53,740 [http-8080-46] INFO  
 org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
   - It has been requested that we recover
 2013-12-05 13:51:53,740 [http-8080-112] INFO  
 org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
   - It has been requested that we recover
 2013-12-05 13:51:53,740 [http-8080-112] INFO  
 org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] 
 webapp=null path=/admin/cores 
 params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 
 QTime=0
 2013-12-05 13:51:53,740 [Thread-1544] INFO  
 org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
 state=recovering
 2013-12-05 13:51:53,741 [http-8080-46] INFO  
 org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  - [admin] 
 webapp=null path=/admin/cores 
 params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 
 QTime=1
 2013-12-05 13:51:53,740 [Thread-1543] INFO  
 org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
 state=recovering
 2013-12-05 13:51:53,743 [Thread-1544] INFO  
 org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
 descriptor - reading it from system property
 2013-12-05 13:51:53,746 [Thread-1543] INFO  
 org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
 descriptor - reading it from system property
 2013-12-05 13:51:53,755 [Thread-1543] WARN  
 org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for 
 zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
 2013-12-05 13:51:53,756 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
 core=fr_green recoveringAfterStartup=false
 2013-12-05 13:51:53,762 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery 
 process. core=fr_green
 2013-12-05 13:51:53,762 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery process.  
 core=fr_green recoveringAfterStartup=false
 2013-12-05 13:51:53,765 [RecoveryThread] INFO  
 org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green 
 state=recovering
 2013-12-05 13:51:53,765 [RecoveryThread] INFO  
 org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on 
 descriptor - reading it from system property
 2013-12-05 13:51:53,767 [RecoveryThread] INFO  
 org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  - Creating 
 new http client, 
 config:maxConnections=128maxConnectionsPerHost=32followRedirects=false
 2013-12-05 13:51:54,777 [main-EventThread] INFO  
 org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state 
 change: WatchedEvent state:SyncConnected type:NodeDataChanged 
 path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
 2013-12-05 13:51:56,804 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to 
 PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - 
 recoveringAfterStartup=false
 2013-12-05 13:51:56,806 [RecoveryThread] WARN  
 org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green 
 url=http://solr-08/searchsolrnodefr too many updates received since start - 
 startingUpdates no longer overlaps with our currentUpdates
 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery 
 was not successful - trying replication. core=fr_green
 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting Replication 
 Recovery. core=fr_green
 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering 
 updates. core=fr_green
 2013-12-05 13:51:56,806 [RecoveryThread] INFO  
 org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to 
 replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
 

Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-19 Thread Daniel Collins
Are you using a NRT solution, how often do you commit?  We see similar
issues with PeerSync, but then we have a very active NRT system and we
soft-commit sub-second, so since PeerSync has a limit of 100 versions
before it decides its too much to do, if we try and PeerSync whilst
indexing is running, we inevitably have to fallback to a full-sync as this
does.

What Solr version are you using?  There were issues with early 4.X (up to
4.3 wasn't it, I can't find the ticket now?) whhereby if PeerSync failed,
it did a fullCopy when what it should do is an incremental copy (i.e. only
the segments that are missing) which again might hurt you.

But surely your real problem isn't that the recovery taking a long time,
your problem is why did the system enter recovery in the first place (which
is the bit *just before* the trace you gave us!)
The first line is It has been requested that we recover (aren't Solr
trace message polite), which I know from recent experience means either the
leader thinks this replica is out of date or you just had a leadership
transfer.  As mark says, the root cause of that is probably a ZK timeout
issue



On 19 December 2013 15:50, Mark Miller markrmil...@gmail.com wrote:

 Sounds like you need to raise your ZooKeeper connection timeout.

 Also, make sure you are using a concurrent garbage collector as a side
 note - stop the world pauses should be avoided. Just good advice :)

 - Mark

 On Dec 18, 2013, at 5:48 AM, Anca Kopetz anca.kop...@kelkoo.com wrote:

  Hi,
 
  In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from
 time to time into recovering state, and it takes more than 10 minutes to
 finish to recover.
 
  In logs, we see that PeerSync Recovery fails with the message :
 
  PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many
 updates received since start - startingUpdates no longer overlaps with our
 currentUpdates
 
  Then Replication Recovery starts.
 
  Is there something we can do to avoid the failure of Peer Recovery so
 that the recovery process is more rapid (less than 10 minutes) ?
 
  The full trace log is here :
 
  2013-12-05 13:51:53,740 [http-8080-46] INFO
  
 org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
  - It has been requested that we recover
  2013-12-05 13:51:53,740 [http-8080-112] INFO
  
 org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
  - It has been requested that we recover
  2013-12-05 13:51:53,740 [http-8080-112] INFO
  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
 [admin] webapp=null path=/admin/cores
 params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
 QTime=0
  2013-12-05 13:51:53,740 [Thread-1544] INFO
  org.apache.solr.cloud.ZkController:publish:1017  - publishing
 core=fr_green state=recovering
  2013-12-05 13:51:53,741 [http-8080-46] INFO
  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
 [admin] webapp=null path=/admin/cores
 params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
 QTime=1
  2013-12-05 13:51:53,740 [Thread-1543] INFO
  org.apache.solr.cloud.ZkController:publish:1017  - publishing
 core=fr_green state=recovering
  2013-12-05 13:51:53,743 [Thread-1544] INFO
  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
 descriptor - reading it from system property
  2013-12-05 13:51:53,746 [Thread-1543] INFO
  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
 descriptor - reading it from system property
  2013-12-05 13:51:53,755 [Thread-1543] WARN
  org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
 zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
  2013-12-05 13:51:53,756 [RecoveryThread] INFO
  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
 process.  core=fr_green recoveringAfterStartup=false
  2013-12-05 13:51:53,762 [RecoveryThread] INFO
  org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
 process. core=fr_green
  2013-12-05 13:51:53,762 [RecoveryThread] INFO
  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
 process.  core=fr_green recoveringAfterStartup=false
  2013-12-05 13:51:53,765 [RecoveryThread] INFO
  org.apache.solr.cloud.ZkController:publish:1017  - publishing
 core=fr_green state=recovering
  2013-12-05 13:51:53,765 [RecoveryThread] INFO
  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
 descriptor - reading it from system property
  2013-12-05 13:51:53,767 [RecoveryThread] INFO
  org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
 Creating new http client,
 config:maxConnections=128maxConnectionsPerHost=32followRedirects=false
  2013-12-05 13:51:54,777 [main-EventThread] INFO
  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
 change: WatchedEvent state:SyncConnected type:NodeDataChanged
 path:/clusterstate.json, has occurred - 

PeerSync Recovery fails, starting Replication Recovery

2013-12-18 Thread Anca Kopetz



Hi,

In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover.

In logs, we see that PeerSync Recovery fails with the message :

PeerSync: core=fr_green url="" class="moz-txt-link-freetext" href="http://solr-08/searchsolrnodefr">http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates

Then Replication Recovery starts. 

Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ?

The full trace log is here : 

2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action="" status=0 QTime=0
2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering
2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action="" status=0 QTime=1
2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering
2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property
2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property
2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false
2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green
2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false
2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering
2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property
2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false
2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 13:51:56,804 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to PeerSync from

http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - recoveringAfterStartup=false
2013-12-05 13:51:56,806 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url="" class="moz-txt-link-freetext" href="http://solr-08/searchsolrnodefr">http://solr-08/searchsolrnodefr too many updates received since
 start - startingUpdates no longer overlaps with our currentUpdates
2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering updates. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to replicate from

http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false
2013-12-05 13:52:01,203 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:init:211 - No value set for 'pollInterval'. Timer Task not started.
2013-12-05 13:52:01,209 [RecoveryThread] INFO 

Re: PeerSync Recovery fails, starting Replication Recovery

2013-12-18 Thread Furkan KAMACI
Hi Anca;

Could you check the conversation at here:
http://lucene.472066.n3.nabble.com/ColrCloud-IOException-occured-when-talking-to-server-at-td4061831.html

Thanks;
Furkan KAMACI


18 Aralık 2013 Çarşamba tarihinde Anca Kopetz anca.kop...@kelkoo.com adlı
kullanıcı şöyle yazdı:
 Hi,

 In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from
time to time into recovering state, and it takes more than 10 minutes to
finish to recover.

 In logs, we see that PeerSync Recovery fails with the message :

 PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many
updates received since start - startingUpdates no longer overlaps with our
currentUpdates

 Then Replication Recovery starts.

 Is there something we can do to avoid the failure of Peer Recovery so
that the recovery process is more rapid (less than 10 minutes) ?

 The full trace log is here :

 2013-12-05 13:51:53,740 [http-8080-46] INFO
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
- It has been requested that we recover
 2013-12-05 13:51:53,740 [http-8080-112] INFO
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
- It has been requested that we recover
 2013-12-05 13:51:53,740 [http-8080-112] INFO
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
QTime=0
 2013-12-05 13:51:53,740 [Thread-1544] INFO
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green
state=recovering
 2013-12-05 13:51:53,741 [http-8080-46] INFO
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0
QTime=1
 2013-12-05 13:51:53,740 [Thread-1543] INFO
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green
state=recovering
 2013-12-05 13:51:53,743 [Thread-1544] INFO
org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property
 2013-12-05 13:51:53,746 [Thread-1543] INFO
org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property
 2013-12-05 13:51:53,755 [Thread-1543] WARN
org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
 2013-12-05 13:51:53,756 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false
 2013-12-05 13:51:53,762 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
process. core=fr_green
 2013-12-05 13:51:53,762 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
process.  core=fr_green recoveringAfterStartup=false
 2013-12-05 13:51:53,765 [RecoveryThread] INFO
org.apache.solr.cloud.ZkController:publish:1017  - publishing core=fr_green
state=recovering
 2013-12-05 13:51:53,765 [RecoveryThread] INFO
org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
descriptor - reading it from system property
 2013-12-05 13:51:53,767 [RecoveryThread] INFO
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
Creating new http client,
config:maxConnections=128maxConnectionsPerHost=32followRedirects=false
 2013-12-05 13:51:54,777 [main-EventThread] INFO
org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
 2013-12-05 13:51:56,804 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to
PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green -
recoveringAfterStartup=false
 2013-12-05 13:51:56,806 [RecoveryThread] WARN
org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green url=
http://solr-08/searchsolrnodefr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates
 2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery
was not successful - trying replication. core=fr_green
 2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting
Replication Recovery. core=fr_green
 2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering
updates. core=fr_green
 2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to
replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
 2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
Creating new http client,