Re: PeerSync Recovery fails, starting Replication Recovery
Hi, We tried to understand why we get a Connection reset exception on the leader when it tries to foward the documents to one of its replica. We analyzed the GC logs and we did not see any long GC pauses around the time the exception was thrown. For 24 hours of gc logs, the max full gc pause = 3 seconds, but it is not in the same time as the Connection reset exception. My question again : Why does the leader not retry to forward the documents to the replica when it gets an IOException in SolrCmdDistributor ? Instead, it sents a recovery request to the replica. Solr version is 4.5.1 (we tried 4.6.0, but we had some problems with this version, detailed in the mail : SolrCloud 4.6.0 - leader election issue). Related to Connection reset exception, is there any other tests that we can do in order to the cause of it ? Thank you, Anca On 12/20/2013 11:07 AM, Anca Kopetz wrote: Hi, We used to have many Client session timeout messages in solr logs. INFO org.apache.zookeeper.ClientCnxn:run:1083 - Client session timed out, have not heard from server in 18461ms for sessionid 0x242047fc6d77804, closing socket connection and attempting reconnect Then we set the zkClientTimeout to 30 seconds. Therefore there are no more messages of this type in the logs. But we get some other messages : Leader logs (solr-39): 2013-12-18 09:45:26,052 [http-8083-74] ERROR org.apache.solr.update.SolrCmdDistributor:log:119 - shard update error StdNode: http://solr-40/searchsolrnodees/es_blue/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://solr-40/searchsolrnodees/es_blue ... Caused by: java.net.SocketException: Connection reset 2013-12-18 09:45:26,060 [http-8083-74] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-18 09:45:26,140 [http-8083-49] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleWaitForStateAction:819 - Going to wait for coreNodeName: solr-40_searchsolrnodees_es_blue, state: recovering, checkLive: true, onlyIfLeader: true Replica logs (solr-40) : 2013-12-18 09:45:26,083 [http-8083-65] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-18 09:45:26,091 [http-8083-65] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=es_bluewt=javabinversion=2} status=0 QTime=8 ... 2013-12-18 09:45:29,190 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=es_blue url=http://dc2-s6-prod-solr-40.prod.dc2.kelkoo.net:8083/searchsolrnodees too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-18 09:45:29,191 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=es_blue 2013-12-18 09:45:29,191 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=es_blue Therefore, if I understand it right, the leader does not manage to foward the documents to the replica due to a Connection reset problem, and it asks the replica to recover. The replica tries to recover, it fails, and it starts a replication recovery. Why does the leader not retry to forward the documents to the replica when it gets an IOException in SolrCmdDistributor ? Solr version is 4.5.1 For the Garbage Collector, we use the settings defined here http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning Thank you, Anca On 12/19/2013 04:50 PM, Mark Miller wrote: Sounds like you need to raise your ZooKeeper connection timeout. Also, make sure you are using a concurrent garbage collector as a side note - stop the world pauses should be avoided. Just good advice :) - Mark On Dec 18, 2013, at 5:48 AM, Anca Kopetz anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.com wrote: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO
Re: PeerSync Recovery fails, starting Replication Recovery
Hi, We used to have many Client session timeout messages in solr logs. INFO org.apache.zookeeper.ClientCnxn:run:1083 - Client session timed out, have not heard from server in 18461ms for sessionid 0x242047fc6d77804, closing socket connection and attempting reconnect Then we set the zkClientTimeout to 30 seconds. Therefore there are no more messages of this type in the logs. But we get some other messages : Leader logs (solr-39): 2013-12-18 09:45:26,052 [http-8083-74] ERROR org.apache.solr.update.SolrCmdDistributor:log:119 - shard update error StdNode: http://solr-40/searchsolrnodees/es_blue/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://solr-40/searchsolrnodees/es_blue ... Caused by: java.net.SocketException: Connection reset 2013-12-18 09:45:26,060 [http-8083-74] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-18 09:45:26,140 [http-8083-49] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleWaitForStateAction:819 - Going to wait for coreNodeName: solr-40_searchsolrnodees_es_blue, state: recovering, checkLive: true, onlyIfLeader: true Replica logs (solr-40) : 2013-12-18 09:45:26,083 [http-8083-65] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-18 09:45:26,091 [http-8083-65] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=es_bluewt=javabinversion=2} status=0 QTime=8 ... 2013-12-18 09:45:29,190 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=es_blue url=http://dc2-s6-prod-solr-40.prod.dc2.kelkoo.net:8083/searchsolrnodees too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-18 09:45:29,191 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=es_blue 2013-12-18 09:45:29,191 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=es_blue Therefore, if I understand it right, the leader does not manage to foward the documents to the replica due to a Connection reset problem, and it asks the replica to recover. The replica tries to recover, it fails, and it starts a replication recovery. Why does the leader not retry to forward the documents to the replica when it gets an IOException in SolrCmdDistributor ? Solr version is 4.5.1 For the Garbage Collector, we use the settings defined here http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning Thank you, Anca On 12/19/2013 04:50 PM, Mark Miller wrote: Sounds like you need to raise your ZooKeeper connection timeout. Also, make sure you are using a concurrent garbage collector as a side note - stop the world pauses should be avoided. Just good advice :) - Mark On Dec 18, 2013, at 5:48 AM, Anca Kopetz anca.kop...@kelkoo.commailto:anca.kop...@kelkoo.com wrote: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021
Re: PeerSync Recovery fails, starting Replication Recovery
Hi, We do not use a NRT solution. We do a hardCommit (autocommit with openSearcher=false) every 15 minutes, and updates with commitWithin every 30 minutes. The updates are done one by one by many feeders, we do not send batches of documents. Solr version : 4.5.1 Yes, the problem is before the recovery, as I explained in the reply to Mark's mail. Best regards, Anca Kopetz On 12/19/2013 06:39 PM, Daniel Collins wrote: Are you using a NRT solution, how often do you commit? We see similar issues with PeerSync, but then we have a very active NRT system and we soft-commit sub-second, so since PeerSync has a limit of 100 versions before it decides its too much to do, if we try and PeerSync whilst indexing is running, we inevitably have to fallback to a full-sync as this does. What Solr version are you using? There were issues with early 4.X (up to 4.3 wasn't it, I can't find the ticket now?) whhereby if PeerSync failed, it did a fullCopy when what it should do is an incremental copy (i.e. only the segments that are missing) which again might hurt you. But surely your real problem isn't that the recovery taking a long time, your problem is why did the system enter recovery in the first place (which is the bit *just before* the trace you gave us!) The first line is It has been requested that we recover (aren't Solr trace message polite), which I know from recent experience means either the leader thinks this replica is out of date or you just had a leadership transfer. As mark says, the root cause of that is probably a ZK timeout issue On 19 December 2013 15:50, Mark Millermarkrmil...@gmail.com wrote: Sounds like you need to raise your ZooKeeper connection timeout. Also, make sure you are using a concurrent garbage collector as a side note - stop the world pauses should be avoided. Just good advice :) - Mark On Dec 18, 2013, at 5:48 AM, Anca Kopetzanca.kop...@kelkoo.com wrote: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green 2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,767 [RecoveryThread] INFO
Re: PeerSync Recovery fails, starting Replication Recovery
Hi, Thank you for the link, it does not seem to be the same problem ... Best regards, Anca On 12/18/2013 11:41 PM, Furkan KAMACI wrote: Hi Anca; Could you check the conversation at here: http://lucene.472066.n3.nabble.com/ColrCloud-IOException-occured-when-talking-to-server-at-td4061831.html Thanks; Furkan KAMACI 18 Aralık 2013 Çarşamba tarihinde Anca Kopetzanca.kop...@kelkoo.com adlı kullanıcı şöyle yazdı: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green 2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 13:51:56,804 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - recoveringAfterStartup=false 2013-12-05 13:51:56,806 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url= http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering updates. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
Re: PeerSync Recovery fails, starting Replication Recovery
Sounds like you need to raise your ZooKeeper connection timeout. Also, make sure you are using a concurrent garbage collector as a side note - stop the world pauses should be avoided. Just good advice :) - Mark On Dec 18, 2013, at 5:48 AM, Anca Kopetz anca.kop...@kelkoo.com wrote: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green 2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 13:51:56,804 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - recoveringAfterStartup=false 2013-12-05 13:51:56,806 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering updates. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
Re: PeerSync Recovery fails, starting Replication Recovery
Are you using a NRT solution, how often do you commit? We see similar issues with PeerSync, but then we have a very active NRT system and we soft-commit sub-second, so since PeerSync has a limit of 100 versions before it decides its too much to do, if we try and PeerSync whilst indexing is running, we inevitably have to fallback to a full-sync as this does. What Solr version are you using? There were issues with early 4.X (up to 4.3 wasn't it, I can't find the ticket now?) whhereby if PeerSync failed, it did a fullCopy when what it should do is an incremental copy (i.e. only the segments that are missing) which again might hurt you. But surely your real problem isn't that the recovery taking a long time, your problem is why did the system enter recovery in the first place (which is the bit *just before* the trace you gave us!) The first line is It has been requested that we recover (aren't Solr trace message polite), which I know from recent experience means either the leader thinks this replica is out of date or you just had a leadership transfer. As mark says, the root cause of that is probably a ZK timeout issue On 19 December 2013 15:50, Mark Miller markrmil...@gmail.com wrote: Sounds like you need to raise your ZooKeeper connection timeout. Also, make sure you are using a concurrent garbage collector as a side note - stop the world pauses should be avoided. Just good advice :) - Mark On Dec 18, 2013, at 5:48 AM, Anca Kopetz anca.kop...@kelkoo.com wrote: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green 2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred -
PeerSync Recovery fails, starting Replication Recovery
Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url="" class="moz-txt-link-freetext" href="http://solr-08/searchsolrnodefr">http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action="" status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action="" status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green 2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 13:51:56,804 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - recoveringAfterStartup=false 2013-12-05 13:51:56,806 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url="" class="moz-txt-link-freetext" href="http://solr-08/searchsolrnodefr">http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering updates. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-05 13:52:01,203 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller:init:211 - No value set for 'pollInterval'. Timer Task not started. 2013-12-05 13:52:01,209 [RecoveryThread] INFO
Re: PeerSync Recovery fails, starting Replication Recovery
Hi Anca; Could you check the conversation at here: http://lucene.472066.n3.nabble.com/ColrCloud-IOException-occured-when-talking-to-server-at-td4061831.html Thanks; Furkan KAMACI 18 Aralık 2013 Çarşamba tarihinde Anca Kopetz anca.kop...@kelkoo.com adlı kullanıcı şöyle yazdı: Hi, In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to time into recovering state, and it takes more than 10 minutes to finish to recover. In logs, we see that PeerSync Recovery fails with the message : PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates Then Replication Recovery starts. Is there something we can do to avoid the failure of Peer Recovery so that the recovery process is more rapid (less than 10 minutes) ? The full trace log is here : 2013-12-05 13:51:53,740 [http-8080-46] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 - It has been requested that we recover 2013-12-05 13:51:53,740 [http-8080-112] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=0 2013-12-05 13:51:53,740 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,741 [http-8080-46] INFO org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERYcore=fr_greenwt=javabinversion=2} status=0 QTime=1 2013-12-05 13:51:53,740 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,743 [Thread-1544] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,746 [Thread-1543] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,755 [Thread-1543] WARN org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green 2013-12-05 13:51:53,756 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery process. core=fr_green 2013-12-05 13:51:53,762 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. core=fr_green recoveringAfterStartup=false 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green state=recovering 2013-12-05 13:51:53,765 [RecoveryThread] INFO org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on descriptor - reading it from system property 2013-12-05 13:51:53,767 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client, config:maxConnections=128maxConnectionsPerHost=32followRedirects=false 2013-12-05 13:51:54,777 [main-EventThread] INFO org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 18) 2013-12-05 13:51:56,804 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - recoveringAfterStartup=false 2013-12-05 13:51:56,806 [RecoveryThread] WARN org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url= http://solr-08/searchsolrnodefr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery was not successful - trying replication. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication Recovery. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering updates. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green 2013-12-05 13:51:56,806 [RecoveryThread] INFO org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating new http client,