Re: PeerSync Recovery fails, starting Replication Recovery

Daniel Collins Thu, 19 Dec 2013 09:41:56 -0800

Are you using a NRT solution, how often do you commit?  We see similar
issues with PeerSync, but then we have a very active NRT system and we
soft-commit sub-second, so since PeerSync has a limit of 100 versions
before it decides its too much to do, if we try and PeerSync whilst
indexing is running, we inevitably have to fallback to a full-sync as this
does.


What Solr version are you using?  There were issues with early 4.X (up to
4.3 wasn't it, I can't find the ticket now?) whhereby if PeerSync failed,
it did a fullCopy when what it should do is an incremental copy (i.e. only
the segments that are missing) which again might hurt you.

But surely your real problem isn't that the recovery taking a long time,
your problem is why did the system enter recovery in the first place (which
is the bit *just before* the trace you gave us!)
The first line is "It has been requested that we recover" (aren't Solr
trace message polite), which I know from recent experience means either the
leader thinks this replica is out of date or you just had a leadership
transfer.  As mark says, the root cause of that is probably a ZK timeout
issue



On 19 December 2013 15:50, Mark Miller <markrmil...@gmail.com> wrote:

> Sounds like you need to raise your ZooKeeper connection timeout.
>
> Also, make sure you are using a concurrent garbage collector as a side
> note - stop the world pauses should be avoided. Just good advice :)
>
> - Mark
>
> On Dec 18, 2013, at 5:48 AM, Anca Kopetz <anca.kop...@kelkoo.com> wrote:
>
> > Hi,
> >
> > In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from
> time to time into recovering state, and it takes more than 10 minutes to
> finish to recover.
> >
> > In logs, we see that "PeerSync Recovery" fails with the message :
> >
> > PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many
> updates received since start - startingUpdates no longer overlaps with our
> currentUpdates
> >
> > Then "Replication Recovery" starts.
> >
> > Is there something we can do to avoid the failure of "Peer Recovery" so
> that the recovery process is more rapid (less than 10 minutes) ?
> >
> > The full trace log is here :
> >
> > 2013-12-05 13:51:53,740 [http-8080-46] INFO
>  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>  - It has been requested that we recover
> > 2013-12-05 13:51:53,740 [http-8080-112] INFO
>  
> org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>  - It has been requested that we recover
> > 2013-12-05 13:51:53,740 [http-8080-112] INFO
>  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
> [admin] webapp=null path=/admin/cores
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
> QTime=0
> > 2013-12-05 13:51:53,740 [Thread-1544] INFO
>  org.apache.solr.cloud.ZkController:publish:1017  - publishing
> core=fr_green state=recovering
> > 2013-12-05 13:51:53,741 [http-8080-46] INFO
>  org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
> [admin] webapp=null path=/admin/cores
> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
> QTime=1
> > 2013-12-05 13:51:53,740 [Thread-1543] INFO
>  org.apache.solr.cloud.ZkController:publish:1017  - publishing
> core=fr_green state=recovering
> > 2013-12-05 13:51:53,743 [Thread-1544] INFO
>  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
> descriptor - reading it from system property
> > 2013-12-05 13:51:53,746 [Thread-1543] INFO
>  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
> descriptor - reading it from system property
> > 2013-12-05 13:51:53,755 [Thread-1543] WARN
>  org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
> zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
> > 2013-12-05 13:51:53,756 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
> process.  core=fr_green recoveringAfterStartup=false
> > 2013-12-05 13:51:53,762 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
> process. core=fr_green
> > 2013-12-05 13:51:53,762 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
> process.  core=fr_green recoveringAfterStartup=false
> > 2013-12-05 13:51:53,765 [RecoveryThread] INFO
>  org.apache.solr.cloud.ZkController:publish:1017  - publishing
> core=fr_green state=recovering
> > 2013-12-05 13:51:53,765 [RecoveryThread] INFO
>  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
> descriptor - reading it from system property
> > 2013-12-05 13:51:53,767 [RecoveryThread] INFO
>  org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
> Creating new http client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > 2013-12-05 13:51:54,777 [main-EventThread] INFO
>  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
> change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> > 2013-12-05 13:51:56,804 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to
> PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green -
> recoveringAfterStartup=false
> > 2013-12-05 13:51:56,806 [RecoveryThread] WARN
>  org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green url=
> http://solr-08/searchsolrnodefr too many updates received since start -
> startingUpdates no longer overlaps with our currentUpdates
> > 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery
> was not successful - trying replication. core=fr_green
> > 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting
> Replication Recovery. core=fr_green
> > 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering
> updates. core=fr_green
> > 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to
> replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
> > 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>  org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
> Creating new http client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> > 2013-12-05 13:52:01,203 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:<init>:211  -  No value set for
> 'pollInterval'. Timer Task not started.
> > 2013-12-05 13:52:01,209 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:fetchLatestIndex:369  - Master's
> generation: 4382
> > 2013-12-05 13:52:01,209 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:fetchLatestIndex:370  - Slave's
> generation: 4382
> > 2013-12-05 13:52:01,209 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:fetchLatestIndex:371  - Starting
> replication process
> > 2013-12-05 13:52:01,213 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:fetchLatestIndex:376  - Number of files
> in latest index in master: 158
> > 2013-12-05 13:52:01,213 [RecoveryThread] INFO
>  org.apache.solr.core.CachingDirectoryFactory:get:360  - return new
> directory for
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
> > 2013-12-05 13:52:01,213 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:fetchLatestIndex:402  - Starting
> download to 
> org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7fullCopy=true
> > 2013-12-05 13:52:21,970 [main-EventThread] INFO
>  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
> change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> > 2013-12-05 13:55:15,039 [main-EventThread] INFO
>  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
> change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> > 2013-12-05 13:56:27,160 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:fetchLatestIndex:406  - Total time
> taken for download : 265 secs
> > 2013-12-05 13:56:27,540 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:modifyIndexProps:886  - New index
> installed. Updating index properties... index=index.20131205135201213
> > 2013-12-05 13:56:27,546 [RecoveryThread] INFO
>  org.apache.solr.handler.SnapPuller:fetchLatestIndex:461  - removing old
> index directory 
> org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838
> > 2013-12-05 13:56:27,765 [RecoveryThread] INFO
>  org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening
> Searcher@30a544f3 main
> > 2013-12-05 13:58:52,337 [RecoveryThread] INFO
>  org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235  -
> looking to close
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>]
> > 2013-12-05 13:58:52,337 [RecoveryThread] INFO
>  org.apache.solr.core.CachingDirectoryFactory:close:304  - Closing
> directory:
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
> > 2013-12-05 13:58:52,337 [RecoveryThread] INFO
>  org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279  -
> Removing directory before core close:
> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
> > 2013-12-05 13:58:54,172 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:replay:506  - Replaying buffered
> documents. core=fr_green
> > 2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN
>  org.apache.solr.update.UpdateLog:doReplay:1240  - Starting log replay
> tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091
> refcount=2} active=true starting pos=1103833347
> > 2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO
>  org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening
> Searcher@5e48fb12 main
> > 2013-12-05 14:01:11,908 [main-EventThread] INFO
>  org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
> change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
> > 2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN
>  org.apache.solr.update.UpdateLog:run:1230  - Log replay finished.
> recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0
> positionOfStart=1103833347}
> > 2013-12-05 14:01:22,846 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:410  - Replication
> Recovery was successful - registering as Active. core=fr_green
> > 2013-12-05 14:01:22,846 [RecoveryThread] INFO
>  org.apache.solr.cloud.ZkController:publish:1017  - publishing
> core=fr_green state=active
> > 2013-12-05 14:01:22,850 [RecoveryThread] INFO
>  org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
> descriptor - reading it from system property
> > 2013-12-05 14:01:22,860 [RecoveryThread] INFO
>  org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
> process. core=fr_green
> >
> > Best regards,
> > Anca Kopetz
> > --
> > <generic-logo-no-strap.gif>
> > Anca Kopetz Software engineer
> >
> > E anca.kop...@kelkoo.com  Y!Messenger kelkooancak
> > T +33 (0)4 56 09 07 55
> > A 4/6 Rue des Meridiens 38130 Echirolles
> > <x-bottom-left.gif>   <x-bottom-middle.gif>
> > Kelkoo SAS
> > Société par Actions Simplifiée
> > Au capital de € 4.168.964,30
> > Siège social : 8, rue du Sentier 75002 Paris
> > 425 093 069 RCS Paris
> >
> > Ce message et les pièces jointes sont confidentiels et établis à
> l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
> destinataire de ce message, merci de le détruire et d'en avertir
> l'expéditeur.
>
>

Re: PeerSync Recovery fails, starting Replication Recovery

Reply via email to