Solr should _not_ have to replicate the index or go into peer sync on startup.

> are you stopping indexing before you shut your servers down?

> Be very sure you have passed your autocommit interval after you've stopped 
> indexing and before you stop Solr.

> How are you shutting down? bin/solr stop is the proper way, and you should 
> NOT be seeing any kind of message about force killing the Solr process. There 
> are timeouts in solr.in.sh (unix) that you can lengthen if you are seeing any 
> message about force killing.

> Bring all the nodes back up before you start indexing again.

With the above, you should not see peer sync on startup. Of course you
have to have all your Solr nodes in the "active" state before shutting
down.

 That said, some observations:

> your indexes are very large. Very very large  given your hardware. You have 
> roughly 700M docs per replica,  I'm not quite sure whether your 2.1T is 
> aggregate across all 6 replicas or not. Even if so (and that's the best 
> case), each replica is roughly 350G in size.You have, at best, 700G per 
> server trying to run with 16G of RAM. This is extremely undersized in my 
> experience.

> I suspect you're hitting a lot of GC. What kind of GC observations have you 
> made?

> I would turn off autoscaling, your index sizes and hardware are too small to 
> even consider automatically moving that much data around.

In summary, this seems on the surface to be an extremely undersized
system, and unless and until you properly size it you'll have problems

Best,
Erick
On Wed, Dec 5, 2018 at 10:12 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
>
> Hi Kevin,
>
> We do have logs. Grepping for peersync, I can see
>
> solr               | 2018-12-05 03:31:41.301 INFO
> (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
> [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
> o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
> http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
> http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
> nUpdates=100
> solr               | 2018-12-05 03:31:41.307 INFO
> (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
> [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
> o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
> http://solr.node2.metaonly01.eca.local:8983/solr DONE.  We have no
> versions.  sync failed.
> solr               | 2018-12-05 03:31:44.007 INFO
> (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
> [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
> o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
> http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
> http://solr.node4.metaonly01.eca.local:8983/solr/iglshistory_shard2_replica_n6/]
> nUpdates=100
> solr               | 2018-12-05 03:31:44.008 INFO
> (coreZkRegister-1-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:iglshistory_shard2_replica_n4 c:iglshistory s:shard2 r:core_node7)
> [c:iglshistory s:shard2 r:core_node7 x:iglshistory_shard2_replica_n4]
> o.a.s.u.PeerSync PeerSync: core=iglshistory_shard2_replica_n4 url=
> http://solr.node2.metaonly01.eca.local:8983/solr DONE.  We have no
> versions.  sync failed.
> solr               | 2018-12-05 05:48:44.202 INFO
> (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
> r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
> core=igls_shard1_replica_n1 url=
> http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
> http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/]
> nUpdates=100
> solr               | 2018-12-05 05:48:44.646 WARN
> (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
> r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
> core=igls_shard1_replica_n1 url=
> http://solr.node2.metaonly01.eca.local:8983/solr  got a 503 from
> http://solr.node4.metaonly01.eca.local:8983/solr/igls_shard1_replica_n2/,
> counting as success
> solr               | 2018-12-05 05:48:44.682 INFO
> (coreZkRegister-1-thread-3-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard1_replica_n1 c:igls s:shard1 r:core_node3) [c:igls s:shard1
> r:core_node3 x:igls_shard1_replica_n1] o.a.s.u.PeerSync PeerSync:
> core=igls_shard1_replica_n1 url=
> http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync succeeded
> solr               | 2018-12-05 05:51:18.369 INFO  (qtp2131952342-10232)
> [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
> PeerSync: core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
> http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/]
> nUpdates=100
> solr               | 2018-12-05 06:01:18.496 WARN  (qtp2131952342-10232)
> [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
> Replica did not return a fingerprint - possibly an older Solr version or
> exception
> solr               | 2018-12-05 06:01:31.126 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
> to PeerSync from [
> http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
> recoveringAfterStartup=[true]
> solr               | 2018-12-05 06:01:31.129 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
> PeerSync: core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr START leader=
> http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
> nUpdates=100
> solr               | 2018-12-05 06:11:18.535 WARN  (qtp2131952342-10232)
> [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
> PeerSync: core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr  exception talking to
> http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/,
> failed
> solr               | 2018-12-05 06:11:18.583 INFO  (qtp2131952342-10232)
> [c:igls s:shard2 r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync
> PeerSync: core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
> solr               |    at
> org.apache.solr.update.PeerSyncWithLeader.request(PeerSyncWithLeader.java:321)
> solr               |    at
> org.apache.solr.update.PeerSyncWithLeader.alreadyInSync(PeerSyncWithLeader.java:341)
> solr               |    at
> org.apache.solr.update.PeerSyncWithLeader.sync(PeerSyncWithLeader.java:135)
> solr               | 2018-12-05 09:21:31.831 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy Attempting
> to PeerSync from [
> http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/] -
> recoveringAfterStartup=[true]
> solr               | 2018-12-05 09:21:31.831 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
> PeerSync: core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr START leader=
> http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
> nUpdates=100
> solr               | 2018-12-05 12:28:26.698 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
> Fingerprint comparison result: 1
> solr               | 2018-12-05 12:28:26.698 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
> fingerprint: {maxVersionSpecified=9223372036854775807,
> maxVersionEncountered=1619003108526915584, maxInHash=1619003108526915584,
> versionsHash=-2647181038600325382, numVersions=701868706,
> numDocs=701868706, maxDoc=13478974}, Our fingerprint:
> {maxVersionSpecified=9223372036854775807,
> maxVersionEncountered=1618994939968880640, maxInHash=1618994939968880640,
> versionsHash=-2861577316323569427, numVersions=701868706,
> numDocs=701868706, maxDoc=13478974}
> solr               | 2018-12-05 12:30:38.183 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader Leader
> fingerprint {maxVersionSpecified=9223372036854775807,
> maxVersionEncountered=1619014810634027008, maxInHash=1619014810634027008,
> versionsHash=-5367247708654169186, numVersions=701868706,
> numDocs=701868706, maxDoc=13478974}
> solr               | 2018-12-05 12:30:38.183 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
> PeerSync: core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr  Received 13 versions from
> http://solr.node3.metaonly01.eca.local:8983/solr/igls_shard2_replica_n4/
> solr               | 2018-12-05 12:30:38.184 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSync PeerSync:
> core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr  Our versions are too old.
> ourHighThreshold=1618976334275411968 otherLowThreshold=1619012608476577792
> ourHighest=1619014810634027008 otherHighest=1619014936356192256
> solr               | 2018-12-05 12:30:38.185 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.u.PeerSyncWithLeader
> PeerSync: core=igls_shard2_replica_n6 url=
> http://solr.node2.metaonly01.eca.local:8983/solr DONE. sync failed
> solr               | 2018-12-05 12:30:38.185 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr.node2.metaonly01.eca.local:8983_solr
> x:igls_shard2_replica_n6 c:igls s:shard2 r:core_node9) [c:igls s:shard2
> r:core_node9 x:igls_shard2_replica_n6] o.a.s.c.RecoveryStrategy PeerSync
> Recovery was not successful - trying replication.
>
>
>
> Expanding one of them, we have:
>
>
> solr               | 2018-12-05 03:31:39.421 INFO
> (AutoscalingActionExecutor-8-thread-1) [   ] o.a.s.c.a.ExecutePlanAction No
> operations to execute for event: {
> solr               |   "id":"11402ae38efc8T1dr1vcl6ut11qh9fq6ncle94t",
> solr               |   "source":".auto_add_replicas",
> solr               |   "eventTime":303476722167752,
> solr               |   "eventType":"NODELOST",
> solr               |   "properties":{
> solr               |     "eventTimes":[303476722167752],
> solr               |     "preferredOperation":"movereplica",
> solr               |     "_enqueue_time_":303959110445696,
> solr               |
>  "nodeNames":["solr.node4.metaonly01.eca.local:8983_solr"]}}
> solr               | 2018-12-05 03:31:41.290 INFO
> (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
> [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
> o.a.s.c.ShardLeaderElectionContext Enough replicas found to continu
> e.
> solr               | 2018-12-05 03:31:41.291 INFO
> (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
> [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
> o.a.s.c.ShardLeaderElectionContext I may be the new leader - try an
> d sync
> solr               | 2018-12-05 03:31:41.291 INFO
> (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
> [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
> o.a.s.c.SyncStrategy Sync replicas to http://solr.node2.metaonly01.
> eca.local:8983/solr/iglshistory_shard3_replica_n10/
> solr               | 2018-12-05 03:31:41.301 INFO
> (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
> [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
> o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
> http://solr.node2.metaonly01.eca.local:8983/solr START replicas=[
> http://solr.node3.metaonly01.eca.local:8983/solr/iglshistory_shard3_replica_n8/]
> nUpdates=100
> solr               | 2018-12-05 03:31:41.307 INFO
> (coreZkRegister-1-thread-2-processing-n:solr.node2.metaonly01.eca.local:8983_solr)
> [c:iglshistory s:shard3 r:core_node12 x:iglshistory_shard3_replica_n10]
> o.a.s.u.PeerSync PeerSync: core=iglshistory_shard3_replica_n10 url=
> http://solr.node2.metaonly01.eca.local:8983/solr DONE.  We have no
> versions.  sync failed.
>
>
> Not quite sure why it would be executing the auto-add-replicas trigger
> since it's definitely set to false on the collection. (I checked).  Also,
> this log comes from very early in the startup procedure, probably before
> solr has been launched on the other servers.
>
>
>
> Kyle
>
> On Wed, 5 Dec 2018 at 12:58, Kevin Risden <kris...@apache.org> wrote:
>
> > Do you have logs right before the following?
> >
> > "we notice that the nodes go into "Recovering" state for about 10-12 hours
> > before finally coming alive."
> >
> > Is there a peersync failure or something else in the logs indicating why
> > there is a full recovery?
> >
> > Kevin Risden
> >
> >
> > On Wed, Dec 5, 2018 at 12:53 PM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > We have a collection:
> > >   - solr 7.5
> > >   - 3 shards, replication factor 2 for a total of 6 NRT replicas
> > >   - 3 servers, 16GB ram each
> > >   - 2 billion documents
> > >   - autoAddReplicas: false
> > >   - 2.1 TB on-disk index size
> > >   - index stored on hdfs on separate servers.
> > >
> > > If we (gracefully) shut down solr on all 3 servers, when we re-launch
> > solr
> > > we notice that the nodes go into "Recovering" state for about 10-12 hours
> > > before finally coming alive.
> > >
> > > During this recovery time, we notice high network traffic outbound from
> > our
> > > HDFS servers to our solr servers. The sum total of which is roughly
> > > equivalent to the index size on disk.
> > >
> > > So it seems to us that on startup, solr has to re-read the entire index
> > > before coming back alive.
> > >
> > > 1. is this assumption correct?
> > > 2. is there any way to mitigate this, so that solr can launch faster?
> > >
> > > Thanks!
> > >
> > > Kyle
> > >
> >

Reply via email to