Hi all,

I've encountered a reproducible and confusing issue with our Solr 6.6
cluster. (Updating to 7.x is an option, but not an immediate one.) This is
in our staging environment, running on AWS. To save money, we scale our
entire stack down to zero instances every night and spin it back up every
morning. Here's the process:

SCALE DOWN:
1) Commit & Optimize all collections.
2) Back up each collection to a shared volume using the Collections API
(rough sketch after this list).
3) Spin down all (3) Solr instances.
4) Spin down all (2) ZooKeeper instances.
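
In case the exact calls matter, steps 1 and 2 amount to roughly the
following. This is a sketch rather than our actual script: the use of
Python/requests, the host/port, the backup location path, and the
collection names are all placeholders.

  import requests

  SOLR = "http://localhost:8983/solr"      # any live Solr node
  BACKUP_LOCATION = "/mnt/solr-backups"    # shared volume (placeholder path)
  COLLECTIONS = ["coll_a", "coll_b"]       # placeholder collection names

  for coll in COLLECTIONS:
      # Step 1: hard commit + optimize so the backup captures everything
      resp = requests.get(f"{SOLR}/{coll}/update",
                          params={"commit": "true", "optimize": "true"})
      resp.raise_for_status()
      # Step 2: Collections API BACKUP to the shared volume
      resp = requests.get(f"{SOLR}/admin/collections",
                          params={"action": "BACKUP",
                                  "name": f"{coll}-backup",
                                  "collection": coll,
                                  "location": BACKUP_LOCATION})
      resp.raise_for_status()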

SPIN UP:
1) Spin up ZooKeeper instances; wait for them to find each other and for the
ensemble to stabilize.
2) Spin up Solr instances; wait for them all to stabilize and for ZooKeeper
to recognize them as live nodes.
3) Restore each collection (using the Collections API).
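
Concretely, steps 2 and 3 look roughly like this (same caveats as above:
placeholder names and paths, and the three-node count is just our cluster
size):

  import time
  import requests

  SOLR = "http://localhost:8983/solr"      # any live Solr node
  BACKUP_LOCATION = "/mnt/solr-backups"    # same shared volume as the backups
  COLLECTIONS = ["coll_a", "coll_b"]       # placeholder collection names

  # Step 2: wait until ZooKeeper reports all three Solr nodes as live
  while True:
      status = requests.get(f"{SOLR}/admin/collections",
                            params={"action": "CLUSTERSTATUS"}).json()
      if len(status["cluster"]["live_nodes"]) >= 3:
          break
      time.sleep(5)

  # Step 3: restore each collection from the shared volume
  for coll in COLLECTIONS:
      resp = requests.get(f"{SOLR}/admin/collections",
                          params={"action": "RESTORE",
                                  "name": f"{coll}-backup",
                                  "collection": coll,
                                  "location": BACKUP_LOCATION})
      resp.raise_for_status()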

It works ALMOST perfectly. The restore operation reports success, and the
Cloud graph view in the admin UI looks great: all green, with one leader and
two other active replicas per collection.

But once we start updating, we run into problems. The two NON-leaders in
each collection receive the updates, but the leader never does. Since the
instances sit behind a round-robin load balancer, every third query hits an
out-of-date core, with unfortunate results for our app, which depends on
near-real-time indexing.
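
One way to confirm which core is stale (again a sketch; the core URLs are
placeholders and would really come from CLUSTERSTATUS) is to query each
replica's core directly with distrib=false and compare numFound:

  import requests

  # Placeholder core URLs; the real ones come from CLUSTERSTATUS
  REPLICA_CORES = [
      "http://solr-1:8983/solr/coll_a_shard1_replica1",
      "http://solr-2:8983/solr/coll_a_shard1_replica2",
      "http://solr-3:8983/solr/coll_a_shard1_replica3",
  ]

  for core in REPLICA_CORES:
      resp = requests.get(f"{core}/select",
                          params={"q": "*:*", "rows": 0,
                                  "distrib": "false",  # this core only, no fan-out
                                  "wt": "json"})
      print(core, resp.json()["response"]["numFound"])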

Reloading the collection doesn't seem to help, but if I use the Collections
API to DELETEREPLICA the leader of each collection and follow it with an
ADDREPLICA, everything syncs up (with a new leader) and stays in sync from
there on out.
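
For reference, the workaround amounts to two Collections API calls per
collection, roughly as below (collection, shard, and replica names are
placeholders; the leader's replica name comes from CLUSTERSTATUS):

  import requests

  SOLR = "http://localhost:8983/solr"
  COLLECTION = "coll_a"          # placeholder
  SHARD = "shard1"               # placeholder
  LEADER_REPLICA = "core_node1"  # stale leader's replica name (from CLUSTERSTATUS)

  # Delete the stale leader...
  requests.get(f"{SOLR}/admin/collections",
               params={"action": "DELETEREPLICA",
                       "collection": COLLECTION,
                       "shard": SHARD,
                       "replica": LEADER_REPLICA}).raise_for_status()

  # ...then add a fresh replica, which recovers from the newly elected leader
  requests.get(f"{SOLR}/admin/collections",
               params={"action": "ADDREPLICA",
                       "collection": COLLECTION,
                       "shard": SHARD}).raise_for_status()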

I don't know what to look for in my settings or my logs to diagnose or try
to fix this issue. It only affects collections that have been restored from
backup. Any suggestions or guidance would be a big help.

Thanks,
Michael

-- 
Michael B. Klein
Lead Developer, Repository Development and Administration
Northwestern University Libraries
