Hi all, I've run into a reproducible but confusing issue with our Solr 6.6 cluster. (Upgrading to 7.x is an option, but not an immediate one.) This is our staging environment, running on AWS. To save money, we scale the entire stack down to zero instances every night and spin it back up every morning. Here's the process:

SCALE DOWN:
1) Commit and optimize all collections.
2) Back up each collection to a shared volume (using the Collections API).
3) Spin down all (3) Solr instances.
4) Spin down all (2) ZooKeeper instances.

SPIN UP:
1) Spin up the ZooKeeper instances; wait for them to find each other and for the ensemble to stabilize.
2) Spin up the Solr instances; wait for them all to stabilize and for ZooKeeper to recognize them as live nodes.
3) Restore each collection (using the Collections API); see the sketch below.
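For reference, here's a simplified sketch of the Collections API calls involved, including the replica delete/add workaround I describe further down. The host, collection name, backup location, and shard/replica identifiers are placeholders rather than our real config:

    import requests

    SOLR = "http://solr-staging:8983/solr"  # placeholder host
    LOCATION = "/mnt/backups"               # shared volume mount (placeholder)

    def collections_api(params):
        # Every operation here is a plain HTTP call to the Collections API.
        r = requests.get(SOLR + "/admin/collections",
                         params={**params, "wt": "json"})
        r.raise_for_status()
        return r.json()

    def backup(collection):
        # SCALE DOWN step 2: snapshot the collection to the shared volume.
        return collections_api({
            "action": "BACKUP",
            "name": collection + "-nightly",
            "collection": collection,
            "location": LOCATION,
        })

    def restore(collection):
        # SPIN UP step 3: recreate the collection from last night's backup.
        return collections_api({
            "action": "RESTORE",
            "name": collection + "-nightly",
            "collection": collection,
            "location": LOCATION,
        })

    def cycle_leader(collection, shard, leader_replica):
        # The workaround described below: drop the stale leader replica,
        # then add a fresh replica, which syncs from the newly elected leader.
        # leader_replica is e.g. "core_node1", looked up via CLUSTERSTATUS.
        collections_api({
            "action": "DELETEREPLICA",
            "collection": collection,
            "shard": shard,
            "replica": leader_replica,
        })
        collections_api({
            "action": "ADDREPLICA",
            "collection": collection,
            "shard": shard,
        })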
It works ALMOST perfectly. The restore operation reports success, and everything looks great in the Cloud graph view of the admin UI: all green, with one leader and two other active replicas per collection. But once we start sending updates, we run into problems. The two NON-leaders in each collection receive the updates, but the leader never does. Since the instances sit behind a round-robin load balancer, every third query hits an out-of-date core, with unfortunate results for our app, which depends on near-real-time indexing.

Reloading the collection doesn't seem to help, but if I use the Collections API to DELETEREPLICA the leader of each collection and follow it with an ADDREPLICA, everything syncs up (with a new leader) and stays in sync from then on.

I don't know what to look for in my settings or my logs to diagnose or fix this issue. It only affects collections that have been restored from backup. Any suggestions or guidance would be a big help.

Thanks,
Michael

--
Michael B. Klein
Lead Developer, Repository Development and Administration
Northwestern University Libraries