On 5/2/2018 3:52 PM, Michael B. Klein wrote:
> It works ALMOST perfectly. The restore operation reports success, and if I
> look at the UI, everything looks great in the Cloud graph view. All green,
> one leader and two other active instances per collection.
>
> But once we start updating, we run into problems. The two NON-leaders in
> each collection get the updates, but the leader never does. Since the
> instances are behind a round robin load balancer, every third query hits an
> out-of-date core, with unfortunate (for our near-real-time indexing
> dependent app) results.

That is completely backwards from what I would expect in a problem
report.  The leader coordinates all indexing, so if the two other
replicas are getting the updates, that means that at least part of the
functionality of the leader replica *IS* working.

Side FYI: Unless you're using preferLocalShards=true, Solr will actually
load balance your load balanced requests.  If your external load
balancer sends queries to replica1, replica1 may forward the request to
replica3 because of SolrCloud's own internal load balancing.  The
preferLocalShards parameter will keep that from happening *if* the
machine receiving the query has the replicas required to satisfy the query.
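
As a rough sketch of what that looks like on the request side, the
snippet below builds a select URL that carries preferLocalShards=true.
The host, port, and collection name are placeholders and not taken from
your setup, and the helper name build_select_url is made up for this
example.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, collection: str, query: str,
                     prefer_local: bool = True) -> str:
    """Build a Solr select URL, optionally adding preferLocalShards=true.

    base_url and collection are placeholders supplied by the caller.
    """
    params = {"q": query, "wt": "json"}
    if prefer_local:
        # Ask the receiving node to use its own replicas when it has them.
        params["preferLocalShards"] = "true"
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection, for illustration only.
url = build_select_url("http://localhost:8983/solr", "mycollection", "*:*")
```

The parameter only helps when the node that receives the query hosts
the replicas it needs, so it complements an external load balancer
rather than replacing it.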

> Reloading the collection doesn't seem to help, but if I use the Collections
> API to DELETEREPLICA the leader of each collection and follow it with an
> ADDREPLICA, everything syncs up (with a new leader) and stays in sync from
> there on out.
>
> I don't know what to look for in my settings or my logs to diagnose or try
> to fix this issue. It only affects collections that have been restored from
> backup. Any suggestions or guidance would be a big help.
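
For reference, the DELETEREPLICA / ADDREPLICA workaround described
above can be sketched as two Collections API URLs.  The snippet only
builds the URLs and sends nothing.  The base URL, collection, shard,
and replica names are placeholders, and the helper names are made up
for this example.

```python
from urllib.parse import urlencode


def collections_api_url(base_url: str, action: str, **params: str) -> str:
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


def replace_replica_urls(base_url: str, collection: str, shard: str,
                         replica: str) -> list:
    """Return the DELETEREPLICA URL followed by the ADDREPLICA URL."""
    delete_url = collections_api_url(
        base_url, "DELETEREPLICA",
        collection=collection, shard=shard, replica=replica)
    add_url = collections_api_url(
        base_url, "ADDREPLICA", collection=collection, shard=shard)
    return [delete_url, add_url]


# Hypothetical names; substitute the leader replica of the real shard.
urls = replace_replica_urls(
    "http://localhost:8983/solr", "mycollection", "shard1", "core_node1")
```

The two requests would be issued in that order, once per shard whose
leader is out of date.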

I don't know what to look for in the logs either, but the first thing to
check for is any messages at WARN or ERROR logging levels.  These kinds
of messages should also show up in the admin UI logging tab, but
recovering the full text of those messages is much easier in the logfile
than in the admin UI.
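
A small script can pull those entries out of the logfile.  This is only
a sketch: it assumes each log entry starts with a yyyy-MM-dd timestamp
and that the level appears as a separate whitespace-delimited word,
which depends on your log4j pattern.  Lines that do not start with a
timestamp are treated as continuations (stack traces) of the entry
before them.

```python
import re
from typing import Iterable, Iterator

# Assumption: the level is a standalone word somewhere on the first line.
LEVEL_RE = re.compile(r"\s(WARN|ERROR)\s")
# Assumption: every new entry starts with a date such as 2018-05-02.
ENTRY_START_RE = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def warn_and_error_entries(lines: Iterable[str]) -> Iterator[str]:
    """Yield WARN/ERROR entries, including their continuation lines."""
    keep = False
    for line in lines:
        if ENTRY_START_RE.match(line):
            # A new entry begins; keep it only if it is WARN or ERROR.
            keep = bool(LEVEL_RE.search(line))
        if keep:
            yield line.rstrip("\n")
```

Passing an open file object for the Solr logfile to
warn_and_error_entries() and printing each result gives the full text
of every matching message.  Because the level match is loose, an entry
at another level whose message contains the word WARN or ERROR would
also be kept.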

Have you tried restarting the Solr instances after restoring the
collection?  This shouldn't be required, but at this point I'm hoping to
at least get you limping along, even if it requires steps that are
obvious indications of a bug.

Since you're running 6.6 and 6.x is in maintenance mode, it's not likely
that any bugs revealed will be fixed on 6.x, but maybe we can track it
down and see if it's still a problem in 7.x.  How much pain will it
cause you to get upgraded?

Also FYI:  Two ZooKeeper servers is actually LESS fault tolerant than
only having one.  Quorum requires a majority of the ensemble, which for
two servers means both of them, so if either server goes down, quorum
is lost.  You need at least three for fault tolerance.
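
The arithmetic behind that is short enough to sketch.  The function
names here are made up for the example; the only rule assumed is that
quorum is a strict majority of the configured ensemble.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


# Print the result for ensembles of one through five servers.
for size in range(1, 6):
    print(size, quorum_size(size), tolerated_failures(size))
```

One server and two servers both tolerate zero failures, but with two
there are twice as many machines whose failure takes the ensemble down.
Three servers tolerate one failure, and a fourth adds nothing over
three.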

Thanks,
Shawn
