Hello,

We recently upgraded our clusters from 7.1 to 7.2.1. One collection in 
particular (2 shards, 2 replicas) is in a bad state almost continuously. After 
a proper restart the cluster is all green, but within minutes the logs are 
flooded with many bad omens:

o.a.z.ClientCnxn Client session timed out, have not heard from server in 
22130ms (although zkClientTimeout is 30000).
o.a.s.c.Overseer could not read the data
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /overseer_elect/leader
o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper 
failed:org.apache.solr.common.cloud.ZooKeeperException: A ZK error has occurred
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to 
proxy request for url
etc etc etc
2018-01-26 16:43:31.419 WARN  
(OverseerAutoScalingTriggerThread-171411573518537026-logs4.gr.nl.openindex.io:8983_solr-n_0000001853)
 [   ] o.a.s.c.a.OverseerTriggerThread OverseerTriggerThread woken up but we 
are closed, exiting.

Soon most nodes are gone, maybe one is still green or yellow (recovering from 
another dead node).

A point of interest is that this collection is always under maximum load, 
receiving hundreds of queries per node per second. We disabled querying of the 
cluster and restarted it again. This time it kept running fine, and it 
continued to run fine even when we gradually ramped the query load back up to 
its normal level.

We have just reverted the modifications above, so the cluster now receives the 
full query load as soon as it is available. Everything was restarted and 
everything is suddenly fine again.

We really have no clue why everything is fine for days, and then we suddenly 
get into some weird state (logs loaded with o.a.z.ClientCnxn Client session 
timed out messages) where it takes several full restarts for things to settle 
down. Then all was fine until this afternoon, when for two hours the cluster 
kept dying almost instantly. At this moment all is well again, it seems. The 
only steady companion when things go bad is the timeouts related to ZK.
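
One note on the 22130ms figure in the ClientCnxn line above, since it looks odd 
next to a 30000 ms timeout. As far as I understand it, the ZooKeeper client 
gives up on the current server once it has heard nothing for about two thirds 
of the negotiated session timeout, not the full timeout. Below is a minimal 
sketch of that arithmetic. The 2/3 factor and the assumption that the server 
accepted our configured 30000 ms unchanged are my assumptions, and the names 
are made up for illustration:

```python
# Sketch only: the 2/3 rule is my understanding of the ZooKeeper client,
# and the negotiated timeout is assumed equal to the configured value.

def read_timeout_ms(negotiated_session_timeout_ms: int) -> int:
    """Idle time after which the client stops waiting on the current server."""
    return negotiated_session_timeout_ms * 2 // 3


ZK_CLIENT_TIMEOUT_MS = 30000  # our configured zkClientTimeout
OBSERVED_IDLE_MS = 22130      # taken from the ClientCnxn log line

limit = read_timeout_ms(ZK_CLIENT_TIMEOUT_MS)
print(limit)                     # 20000
print(OBSERVED_IDLE_MS > limit)  # True
```

If that is right, the message itself is consistent with the configuration, 
but it does not explain why the server went silent for over 20 seconds.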

Under normal circumstances we do not time out due to GC; the heap is just 2 
GB. Query response times are ~10 ms even under maximum load. We would like to 
know why and how the cluster enters a 'bad state' for no apparent reason. Any 
ideas?

Many thanks!
Markus

Side note: this cluster has always been a pain, but 7.2.1 made something worse. 
Reverting to 7.1 is not possible because the index is too new (there were no 
notes in CHANGES indicating an index incompatibility between these two minor 
versions).

