7.2.1 cluster dies within minutes after restart

Markus Jelsma Fri, 26 Jan 2018 09:03:29 -0800

Hello,

We recently upgraded our clusters from 7.1 to 7.2.1. One collection (2 shards,
2 replicas) in particular is in a bad state almost continuously. After a proper
restart the cluster is all green. Within minutes the logs are flooded with many
bad omens:

o.a.z.ClientCnxn Client session timed out, have not heard from server in
22130ms (although zkClientTimeout is 30000).
o.a.s.c.Overseer could not read the data
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /overseer_elect/leader
o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper
failed:org.apache.solr.common.cloud.ZooKeeperException: A ZK error has occurred
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying to
proxy request for url
etc etc etc
2018-01-26 16:43:31.419 WARN
(OverseerAutoScalingTriggerThread-171411573518537026-logs4.gr.nl.openindex.io:8983_solr-n_0000001853)
[ ] o.a.s.c.a.OverseerTriggerThread OverseerTriggerThread woken up but we
are closed, exiting.
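
A note on the first message above: 22130 ms of silence does not necessarily
contradict a configured 30000 ms ZooKeeper client timeout. To the best of my
understanding, the ZooKeeper Java client gives up on a connection once it has
heard nothing for two thirds of the negotiated session timeout. The sketch
below only works through that arithmetic. The function names are made up for
illustration, and it assumes the server actually negotiated the full 30000 ms
rather than a lower value.

```python
def zk_read_timeout_ms(session_timeout_ms: int) -> int:
    """Read timeout derived from the session timeout.

    Assumption: the client uses 2/3 of the negotiated session timeout
    as its read timeout (my understanding of ZooKeeper's ClientCnxn).
    """
    return session_timeout_ms * 2 // 3


def exceeds_read_timeout(idle_ms: int, session_timeout_ms: int) -> bool:
    """True if the observed silence is longer than the derived read timeout."""
    return idle_ms > zk_read_timeout_ms(session_timeout_ms)


if __name__ == "__main__":
    # Values taken from the log line: 22130 ms idle, 30000 ms configured.
    print(zk_read_timeout_ms(30000))
    print(exceeds_read_timeout(22130, 30000))
```

If that assumption holds, the client would report a timeout after roughly
20 seconds of silence, well before the full 30 seconds elapse. It says nothing
about why the server went silent in the first place.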

Soon most nodes are gone, maybe one is still green or yellow (recovering from
another dead node).

A point of interest is that this collection is always under maximum load,
receiving hundreds of queries per node per second. We disabled querying of the
cluster and restarted it again. This time it kept running fine, and it
continued to run fine even when we slowly ramped the query load back up to its
usual level.

We have just reverted the modifications above, so the cluster again receives
the full query load as soon as it is available. Everything was restarted and
everything is suddenly fine again.

We really have no clue why everything is fine for days, and then we suddenly
end up in some weird state (logs loaded with o.a.z.ClientCnxn Client session
timed out messages) where it takes several full restarts for things to settle
down. Then all was fine until this afternoon, when for two hours the cluster
kept dying almost instantly. At this moment all is well again, it seems. The
only steady companion when things go bad is the timeouts related to ZK.

Under normal circumstances we do not time out due to GC; the heap is just 2
GB. Query response times are ~10 ms even under maximum load. We would like
to know why and how it enters a 'bad state' for no apparent reason. Any ideas?

Many thanks!
Markus

Side note: this cluster has always been a pain, but 7.2.1 made something worse.
Reverting to 7.1 is not possible because the index is too new (there were no
notes in CHANGES indicating an index incompatibility between these two minor
versions).
