To clear things up: this has been resolved. The problem was in our custom analyzers, where we loaded dictionaries in the wrong method. If I remember correctly, we loaded them in createComponents (or the other one, I don't have the code here), so the dictionaries were being loaded per thread. Apparently, in some cases, Solr creates a lot of threads, causing our code to load the same dictionaries over and over.
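For anyone hitting the same thing, here is a minimal sketch of the pattern that avoids the repeated loads. It is not our actual analyzer (I don't have that code here); the class name, the plain word-list file and the CharArraySet/StopFilter pair are just stand-ins for whatever dictionary-backed filter you use. The point is only that the dictionary is loaded once in the constructor, while createComponents, which Lucene calls and caches per thread via the analyzer's reuse strategy, does no I/O at all.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class DictionaryAnalyzer extends Analyzer {

  // Loaded exactly once, when the analyzer is constructed.
  private final CharArraySet dictionary;

  public DictionaryAnalyzer(String dictionaryPath) {
    try {
      // One word per line; CharArraySet here stands in for a real dictionary.
      this.dictionary = new CharArraySet(
          Files.readAllLines(Paths.get(dictionaryPath)), true);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // No file I/O here: this method is called and its result cached per
    // thread, so any dictionary load placed here is repeated for every
    // new thread Solr creates.
    Tokenizer source = new StandardTokenizer();
    TokenStream sink = new StopFilter(source, dictionary);
    return new TokenStreamComponents(source, sink);
  }
}

If the analyzer is wired up through a factory, loading in the factory's inform(ResourceLoader) (ResourceLoaderAware) or behind a shared static cache should achieve the same thing across analyzer instances.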
Although we were at fault, I still don't understand why Solr sometimes needed to create so many threads at unpredictable times, be it just after start-up a few times, or hours or days after start-up, while our query load and document ingestion rate stayed the same.

Thanks,
Markus

-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Friday 26th January 2018 18:03
> To: Solr-user <solr-user@lucene.apache.org>
> Subject: 7.2.1 cluster dies within minutes after restart
>
> Hello,
>
> We recently upgraded our clusters from 7.1 to 7.2.1. One collection (2 shards,
> 2 replicas) specifically is in a bad state almost continuously. After a proper
> restart the cluster is all green. Within minutes the logs are flooded with
> many bad omens:
>
> o.a.z.ClientCnxn Client session timed out, have not heard from server in
> 22130ms (although zkClientTimeOut is 30000).
> o.a.s.c.Overseer could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
> = Session expired for /overseer_elect/leader
> o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper
> failed:org.apache.solr.common.cloud.ZooKeeperException: A ZK error has
> occurred
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying
> to proxy request for url
> etc etc etc
> 2018-01-26 16:43:31.419 WARN
> (OverseerAutoScalingTriggerThread-171411573518537026-logs4.gr.nl.openindex.io:8983_solr-n_0000001853)
> [ ] o.a.s.c.a.OverseerTriggerThread OverseerTriggerThread woken up but we
> are closed, exiting.
>
> Soon most nodes are gone; maybe one is still green or yellow (recovering from
> another dead node).
>
> A point of interest is that this collection is always under maximum load,
> receiving hundreds of queries per node per second. We disabled the querying
> of the cluster and restarted it again; this time it kept running fine, and it
> continued to run fine even when we slowly restarted the tons of queries that
> need to be fired.
>
> We just reverted the modifications above; the cluster now receives the full
> load of queries as soon as it is available, everything was restarted, and
> everything is suddenly fine again.
>
> We really have no clue why for days everything is fine, then we suddenly get
> into some weird state (flooded with o.a.z.ClientCnxn Client session timed out
> messages) and it takes several full restarts for things to settle down. Then
> all is fine until this afternoon, when for two hours the cluster kept dying
> almost instantly. And at this moment, all is well again, it seems. The only
> steady companion when things go bad is the timeouts related to ZK.
>
> Under normal circumstances, we do not time out due to GC; the heap is just 2
> GB. Query response times are ~10 ms even under maximum load. We would like to
> know why and how it enters a 'bad state' for no apparent reason. Any ideas?
>
> Many thanks!
> Markus
>
> Side note: This cluster has always been a pain, but 7.2.1 made something
> worse. Reverting to 7.1 is not possible because the index is too new (there
> were no notes in CHANGES indicating an index incompatibility between these
> two minor versions).