To clear things up, this has been resolved; the problem was in our custom 
analyzers, where we loaded dictionaries in the wrong method. If I remember 
correctly, we loaded them in createComponents (or the other one, don't have the 
code here), so the dictionaries were loaded per thread.

Apparently, in some cases, Solr creates a lot of threads, causing our code to 
load the same dictionaries over and over.
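
For completeness, here is a minimal sketch of the corrected pattern. It is not 
our actual analyzer: the class name is made up and the "dictionary" is just a 
word list read into a CharArraySet as a stand-in for our real dictionaries. The 
point is only where the loading happens, once in the constructor rather than in 
createComponents.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical example class, not our production analyzer.
public class DictionaryAnalyzer extends Analyzer {

  // Loaded once per Analyzer instance and shared by all indexing/query threads.
  private final CharArraySet dictionary;

  public DictionaryAnalyzer(Path dictionaryFile) throws IOException {
    // The expensive file I/O happens here, exactly once.
    this.dictionary = new CharArraySet(
        Files.readAllLines(dictionaryFile, StandardCharsets.UTF_8), true);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // Lucene calls this whenever it needs fresh per-thread components, so it
    // must stay cheap. Loading the dictionary here (our original mistake)
    // repeats the I/O for every thread that touches the analyzer.
    Tokenizer source = new StandardTokenizer();
    TokenStream filtered = new StopFilter(source, dictionary);
    return new TokenStreamComponents(source, filtered);
  }
}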

Although we were at fault, I still don't understand why Solr sometimes needed 
to create so many threads at unpredictable times, be it just after start-up or 
hours or days later, while our query load and document ingestion rate stayed 
the same.

Thanks,
Markus

 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Friday 26th January 2018 18:03
> To: Solr-user <solr-user@lucene.apache.org>
> Subject: 7.2.1 cluster dies within minutes after restart
> 
> Hello,
> 
> We recently upgraded our clusters from 7.1 to 7.2.1. One collection (2 shards, 
> 2 replicas) in particular is in a bad state almost continuously. After a proper 
> restart the cluster is all green, but within minutes the logs are flooded with 
> many bad omens:
> 
> o.a.z.ClientCnxn Client session timed out, have not heard from server in 
> 22130ms (although zkClientTimeOut is 30000).
> o.a.s.c.Overseer could not read the data
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /overseer_elect/leader
> o.a.s.c.c.DefaultConnectionStrategy Reconnect to ZooKeeper 
> failed:org.apache.solr.common.cloud.ZooKeeperException: A ZK error has 
> occurred
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error trying 
> to proxy request for url
> etc etc etc
> 2018-01-26 16:43:31.419 WARN  
> (OverseerAutoScalingTriggerThread-171411573518537026-logs4.gr.nl.openindex.io:8983_solr-n_0000001853)
>  [   ] o.a.s.c.a.OverseerTriggerThread OverseerTriggerThread woken up but we 
> are closed, exiting.
> 
> Soon most nodes are gone; maybe one is still green or yellow (recovering from 
> another dead node).
> 
> A point of interest is that this collection is always under maximum load, 
> receiving hundreds of queries per node per second. We disabled querying of the 
> cluster and restarted it again. This time it kept running fine, and it 
> continued to run fine even when we slowly restarted the tons of queries that 
> need to be fired.
> 
> We just reverted the modifications above, so the cluster receives the full load 
> of queries as soon as it is available. Everything was restarted, and everything 
> is suddenly fine again.
> 
> We really have no clue why everything is fine for days, then we suddenly enter 
> some weird state (flooded with o.a.z.ClientCnxn Client session timed out 
> messages) and it takes several full restarts for things to settle down. Then 
> all is fine until this afternoon, when for two hours the cluster kept dying 
> almost instantly. And at this moment, all is well again, it seems. The only 
> steady companions when things go bad are the ZK-related timeouts.
> 
> Under normal circumstances we do not time out due to GC; the heap is just 2 
> GB. Query response times are ~10 ms even under maximum load. We would like to 
> know why and how it enters this 'bad state' for no apparent reason. Any ideas?
> 
> Many thanks!
> Markus
> 
> side note: This cluster has always been a pain, but 7.2.1 made something 
> worse. Reverting to 7.1 is not possible because the index is too new (there 
> were no notes in CHANGES indicating an index incompatibility between these two 
> minor versions).
> 
> 
> 
