Hello, Dominique.
What did it log? Which exception was thrown? Have you had a chance to review a heap dump? What consumed the whole heap?
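
Since you plan to take a thread dump every minute, a small script may help to compare them quickly. The following is only a rough sketch, not a definitive tool: it assumes the dumps are plain-text files in the usual jstack layout (thread entries separated by blank lines, each with a `java.lang.Thread.State:` line). The function name `summarize_thread_dump` and the `marker` parameter are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Matches the state line that jstack prints for each Java thread,
# e.g. "java.lang.Thread.State: BLOCKED (on object monitor)".
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def summarize_thread_dump(dump_text, marker="log4j"):
    """Count threads per state, plus BLOCKED threads whose entry mentions `marker`.

    `marker` is an assumption: any substring expected in the stack frames
    of the threads of interest.
    """
    states = Counter()
    blocked_with_marker = 0
    # Assumes thread entries are separated by blank lines, as in jstack output.
    for block in re.split(r"\n\s*\n", dump_text):
        match = STATE_RE.search(block)
        if match is None:
            continue  # dump header, or an entry without a Java thread state
        state = match.group(1)
        states[state] += 1
        if state == "BLOCKED" and marker in block:
            blocked_with_marker += 1
    return states, blocked_with_marker
```

Running it over each per-minute dump (read with `open(path).read()`) and comparing the counts over time should show when the BLOCKED threads start to pile up relative to the first useless full GC.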

On Sun, May 17, 2020 at 11:05 AM Dominique Bejean <dominique.bej...@eolya.fr> wrote:

> Hi,
>
> I have a six-node SolrCloud cluster in which all six nodes suddenly failed
> with OOM at the same time.
> This can happen even when the SolrCloud is not under heavy load and there
> is no indexing.
>
> I do not see any reason for this to happen. Here is a description of the
> issue. Thank you for your suggestions and advice.
>
>
> One or two hours before the nodes stop with OOM, we see this scenario on
> all six nodes within the same five-minute time frame:
> * a little more young GC activity: from one per second (duration < 0.05 sec)
> to one every two or three seconds (duration < 0.15 sec)
> * full GCs start to occur every 5 sec with 0 bytes reclaimed
> * young GCs start to reclaim fewer bytes
> * long full GCs start to reclaim bytes, but fewer and fewer each time
> * then no more young GCs
> Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png
>
>
> Just before the problem occurs:
> * there are no more requests per second than usual
> * no update/commit/merge
> * CPU usage and load are low
> * disk I/O is low
> After the problem starts, requests take longer and longer, but there is
> still no increase in CPU usage or disk I/O.
>
>
> During the last issue, we dumped the threads on one node just before OOM
> but, unfortunately, more than one hour after the problem started.
> 85% of threads (more than 3000) are BLOCKED and related to log4j.
> Solr is either trying to log a slow query or trying to log problems in the
> request handler:
> at org.apache.solr.common.SolrException.log(SolrException.java:148)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:204)
>
> This high count of BLOCKED threads is more a consequence than a cause. We
> will dump threads every minute until the next issue.
>
>
> About the Solr environment:
> * Solr 6.6
> * Java Oracle 1.8.0_112 25.112-b15
>
> * 1 collection with 10 million small documents
> * 3 shards x 2 replicas
> * 3.5 million docs per core
> * 90 GB index size per core
>
> * Server with 6 processors and 90 GB of RAM
> * Swappiness set to 1, nearly no swap used
> * 4 GB heap, with usage roughly between 25% and 60% before young GC and
> one full GC (3 seconds) every 15 to 30 minutes when all is fine
>
> * Default JVM settings with CMS GC
> * JMX enabled
> * Average requests per second at peak on one core: 170, but during the
> last issue the average requests per second was 30!
> * Average time per request: < 30 ms
>
> About updates:
> * Very few adds/updates in general
> * Some deleteByQuery (nearly 2000 per day), but not before the problem
> occurs
> * autocommit maxTime: 15000 ms
>
> About queries:
> * Queries are standard queries or suggesters
> * Queries generate facets, but there are no fields with a very high number
> of unique values
> * No grouping
> * Heavy use of function queries for relevance computation
>
>
> Thank you.
>
> Dominique
>

--
Sincerely yours
Mikhail Khludnev