Hi,

A few months ago, I reported an issue with Solr nodes crashing because the old generation heap suddenly grows until it triggers an OOM. The problem occurred again this week. I have a thread dump for each minute of the 3 minutes during which the problem occurred, and I am using fastthread.io to analyse these dumps.
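(For reference, thread dumps like these can be captured with the JDK's jstack tool, one snapshot per minute; <solr-pid> below is a placeholder for the Solr process id:)

    jstack -l <solr-pid> > threaddump-$(date +%H%M).txt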
The thread scenario on the failing node is:

=== 15h54 -> it looks fine
Old gen heap: 0.5 GB (3 GB max)
67 threads TIMED_WAITING
26 threads RUNNABLE
7 threads WAITING

=== 15h55 -> fastthread.io reports a few suspects
Old gen heap starts growing: from 0.5 GB to 2 GB (3 GB max)
42 threads TIMED_WAITING
41 threads RUNNABLE
10 threads WAITING

The first symptom: 8 RUNNABLE threads are stuck (same stack trace), waiting for responses from some other nodes:

java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
    at org.apache.http.impl.io.SocketInputBuffer.isDataAvailable(SocketInputBuffer.java:95)
    at org.apache.http.impl.AbstractHttpClientConnection.isStale(AbstractHttpClientConnection.java:310)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.isStale(ManagedClientConnectionImpl.java:158)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:433)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
    at org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:447)
    at org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:388)
    at org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:302)
    at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:166)
    at org.apache.solr.handler.component.HttpShardHandler$$Lambda$192/1788637481.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/986729174.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
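(A side note on those stuck reads: with the default HttpShardHandlerFactory, the time an inter-node request can block reading a response is bounded by the shard handler's socketTimeout in solr.xml. A minimal sketch with placeholder values, not our actual settings:)

    <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
      <!-- read timeout in ms for inter-node requests; placeholder value -->
      <int name="socketTimeout">60000</int>
      <!-- connection timeout in ms; placeholder value -->
      <int name="connTimeout">15000</int>
    </shardHandlerFactory>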
=== 15h56 -> fastthread.io reports an issue
Old gen heap full: 3 GB (3 GB max)
57 threads TIMED_WAITING
126 threads RUNNABLE
18 threads WAITING
14 threads BLOCKED

7 RUNNABLE threads are still stuck (same stack trace), waiting for responses from some other nodes.
1 BLOCKED thread obtained org.apache.log4j.Logger's lock and did not release it; due to that, 13 threads are BLOCKED (same stack trace) on org.apache.log4j.Category.callAppenders:

java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.log4j.Category.callAppenders(Category.java:204)
    - waiting to lock <0x00000007005a6f08> (a org.apache.log4j.Logger)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:304)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2482)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:745)

=== 15h57 -> fastthread.io reports an issue
Old gen heap full: 3 GB (3 GB max)
43 threads TIMED_WAITING
250 threads RUNNABLE
110 threads WAITING
112 threads BLOCKED

18 RUNNABLE threads are still stuck (same stack trace), waiting for responses from some other nodes.

22 threads are BLOCKED (same stack trace):

java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.log4j.Category.callAppenders(Category.java:204)
    - waiting to lock <0x00000007005a4900> (a org.apache.log4j.Logger)

90 threads are BLOCKED (same stack trace):

java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:816)
    - waiting to lock <0x000000070087b910> (a org.apache.solr.util.stats.InstrumentedHttpClient)
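(A side note on the callAppenders contention: this is the synchronized-appender behaviour of log4j 1.x described in the fastthread article linked below. A commonly suggested workaround is to put an org.apache.log4j.AsyncAppender in front of the real appenders so request threads never log synchronously; note that AsyncAppender cannot be declared in log4j.properties, it requires the XML configuration. A minimal log4j.xml sketch, where "file" is a placeholder for the appender actually defined in the configuration:)

    <appender name="async" class="org.apache.log4j.AsyncAppender">
      <!-- buffer up to 512 events before the buffer-full policy applies -->
      <param name="BufferSize" value="512"/>
      <!-- false = drop events when the buffer is full instead of blocking callers -->
      <param name="Blocking" value="false"/>
      <!-- "file" is a placeholder; reference the appender(s) defined elsewhere -->
      <appender-ref ref="file"/>
    </appender>

The trade-off is that log events can be dropped under pressure, which may be acceptable compared to hundreds of blocked request threads.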
=== 15h58 -> Young gen heap full -> OOM

The global scenario (6 Solr nodes) is:

1/ The problem starts with very slow queries (30 seconds to 2 minutes) on all nodes.
2/ On the failing node:
* Some threads are stuck in java.net.SocketInputStream.socketRead0(Native Method), waiting for responses from other nodes
* The log4j issue starts (https://blog.fastthread.io/2020/01/24/log4j-bug-slows-down-your-app/)
* The old generation heap fills up in a few seconds
* More and more threads handling incoming requests are blocked
* The young generation heap fills up -> OOM

OK, the slow queries have to be fixed. But even if some nodes stop responding to a given node, that node should not end up crashing because of this log4j issue.

Until the slow queries are fixed, any suggestion on how to avoid this crash?

Regards.

Dominique
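PS: for the next occurrence, a heap dump taken at the moment of the OOM would show what is actually filling the old generation. The standard HotSpot flags for that, which can be appended to SOLR_OPTS in solr.in.sh (the dump path below is a placeholder):

    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/logs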