We run Solr 8.3.1 in prod without any problems, but we're having issues trying to upgrade.
I've created an 8.9.0 leader and follower, imported our live data into it, and am testing it by replaying requests made to prod. We're seeing a big problem: fairly moderate request rates make the instance so slow that it fails its healthcheck.

The logs show a lot of errors around creating threads:

    solr[4507]: [124136.511s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 256k, guardsize: 0k, detached.
    WARN (qtp178604517-3891) [ ] o.e.j.i.ManagedSelector => java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

So I monitored the thread count for the process while running the test suite and saw a persistent pattern: threads increase until they max out, the logs flood with errors as it tries to create still more threads, and the instance slows down until it is terminated as unhealthy.

DefaultTasksMax is set to 4915. I've tried raising and lowering it, but regardless of the value the result is the same: the thread count hits the limit and everything slows down.

Is there anything I can do to stop Solr spinning up so many threads that it ceases to function?

There have been a few test passes where it spontaneously dropped the thread count from thousands to hundreds and stayed up longer, but there seems to be no pattern to when this happens.

Running the same tests on 8.3.1 results in a much slower increase in threads, and it never quite maxes them out, so things continue to function.
See below for the thread count and healthcheck times seen on a (fairly harsh) test run of 100 requests/sec.

Thanks
Dominic

Threadcount:

ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; ps -eLF | grep 'start.jar' | wc -l; sleep 10s; done
Tue Oct 12 14:27:33 UTC 2021
52
Tue Oct 12 14:27:43 UTC 2021
52
Tue Oct 12 14:27:54 UTC 2021
52
Tue Oct 12 14:28:04 UTC 2021
52
Tue Oct 12 14:28:14 UTC 2021
569
Tue Oct 12 14:28:24 UTC 2021
899
Tue Oct 12 14:28:34 UTC 2021
1198
Tue Oct 12 14:28:44 UTC 2021
1589
Tue Oct 12 14:28:54 UTC 2021
2016
Tue Oct 12 14:29:05 UTC 2021
2451
Tue Oct 12 14:29:15 UTC 2021
2851
Tue Oct 12 14:29:26 UTC 2021
2934
Tue Oct 12 14:29:36 UTC 2021
3249
Tue Oct 12 14:29:46 UTC 2021
3501
Tue Oct 12 14:29:57 UTC 2021
3734
Tue Oct 12 14:30:07 UTC 2021
4128
Tue Oct 12 14:30:18 UTC 2021
4374
Tue Oct 12 14:30:29 UTC 2021
4637
Tue Oct 12 14:30:39 UTC 2021
4693
Tue Oct 12 14:30:50 UTC 2021
4807
Tue Oct 12 14:31:01 UTC 2021
4916
Tue Oct 12 14:31:11 UTC 2021
4916
Tue Oct 12 14:31:22 UTC 2021
Connection to 10.40.22.166 closed by remote host.
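One caveat on the ps pipeline above: grep can match its own command line, so the count may be one too high (a reading of 4916 against a limit of 4915 would be consistent with that). A minimal sketch of an alternative, assuming a Linux /proc filesystem and a known Solr PID; the function name is hypothetical, and PID 4507 is only taken from the log prefix above and will differ between runs:

```shell
# Print the kernel's thread count for a process, read from a
# /proc/<pid>/status style file (the "Threads:" field).
thread_count_from_status() {
    awk '/^Threads:/ { print $2 }' "$1"
}

# Example loop (PID 4507 is an assumption taken from the log prefix):
# while true; do date; thread_count_from_status /proc/4507/status; sleep 10; done
```

Reading the count from the kernel avoids matching on command-line text, so unrelated processes that mention start.jar are not included.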
Healthcheck:

ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; curl -v localhost:8983/solr/ 2>&1 | grep HTTP; date; echo '----'; sleep 10s; done
Tue Oct 12 14:27:34 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:27:34 UTC 2021
----
Tue Oct 12 14:27:44 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:27:44 UTC 2021
----
Tue Oct 12 14:27:54 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:27:54 UTC 2021
----
Tue Oct 12 14:28:04 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:28:04 UTC 2021
----
Tue Oct 12 14:28:14 UTC 2021
> GET /solr/ HTTP/1.1
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0< HTTP/1.1 200 OK
Tue Oct 12 14:28:16 UTC 2021
----
Tue Oct 12 14:28:26 UTC 2021
> GET /solr/ HTTP/1.1
0 0 0 0 0 0 0 0 --:--:-- 0:00:12 --:--:-- 0< HTTP/1.1 200 OK
Tue Oct 12 14:28:39 UTC 2021
----
Tue Oct 12 14:28:49 UTC 2021
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0> GET /solr/ HTTP/1.1
0 0 0 0 0 0 0 0 --:--:-- 0:00:23 --:--:-- 0< HTTP/1.1 200 OK
Tue Oct 12 14:29:13 UTC 2021
----
Tue Oct 12 14:29:23 UTC 2021
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:29:25 UTC 2021
----
Tue Oct 12 14:29:35 UTC 2021
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0> GET /solr/ HTTP/1.1
0 0 0 0 0 0 0 0 --:--:-- 0:00:09 --:--:-- 0< HTTP/1.1 200 OK
Tue Oct 12 14:29:44 UTC 2021
----
Tue Oct 12 14:29:54 UTC 2021
> GET /solr/ HTTP/1.1
0 0 0 0 0 0 0 0 --:--:-- 0:00:11 --:--:-- 0< HTTP/1.1 200 OK
Tue Oct 12 14:30:06 UTC 2021
----
Tue Oct 12 14:30:16 UTC 2021
> GET /solr/ HTTP/1.1
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0< HTTP/1.1 200 OK
Tue Oct 12 14:30:20 UTC 2021
----
Tue Oct 12 14:30:30 UTC 2021
> GET /solr/ HTTP/1.1
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0< HTTP/1.1 200 OK
Tue Oct 12 14:30:33 UTC 2021
----
Tue Oct 12 14:30:43 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:30:43 UTC 2021
----
Tue Oct 12 14:30:53 UTC 2021
> GET /solr/ HTTP/1.1
Tue Oct 12 14:30:55 UTC 2021
----
Tue Oct 12 14:31:05 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:31:05 UTC 2021
----
Tue Oct 12 14:31:15 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:31:15 UTC 2021
----
Connection to 10.40.22.166 closed by remote host.
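The healthcheck output above mixes curl's progress meter into the grep results, which makes the response times hard to read. A rough sketch of a cleaner probe, using curl's --write-out variables http_code and time_total; the function names and the 5-second threshold in the usage line are assumptions for illustration, not what our real healthcheck uses:

```shell
# Probe a URL once and print "<http_code> <total_seconds>".
# --max-time bounds the wait; the default of 30 seconds is an assumed value.
probe() {
    curl -s -o /dev/null --max-time "${2:-30}" -w '%{http_code} %{time_total}\n' "$1"
}

# Succeed only if the status is 200 and the elapsed time is under the limit (seconds).
is_healthy() {
    awk -v code="$1" -v t="$2" -v limit="$3" \
        'BEGIN { exit !(code == 200 && t + 0 < limit + 0) }'
}
```

Possible usage: `set -- $(probe http://localhost:8983/solr/); is_healthy "$1" "$2" 5 && echo ok || echo slow`.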
