The Solr admin UI has a thread dump page. You can use it to see what all
those threads are doing and where they are getting stuck. You can also
post parts of the thread dump back to this email thread.
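
If the admin UI is too slow to respond while the instance is struggling, a
dump taken with the JDK's jstack tool can be summarised from the shell. The
following is only a rough sketch, assuming a jstack-format dump on stdin;
the function names are made up for illustration, and the trick of grouping
threads by stripping trailing digits from their names is my assumption
about how the pools are named, not something Solr guarantees.

```shell
# Hypothetical helpers for summarising a jstack-format thread dump (stdin).

# Count threads per pool. The pool name is approximated by taking the quoted
# thread name and stripping trailing digits and hyphens, so a thread named
# "qtp178604517-3891" is counted under "qtp".
summarize_thread_pools() {
  grep '^"' \
    | sed -E 's/^"([^"]*)".*/\1/; s/[-0-9]+$//' \
    | sort | uniq -c | sort -rn \
    | sed -E 's/^ +//'
}

# Count threads per Java thread state (RUNNABLE, WAITING, TIMED_WAITING, ...).
summarize_thread_states() {
  grep 'java.lang.Thread.State:' \
    | awk '{print $2}' \
    | sort | uniq -c | sort -rn \
    | sed -E 's/^ +//'
}

# Example usage (assumes SOLR_PID holds the PID of the Solr JVM and that
# jstack runs as the same user as Solr):
#   jstack "$SOLR_PID" > /tmp/solr-threads.txt
#   summarize_thread_pools  < /tmp/solr-threads.txt
#   summarize_thread_states < /tmp/solr-threads.txt
```

Whichever pool dominates the first summary is the one worth posting stack
traces for, and the second summary shows whether those threads are mostly
running or mostly parked.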



Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 12, 2021 at 11:15 AM Dominic Humphries
<[email protected]> wrote:

> We run 8.3.1 in prod without any problems, but we're having issues with
> trying to upgrade.
>
> I've created an 8.9.0 leader & follower, imported our live data into it,
> and am testing it via replaying requests made to prod. We're seeing a big
> problem where fairly moderate request rates are causing the instance to
> become so slow it fails healthcheck. The logs showed a lot of errors around
> creating threads:
>
> solr[4507]: [124136.511s][warning][os,thread] Failed to start thread -
> pthread_create failed (EAGAIN) for attributes: stacksize: 256k, guardsize:
> 0k, detached.
>
> WARN  (qtp178604517-3891) [   ] o.e.j.i.ManagedSelector  =>
> java.lang.OutOfMemoryError: unable to create native thread: possibly out of
> memory or process/resource limits reached
>
> So I monitored thread count for the process whilst running the test suite
> and saw a persistent pattern: Threads increased until maxed out, the logs
> flooded with errors as it tried to create still more threads, and the
> instance slowed down until terminated as unhealthy.
>
> The DefaultTasksMax is set to 4915. I've tried raising and lowering it,
> but regardless of value the result is the same: it gets maxed and
> everything slows down.
>
> Is there anything I can do to stop solr spinning up so many threads it
> ceases to function? There have been a few test passes where it
> spontaneously dropped threadcount from thousands to hundreds and stayed up
> longer, but there seems no pattern to when this happens. Running the same
> tests on 8.3.1 results in a much slower increase in threads and it never
> quite maxes them so things continue to function.
>
> See below for the thread count and healthcheck times seen on a (fairly
> harsh) test run of 100 requests/sec
>
> Thanks
>
> Dominic
>
>
> Threadcount:
>
> ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; ps -eLF | grep 'start.jar' | wc -l; sleep 10s; done
> Tue Oct 12 14:27:33 UTC 2021
> 52
> Tue Oct 12 14:27:43 UTC 2021
> 52
> Tue Oct 12 14:27:54 UTC 2021
> 52
> Tue Oct 12 14:28:04 UTC 2021
> 52
> Tue Oct 12 14:28:14 UTC 2021
> 569
> Tue Oct 12 14:28:24 UTC 2021
> 899
> Tue Oct 12 14:28:34 UTC 2021
> 1198
> Tue Oct 12 14:28:44 UTC 2021
> 1589
> Tue Oct 12 14:28:54 UTC 2021
> 2016
> Tue Oct 12 14:29:05 UTC 2021
> 2451
> Tue Oct 12 14:29:15 UTC 2021
> 2851
> Tue Oct 12 14:29:26 UTC 2021
> 2934
> Tue Oct 12 14:29:36 UTC 2021
> 3249
> Tue Oct 12 14:29:46 UTC 2021
> 3501
> Tue Oct 12 14:29:57 UTC 2021
> 3734
> Tue Oct 12 14:30:07 UTC 2021
> 4128
> Tue Oct 12 14:30:18 UTC 2021
> 4374
> Tue Oct 12 14:30:29 UTC 2021
> 4637
> Tue Oct 12 14:30:39 UTC 2021
> 4693
> Tue Oct 12 14:30:50 UTC 2021
> 4807
> Tue Oct 12 14:31:01 UTC 2021
> 4916
> Tue Oct 12 14:31:11 UTC 2021
> 4916
> Tue Oct 12 14:31:22 UTC 2021
> Connection to 10.40.22.166 closed by remote host.
>
>
> Healthcheck:
>
> ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; curl -v localhost:8983/solr/ 2>&1 | grep HTTP; date; echo '----'; sleep 10s; done
> Tue Oct 12 14:27:34 UTC 2021
> > GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:27:34 UTC 2021
> ----
> Tue Oct 12 14:27:44 UTC 2021
> > GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:27:44 UTC 2021
> ----
> Tue Oct 12 14:27:54 UTC 2021
> > GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:27:54 UTC 2021
> ----
> Tue Oct 12 14:28:04 UTC 2021
> > GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:28:04 UTC 2021
> ----
> Tue Oct 12 14:28:14 UTC 2021
> > GET /solr/ HTTP/1.1
>   0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--
>   0< HTTP/1.1 200 OK
> Tue Oct 12 14:28:16 UTC 2021
> ----
> Tue Oct 12 14:28:26 UTC 2021
> > GET /solr/ HTTP/1.1
>   0     0    0     0    0     0      0      0 --:--:--  0:00:12 --:--:--
>   0< HTTP/1.1 200 OK
> Tue Oct 12 14:28:39 UTC 2021
> ----
> Tue Oct 12 14:28:49 UTC 2021
>   0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--
>   0> GET /solr/ HTTP/1.1
>   0     0    0     0    0     0      0      0 --:--:--  0:00:23 --:--:--
>   0< HTTP/1.1 200 OK
> Tue Oct 12 14:29:13 UTC 2021
> ----
> Tue Oct 12 14:29:23 UTC 2021
>   0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--
>   0> GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:29:25 UTC 2021
> ----
> Tue Oct 12 14:29:35 UTC 2021
>   0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--
>   0> GET /solr/ HTTP/1.1
>   0     0    0     0    0     0      0      0 --:--:--  0:00:09 --:--:--
>   0< HTTP/1.1 200 OK
> Tue Oct 12 14:29:44 UTC 2021
> ----
> Tue Oct 12 14:29:54 UTC 2021
> > GET /solr/ HTTP/1.1
>   0     0    0     0    0     0      0      0 --:--:--  0:00:11 --:--:--
>   0< HTTP/1.1 200 OK
> Tue Oct 12 14:30:06 UTC 2021
> ----
> Tue Oct 12 14:30:16 UTC 2021
> > GET /solr/ HTTP/1.1
>   0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--
>   0< HTTP/1.1 200 OK
> Tue Oct 12 14:30:20 UTC 2021
> ----
> Tue Oct 12 14:30:30 UTC 2021
> > GET /solr/ HTTP/1.1
>   0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--
>   0< HTTP/1.1 200 OK
> Tue Oct 12 14:30:33 UTC 2021
> ----
> Tue Oct 12 14:30:43 UTC 2021
> > GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:30:43 UTC 2021
> ----
> Tue Oct 12 14:30:53 UTC 2021
> > GET /solr/ HTTP/1.1
> Tue Oct 12 14:30:55 UTC 2021
> ----
> Tue Oct 12 14:31:05 UTC 2021
> > GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:31:05 UTC 2021
> ----
> Tue Oct 12 14:31:15 UTC 2021
> > GET /solr/ HTTP/1.1
> < HTTP/1.1 200 OK
> Tue Oct 12 14:31:15 UTC 2021
> ----
> Connection to 10.40.22.166 closed by remote host.
>
