This is a tricky one. I have made some progress: we reproduced the deadlock in an internal test environment and captured thread dumps confirming that all AsyncTracker Semaphore permits are depleted and that several threads are blocked in acquire().
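
For anyone following along, here is a minimal sketch of the general failure mode, assuming a tracker that takes one semaphore permit per in-flight request. It is not the actual Http2SolrClient.AsyncTracker code (which is asynchronous and listener-based); the class and method names (PermitLeakSketch, Tracker, sendLeaky, sendSafe, permitsLeft) are hypothetical and only show how a permit that is not returned on a failure path eventually leaves every caller blocked in acquire().

```java
import java.util.concurrent.Semaphore;

/** Hypothetical illustration only; not the actual Solr AsyncTracker implementation. */
public class PermitLeakSketch {

    /** Hypothetical tracker: one permit per in-flight request. */
    static final class Tracker {
        private final Semaphore permits;

        Tracker(int maxOutstanding) {
            this.permits = new Semaphore(maxOutstanding);
        }

        /** Leaky variant: the permit is returned only on the success path. */
        void sendLeaky(Runnable request) throws InterruptedException {
            permits.acquire();
            request.run();      // if this throws, the release() below is skipped
            permits.release();
        }

        /** Safe variant: the permit is returned on every path. */
        void sendSafe(Runnable request) throws InterruptedException {
            permits.acquire();
            try {
                request.run();
            } finally {
                permits.release();
            }
        }

        int available() {
            return permits.availablePermits();
        }
    }

    /** Runs the given number of failing requests and returns the permits left afterwards. */
    public static int permitsLeft(int maxOutstanding, int failures, boolean leaky)
            throws InterruptedException {
        Tracker tracker = new Tracker(maxOutstanding);
        Runnable failing = () -> {
            throw new IllegalStateException("simulated request failure");
        };
        for (int i = 0; i < failures; i++) {
            try {
                if (leaky) {
                    tracker.sendLeaky(failing);
                } else {
                    tracker.sendSafe(failing);
                }
            } catch (IllegalStateException expected) {
                // The caller sees the failure; whether the permit came back depends on the variant.
            }
        }
        return tracker.available();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("leaky: " + permitsLeft(3, 3, true));
        System.out.println("safe:  " + permitsLeft(3, 3, false));
    }
}
```

With 3 permits and 3 failed requests, the leaky variant ends with 0 permits available, so the next acquire() would block indefinitely, while the safe variant still has all 3. Scaled up to 1000 permits, the same pattern would need many leaked permits over time, which is the part that is still unexplained.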
I thought it was time to open a JIRA, so here it is:
https://issues.apache.org/jira/browse/SOLR-18174

I have made a (somewhat artificial) JUnit test that reproduces the deadlock state, which occurs on request failure. But I have still not managed to explain how most of the 1000 permits could have leaked over time.

If anyone has anything to add to this case, please answer here or in the JIRA/PR.

Jan

> On 18 March 2026 at 15:31, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
> <[email protected]> wrote:
>
> My immediate thought was HTTP/2 but I see you are running with
> HTTP/1 (although interestingly some changes made for the sake of
> HTTP/2 may have contributed since there is shared code). I am
> sort of skeptical of the third finding. The application of the
> idle timeout as the default request timeout isn't *that* old.
> I remember researching this because of an issue with the index
> fetcher (which incidentally should *not* have this behavior)
>
> https://issues.apache.org/jira/browse/SOLR-17711
>
> The thought of a bunch of requests trickling little bits of
> data for arbitrarily long, just enough to reset idle timeout
> seems unlikely at first blush.
>
>
> From: [email protected] At: 03/18/26 07:39:49 UTC-4:00
> To: [email protected]
> Subject: Deadlock observed for distributed search in Solr 9.10.1
>
> We recently upgraded some Solr clusters from version 9.7 to 9.10.1.
> Collections have multiple shards and run distributed requests
> continuously. After a few days, distributed requests would start timing
> out and all clients would fail, requiring a full Solr cluster restart to
> recover. No sign of overload. Downgrading back to Solr 9.7 fixed the
> issues. This has been observed in several different environments.
>
> Has anyone else seen similar behavior in your own clusters?
>
> As there are no errors in the Solr logs, no exceptions, no high load and
> no scary Grafana graphs for GC or otherwise, we have spent several days
> investigating and trying to reproduce, with limited luck.
>
> The best I have is an LLM analysis of the issue and a theory of what might
> cause it. I think the analysis is interesting, and the suspect is leaking
> semaphore permits in Http2SolrClient.AsyncTracker, which would eventually
> cause a full stop.
>
> The analysis is here https://cwiki.apache.org/confluence/x/AZM8G - it
> contains a description, executive summary, tech details and some questions
> for committers. You may comment inline in Confluence if you have an
> account, or here in this thread.
>
> I have not yet filed a bug in JIRA, as I want to discuss here and still
> hope to reproduce the issue in a pristine environment.
>
> Jan
