Hi Justin I realize that my fix is not the best, I just don't know the codebase well enough to do a better job. That said, the methods that I've marked as synchronized (MQTTConnectionManager connect and disconnect) already synchronize themselves on the session state, but the disconnect method synchronizes on the sessionState, which might get swapped before it has a chance to call (synchronized) MQTTSession.stop (which in turn, calls the synchronized MQTTSession.clear but on a different instance of the session). On top of that, since a freshly created MQTTSession uses a global MQTTSessionState to begin with (and some methods synchronizes over the state), I think this causes more (unintended) contention and may be another source of deadlocks.
I've looked at your PR and, unfortunately, the issue still remains as far as I can tell. I will be happy to help if you need anything and I will do my best to create a testcase that can at least sometimes reproduce the problem (which unfortunately depends all too much on the thread scheduling order). Thanks for the response, takeshi On Thu, Jan 20, 2022 at 4:09 PM Justin Bertram <jbert...@apache.org> wrote: > Making a method synchronized typically isn't the preferred solution to > concurrency issues due to the performance penalty it typically imposes. > Generally targeted locks are better. However, concurrency issues are often > quite difficult to diagnose and resolve so sometimes a less-than-ideal > solution is better than broken code. > > I currently have a large PR [1] for MQTT 5 support that is yet to be > merged. Once that is merged I'll take a closer look at the deadlock issue > from ARTEMIS-3622 that you reported. > > > Justin > > [1] https://github.com/apache/activemq-artemis/pull/3907 > > On Thu, Jan 20, 2022 at 11:08 AM Takeshi Fukushima <takesh...@gmail.com> > wrote: > > > Hello everyone > > > > Last month I've opened up a ticket regarding a deadlock that was > occurring > > on one of our servers. > > https://issues.apache.org/jira/browse/ARTEMIS-3622 > > > > After much trial and error, I think I've narrowed it down the cause and > > it's in the comments, but being a racing condition I cannot reliably > > reproduce in a test case. I can reproduce manually (carefully advancing > the > > threads in the debugger). > > > > I also did patch our server with the quickfix indicated in the comments > and > > I will let you know if that fixes the problem (so far so good, but its > been > > only about 18 hours). If everything goes well, I will create a PR but I > > would like to know if a simple solution like that (marking a couple of > > methods as synchronized) would be acceptable, even though I think it > > doesn't fixes all the cases (that one would require much in depth > changes I > > think). > > > > Thanks for the great software anyway, > > takeshi > > > > -- > > http://mapsdev.blogspot.com/ > > Marcelo Takeshi Fukushima > > > -- http://mapsdev.blogspot.com/ Marcelo Takeshi Fukushima