Hi all I've managed to write a test that somewhat reliably reproduces the error. It also, sometimes, throws a NPE inside MQTTConnectionManager.connect (and I think they have the same source, a racing condition on MQTTSession.state) It cannot be merged into anything serious because it has some number of Thread.sleep( ) calls to interleave calls in just the right times to cause the deadlock. If there's any interest, I'd be happy to push it to somewhere public so its easier to try to fix the error. For what's worth, our patched version has been running with my 'fix' for about a week without any issue (it was happening anywhere between 1 and 5 times a week).
Again, thanks for the great software takeshi On Thu, Jan 20, 2022 at 5:58 PM Takeshi Fukushima <takesh...@gmail.com> wrote: > Oh I understand that you didn't specifically target the issue. I was just > hoping that it was a 'collateral damage' of a large refactor =P. > > I also agree 100% with your comment about removing synchronization and I > double that if we are talking about global variables. > > Thanks for hearing me out, > takeshi > > On Thu, Jan 20, 2022 at 5:04 PM Justin Bertram <jbert...@apache.org> > wrote: > >> To be clear, I didn't expect my PR to resolve the issue. My point simply >> was that it's a large change, and I don't want to have to resolve the >> conflicts that may arise from fixing the deadlock before the PR is merged. >> >> In my opinion, all the synchronization around the session state is not >> ideal. I'd like to find a way to refactor it completely to employ >> something >> more elegant. >> >> >> Justin >> >> On Thu, Jan 20, 2022 at 1:32 PM Takeshi Fukushima <takesh...@gmail.com> >> wrote: >> >> > Hi Justin >> > >> > I realize that my fix is not the best, I just don't know the codebase >> well >> > enough to do a better job. That said, the methods that I've marked as >> > synchronized (MQTTConnectionManager connect and disconnect) already >> > synchronize themselves on the session state, but the disconnect method >> > synchronizes on the sessionState, which might get swapped before it has >> a >> > chance to call (synchronized) MQTTSession.stop (which in turn, calls the >> > synchronized MQTTSession.clear but on a different instance of the >> session). >> > On top of that, since a freshly created MQTTSession uses a global >> > MQTTSessionState to begin with (and some methods synchronizes over the >> > state), I think this causes more (unintended) contention and may be >> another >> > source of deadlocks. >> > >> > I've looked at your PR and, unfortunately, the issue still remains as >> far >> > as I can tell. I will be happy to help if you need anything and I will >> do >> > my best to create a testcase that can at least sometimes reproduce the >> > problem (which unfortunately depends all too much on the thread >> scheduling >> > order). >> > >> > Thanks for the response, >> > takeshi >> > >> > On Thu, Jan 20, 2022 at 4:09 PM Justin Bertram <jbert...@apache.org> >> > wrote: >> > >> > > Making a method synchronized typically isn't the preferred solution to >> > > concurrency issues due to the performance penalty it typically >> imposes. >> > > Generally targeted locks are better. However, concurrency issues are >> > often >> > > quite difficult to diagnose and resolve so sometimes a less-than-ideal >> > > solution is better than broken code. >> > > >> > > I currently have a large PR [1] for MQTT 5 support that is yet to be >> > > merged. Once that is merged I'll take a closer look at the deadlock >> issue >> > > from ARTEMIS-3622 that you reported. >> > > >> > > >> > > Justin >> > > >> > > [1] https://github.com/apache/activemq-artemis/pull/3907 >> > > >> > > On Thu, Jan 20, 2022 at 11:08 AM Takeshi Fukushima < >> takesh...@gmail.com> >> > > wrote: >> > > >> > > > Hello everyone >> > > > >> > > > Last month I've opened up a ticket regarding a deadlock that was >> > > occurring >> > > > on one of our servers. >> > > > https://issues.apache.org/jira/browse/ARTEMIS-3622 >> > > > >> > > > After much trial and error, I think I've narrowed it down the cause >> and >> > > > it's in the comments, but being a racing condition I cannot reliably >> > > > reproduce in a test case. I can reproduce manually (carefully >> advancing >> > > the >> > > > threads in the debugger). >> > > > >> > > > I also did patch our server with the quickfix indicated in the >> comments >> > > and >> > > > I will let you know if that fixes the problem (so far so good, but >> its >> > > been >> > > > only about 18 hours). If everything goes well, I will create a PR >> but I >> > > > would like to know if a simple solution like that (marking a couple >> of >> > > > methods as synchronized) would be acceptable, even though I think it >> > > > doesn't fixes all the cases (that one would require much in depth >> > > changes I >> > > > think). >> > > > >> > > > Thanks for the great software anyway, >> > > > takeshi >> > > > >> > > > -- >> > > > http://mapsdev.blogspot.com/ >> > > > Marcelo Takeshi Fukushima >> > > > >> > > >> > >> > >> > -- >> > http://mapsdev.blogspot.com/ >> > Marcelo Takeshi Fukushima >> > >> > > > -- > http://mapsdev.blogspot.com/ > Marcelo Takeshi Fukushima > -- http://mapsdev.blogspot.com/ Marcelo Takeshi Fukushima