[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

badlink andrea.9...@gmail.com changed:

           What    |Removed                     |Added
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #35 from badlink andrea.9...@gmail.com ---
I took my time and started digging through the druntime source to find the problem. I discovered that core.thread.suspend() sends SIGUSR1 to the thread it wants to suspend, so I launched GDB and tried to catch the signal, but GDB never caught it. I couldn't explain that behavior, so I googled "linux sigusr1 not sent" and bam! https://bbs.archlinux.org/viewtopic.php?id=181142

The people there had already figured out that it's a bug in the GNOME Display Manager, which blocks SIGUSR1 for all child applications! The bug is present in the package gdm-3.12.2-1, which I am currently using, and it is what causes this nasty deadlock in the D garbage collector.

As a quick test I ran the test case in a tty and it worked as expected. Hopefully the problem will solve itself in the next gdm update.
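For anyone who wants to check whether they are hit by the same thing: the set of blocked signals is inherited across fork and exec, so a program launched from a parent that blocks SIGUSR1 starts with SIGUSR1 blocked as well. The C sketch below is not druntime code; the helper names signal_is_blocked and unblock_signal are made up for illustration. It only queries the mask and, if wanted, removes one signal from it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `sig` is blocked in the calling
 * thread's signal mask, 0 if it is not, and -1 on error. */
int signal_is_blocked(int sig)
{
    sigset_t current;

    /* With a NULL `set` argument the mask is only queried, not changed. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, sig);
}

/* Hypothetical helper: removes `sig` from the calling thread's mask.
 * Returns 0 on success, -1 on error. */
int unblock_signal(int sig)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, sig);
    return sigprocmask(SIG_UNBLOCK, &set, NULL);
}
```

Calling signal_is_blocked(SIGUSR1) at the very top of main(), before any thread is started, would be expected to report 1 in an affected desktop session and 0 in a tty. Unblocking only helps the calling process; it does not fix the parent.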
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #34 from badlink andrea.9...@gmail.com ---
(In reply to Sean Kelly from comment #33)
> The GC was in use for probably 5 years without a reported deadlock. Though history isn't exactly proof. I don't suppose someone wants to regress this and find the offending release? Isn't there a D tool for this? Or would we be stuck with git bisect?

I have just tried a few old versions:
- 2.054 (self-compiled): deadlocks
- 2.042 (self-compiled): deadlocks
- 1.076 (http://dlang.org/download.html): fullCollect never returns
- 1.030 (http://dlang.org/download.html): fullCollect never returns

Code used for D1: http://pastebin.com/wu9guHA6
Stack trace (DMD v1.030): http://pastebin.com/CfNStRvm

Does anyone else have this issue? Right now I'm on Arch Linux 3.14.19-1-lts x86_64, GNU libc 2.20.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #32 from Martin Nowak c...@dawg.eu ---
Can we first confirm that this is a regression?
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #33 from Sean Kelly s...@invisibleduck.org ---
The GC was in use for probably 5 years without a reported deadlock. Though history isn't exactly proof. I don't suppose someone wants to regress this and find the offending release? Isn't there a D tool for this? Or would we be stuck with git bisect?
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #31 from Sobirari Muhomori dfj1es...@sneakemail.com ---
Hmm... the stack trace in issue 11806 is quite different.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #29 from badlink andrea.9...@gmail.com ---
Also present in DMD 2.067.0-b1.
Stack trace of the sample program in comment 10: http://pastebin.com/4mudSeEX
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

badlink andrea.9...@gmail.com changed:

           What    |Removed                     |Added
                 CC|                            |christ...@nerdtools.de

--- Comment #30 from badlink andrea.9...@gmail.com ---
*** Issue 11806 has been marked as a duplicate of this issue. ***
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

Brad Roberts bra...@puremagic.com changed:

           What    |Removed                     |Added
                 CC|                            |bra...@puremagic.com

--- Comment #28 from Brad Roberts bra...@puremagic.com ---
Might not be related, but for reference: bug 13416.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #27 from Sean Kelly s...@invisibleduck.org ---
Earlier than that.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #23 from Tomash Brechko tomash.brec...@gmail.com ---
I think the order of events is such that pthread_create() is followed by pthread_kill() from the main thread before the new thread has had any chance to run. In this case there are reports that the new thread may miss signals on Linux: http://stackoverflow.com/questions/14827509/does-the-new-thread-exist-when-pthread-create-returns

I think the POSIX intent is that pthread_kill() should work once you have the thread ID, i.e. it's a bug in (some versions of) the Linux kernel. Maybe the signal is first raised and then pending signals are cleared (per POSIX) for the new thread when it starts, or the signal does not become pending because it is not blocked, but is not delivered either because the thread is not really running yet. On my 3.15.10, though, pthread_kill() after pthread_create() always works in C, and I don't have a D compiler at the moment to check whether I can still reproduce the original problem.

OTOH issue 10351 is marked as a duplicate, but it's not clear whether the threads involved there are newly created.

On a side note, in thread_entryPoint() there's this place:

    // NOTE: isRunning should be set to false after the thread is
    //       removed or a double-removal could occur between this
    //       function and thread_suspendAll.
    Thread.remove( obj );
    obj.m_isRunning = false;

Note that if thread_suspendAll() is called after remove() but before the assignment, you will still have a double removal. This shouldn't be related to the bug in question, however.
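The "pthread_kill() straight after pthread_create()" case can be tried in isolation. The following is a rough C sketch of such a check, not the program used in the comment above (that one was not posted); the function and handler names are invented here. It unblocks SIGUSR1 first so that an inherited blocked mask cannot distort the result, signals the new thread immediately, and reports whether the handler ever ran.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t got_signal;

/* Handler: setting a sig_atomic_t flag is async-signal-safe. */
static void on_sigusr1(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Thread body: poll the flag for roughly two seconds, then give up. */
static void *wait_for_signal(void *arg)
{
    struct timespec tick = { 0, 1000000 };   /* 1 ms */
    int i;

    (void)arg;
    for (i = 0; i < 2000 && !got_signal; ++i)
        nanosleep(&tick, NULL);
    return NULL;
}

/* Returns 1 if the new thread saw the signal, 0 if it was never
 * delivered, -1 if the setup failed. */
int signal_reaches_new_thread(void)
{
    struct sigaction sa;
    sigset_t usr1;
    pthread_t t;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* The new thread inherits the creator's mask, so unblock here. */
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    if (pthread_sigmask(SIG_UNBLOCK, &usr1, NULL) != 0)
        return -1;

    got_signal = 0;
    if (pthread_create(&t, NULL, wait_for_signal, NULL) != 0)
        return -1;
    if (pthread_kill(t, SIGUSR1) != 0)       /* no delay on purpose */
        return -1;
    pthread_join(t, NULL);
    return got_signal ? 1 : 0;
}
```

Running signal_reaches_new_thread() in a loop on a suspect kernel would show whether signals sent to a brand-new thread get lost there; a single successful call proves little.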
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #24 from Tomash Brechko tomash.brec...@gmail.com ---
Now I see that I was wrong about the double removal; please ignore that part.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #25 from Sean Kelly s...@invisibleduck.org ---
Hrm... at one point thread_entryPoint called Thread.add to add itself, but I think the add was moved to Thread.start at some point to deal with a race. I had a comment in Thread.start explaining the rationale, but it looks like Thread.start has been heavily edited and the comment is gone. Either way, having Thread.start call Thread.add *after* pthread_create is totally wrong, as it leaves a window for the thread to exist and be allocating memory but be unknown to the GC. I think I'll have to roll back thread.d to find my original comments and see how it used to be implemented. Something was clearly changed here, but there's no longer enough info to tell exactly what.

I've got to say that seeing these and other changes in core.thread without careful documentation of what was changed and why it was done is very frustrating. There's simply no way to unit test for the existence or lack of deadlocks, and the comments in this module were built up over years of bug fixes to explain each situation and why the code was the way it was. If someone changes the code in this module they *must* be absolutely sure of what they are doing and document accordingly.
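To make the window concrete: one way to close it is to create the thread and record it while holding the same lock the collector takes. What follows is a simplified C sketch of that idea and an assumption about the general shape, not the druntime implementation; registry_lock, start_registered_thread, count_registered_threads and idle_entry are all hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t registry[MAX_THREADS];
static size_t registry_len;

/* Creates a thread and records it without releasing registry_lock in
 * between. A collector that takes the same lock therefore sees either
 * no thread at all or a registered one, never a running thread that is
 * missing from the list. Returns 0 on success. */
int start_registered_thread(void *(*entry)(void *), void *arg, pthread_t *out)
{
    int rc = -1;

    pthread_mutex_lock(&registry_lock);
    if (registry_len < MAX_THREADS) {
        rc = pthread_create(out, NULL, entry, arg);
        if (rc == 0)
            registry[registry_len++] = *out;
    }
    pthread_mutex_unlock(&registry_lock);
    return rc;
}

/* Stand-in for the collector: takes the lock and counts known threads. */
size_t count_registered_threads(void)
{
    size_t n;

    pthread_mutex_lock(&registry_lock);
    n = registry_len;
    pthread_mutex_unlock(&registry_lock);
    return n;
}

/* Trivial thread body used only to exercise the sketch. */
static void *idle_entry(void *arg)
{
    return arg;
}
```

The price is that thread creation now contends with the collector for that lock, so a collection that never finishes also stalls every thread start.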
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #26 from safety0ff.bugz safety0ff.b...@gmail.com ---
(In reply to Sean Kelly from comment #25)
> I think I'll have to roll back thread.d to find my original comments and see how it used to be implemented. Something was clearly changed here, but there's no longer enough info to tell exactly what.

This change?
https://github.com/D-Programming-Language/druntime/commit/7a731ffe0869dc
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #16 from badlink andrea.9...@gmail.com ---
(In reply to Sean Kelly from comment #15)
> Okay, I can't reproduce this using the provided code on Oracle Linux 64-bit. If someone has a reliable repro, please let me know.

My Linux machine runs Arch Linux with the 3.14.17-1-lts x86_64 kernel and GNU libc 2.19. Oracle Linux is quite different, as it uses the 3.8.13 x86_64 kernel and glibc 2.17 (http://www.oracle.com/us/technologies/linux/product/specifications/index.html).

Try Manjaro Linux, which is based on Arch but comes with a ready-made desktop environment (just run `pacman -S dlang-dmd` to get DMD).
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #17 from badlink andrea.9...@gmail.com ---
Created attachment 1416
  --> https://issues.dlang.org/attachment.cgi?id=1416&action=edit
stack trace
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

Marco Leise marco.le...@gmx.de changed:

           What    |Removed                     |Added
                 CC|                            |marco.le...@gmx.de

--- Comment #18 from Marco Leise marco.le...@gmx.de ---
*** Issue 10351 has been marked as a duplicate of this issue. ***
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #19 from Sobirari Muhomori dfj1es...@sneakemail.com ---
(In reply to badlink from comment #17)
> stack trace

Hmm... if a thread hangs on a mutex, does it handle signals?
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #20 from Sean Kelly s...@invisibleduck.org ---
It should. Not doing so seems pretty broken. But in this particular kernel it seems like maybe signals are ignored in this situation. What's happening specifically is that one thread is blocked on the mutex protecting the GC, and another thread holds that lock and is attempting a collection. I could change this code to use a spin lock instead, but the same problem could crop up with any mutex, if I understand the problem correctly.

I'm kind of curious to see whether the Boehm GC deadlocks in a similar situation with this kernel. It should, since last time I checked it coordinated collections the exact same way on Linux.
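Whether a thread parked in pthread_mutex_lock() still runs signal handlers can be checked directly. Below is a small C sketch for that, written for illustration only (none of these names come from druntime): the main thread holds a mutex, a second thread blocks on it, and the handler reports back through a semaphore, since sem_post() is one of the calls allowed inside a handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static pthread_mutex_t held = PTHREAD_MUTEX_INITIALIZER;
static sem_t handler_ran;

static void on_sigusr1(int sig)
{
    (void)sig;
    sem_post(&handler_ran);              /* async-signal-safe */
}

static void *block_on_mutex(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&held);           /* blocks: the caller owns it */
    pthread_mutex_unlock(&held);
    return NULL;
}

/* Returns 1 if the handler ran while the thread was waiting for the
 * mutex, 0 if it did not run within about two seconds, -1 on error. */
int blocked_thread_handles_signal(void)
{
    struct sigaction sa;
    struct timespec settle = { 0, 50 * 1000 * 1000 };   /* 50 ms */
    struct timespec deadline;
    sigset_t usr1;
    pthread_t t;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    if (sigaction(SIGUSR1, &sa, NULL) != 0 ||
        pthread_sigmask(SIG_UNBLOCK, &usr1, NULL) != 0 ||
        sem_init(&handler_ran, 0, 0) != 0)
        return -1;

    pthread_mutex_lock(&held);
    if (pthread_create(&t, NULL, block_on_mutex, NULL) != 0) {
        pthread_mutex_unlock(&held);
        return -1;
    }
    nanosleep(&settle, NULL);            /* let the thread reach the lock */
    pthread_kill(t, SIGUSR1);

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 2;
    do {
        rc = sem_timedwait(&handler_ran, &deadline);
    } while (rc == -1 && errno == EINTR);

    pthread_mutex_unlock(&held);         /* release the waiting thread */
    pthread_join(t, NULL);
    sem_destroy(&handler_ran);
    return rc == 0 ? 1 : 0;
}
```

A result of 0 on a given system would point at signal delivery (or a blocked mask) rather than at the mutex itself.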
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #21 from Sobirari Muhomori dfj1es...@sneakemail.com ---
This mutex protects various global data, like the list of threads in core.thread, not the GC.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #22 from Sean Kelly s...@invisibleduck.org ---
Yes, I misspoke somewhat. The GC acquires the lock to the global thread list while collecting to ensure that everything remains in a consistent state while the collection takes place. In this case the GC already holds this lock and Thread.start() is blocked on it, waiting to add the new thread to the list.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

andrea.9...@gmail.com changed:

           What    |Removed                     |Added
           Keywords|                            |industry
             Status|RESOLVED                    |REOPENED
                 CC|                            |andrea.9...@gmail.com
         Resolution|FIXED                       |---
           Severity|normal                      |regression

--- Comment #10 from andrea.9...@gmail.com ---
This bug is present in DMD 2.066 on Arch Linux 3.14.17-1-lts x86_64 (GNU libc 2.19).

The code posted originally still deadlocks. Even with the j.sleep line uncommented it never prints a ".", which means GC.collect never returns:

    import core.thread, core.memory, std.stdio;

    class Job : Thread
    {
        this()
        {
            super(&run);
        }

        // Worker thread: prints "*" forever.
        private void run()
        {
            while (true) write("*");
        }
    }

    void main()
    {
        Job j = new Job;
        j.start();
        //j.sleep(dur!"msecs"(1));
        GC.collect();               // never returns when the bug hits
        while (true) write(".");    // main thread: prints "." forever
    }
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #11 from Sean Kelly s...@invisibleduck.org ---
My initial guess is that this has something to do with the changes for critical regions, as the algorithm for collection before that seemed quite solid. I'll try for a repro on my end though.

What would be really useful from whoever encounters this is to trap it in a debugger and include stack traces of all relevant threads. Something has to be blocked on a lock or signal somewhere, but without knowing which one there's little that can be done.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #12 from Sean Kelly s...@invisibleduck.org ---
Um... I may be wrong in what I just said. It looks like someone added a delegate call within the signal handler for coordinating collections on Linux. There's a decent chance that a dynamic stack frame is being allocated by the GC within that signal handler, which would be Very Bad.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #13 from andrea.9...@gmail.com ---
Just tested: the bug is not present on Windows (DMD 2.066).
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #14 from Sean Kelly s...@invisibleduck.org ---
It's likely as I said. The way GC collections work is different on different platforms. Both Windows and OSX use a kernel call to suspend threads and inspect their stacks. On other Unix platforms (like Linux), the suspending is done via signals, and signal handlers are VERY restrictive in what can safely be done inside them. And either way, having one thread try to allocate something from the GC inside this suspend handler is a guaranteed deadlock.

If this is really what's going on I'm amazed that D on Linux works at all. Maybe it really is something else... I'm setting up a new Linux VM and so should hopefully be able to repro this shortly.
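For readers who have not seen signal-based suspension before, here is a stripped-down C sketch of the general technique. It is an illustration under assumptions, not the druntime or Boehm code: SIGUSR1 is taken as the suspend signal and SIGUSR2 as the resume signal, and every function name is made up. The suspend handler acknowledges through a semaphore and then parks in sigsuspend(); it calls nothing that could allocate or take a lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

static sem_t suspend_ack;              /* posted once a thread has parked */

static void resume_handler(int sig) { (void)sig; }   /* only ends sigsuspend */

static void suspend_handler(int sig)
{
    int saved_errno = errno;
    sigset_t wait_mask;

    (void)sig;
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIGUSR2);    /* only the resume signal wakes us */
    sem_post(&suspend_ack);            /* tell the collector we are stopped */
    sigsuspend(&wait_mask);            /* park until SIGUSR2 is handled */
    errno = saved_errno;
}

int install_suspend_handlers(void)
{
    struct sigaction sa;

    if (sem_init(&suspend_ack, 0, 0) != 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    /* Block everything, SIGUSR2 included, while a handler runs, so a
     * resume sent early stays pending until sigsuspend() unblocks it. */
    sigfillset(&sa.sa_mask);
    sa.sa_handler = suspend_handler;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    sa.sa_handler = resume_handler;
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Waits for the acknowledgement. If SIGUSR1 is never delivered (for
 * example because it is blocked in the target), this waits forever. */
int suspend_thread(pthread_t t)
{
    if (pthread_kill(t, SIGUSR1) != 0)
        return -1;
    while (sem_wait(&suspend_ack) == -1 && errno == EINTR)
        continue;
    return 0;
}

int resume_thread(pthread_t t) { return pthread_kill(t, SIGUSR2); }

static atomic_ulong ticks;
static atomic_int stop_worker;

static void *count_forever(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_worker))
        atomic_fetch_add(&ticks, 1);
    return NULL;
}

/* Returns 1 if the worker stood still while suspended and moved again
 * after the resume, 0 otherwise, -1 if the setup failed. */
int suspend_demo(void)
{
    struct timespec pause = { 0, 50 * 1000 * 1000 };   /* 50 ms */
    unsigned long before, during, after;
    sigset_t both;
    pthread_t t;

    sigemptyset(&both);
    sigaddset(&both, SIGUSR1);
    sigaddset(&both, SIGUSR2);
    pthread_sigmask(SIG_UNBLOCK, &both, NULL);   /* inherited by the worker */
    if (install_suspend_handlers() != 0 ||
        pthread_create(&t, NULL, count_forever, NULL) != 0)
        return -1;

    suspend_thread(t);
    before = atomic_load(&ticks);
    nanosleep(&pause, NULL);
    during = atomic_load(&ticks);

    resume_thread(t);
    nanosleep(&pause, NULL);
    after = atomic_load(&ticks);

    atomic_store(&stop_worker, 1);
    pthread_join(t, NULL);
    return before == during && after > during;
}
```

The unbounded sem_wait() in suspend_thread is the interesting part for this issue: if the target never receives the suspend signal, the suspending side hangs, which from the outside looks just like a collection that never returns.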
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://issues.dlang.org/show_bug.cgi?id=4890

--- Comment #15 from Sean Kelly s...@invisibleduck.org ---
Okay, I can't reproduce this using the provided code on Oracle Linux 64-bit. If someone has a reliable repro, please let me know.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://d.puremagic.com/issues/show_bug.cgi?id=4890

Stanislav Blinov stanislav.bli...@gmail.com changed:

           What    |Removed                     |Added
                 CC|                            |stanislav.bli...@gmail.com
           Platform|x86                         |x86_64

--- Comment #7 from Stanislav Blinov stanislav.bli...@gmail.com 2014-02-08 14:05:36 PST ---
A quick search led me to this issue. It would appear the deadlock still occurs.

I've been encountering it now and then, first when running singleton tests from http://forum.dlang.org/thread/mailman.158.1391156715.13884.digitalmar...@puremagic.com, then when running druntime unittests (more specifically, test/shared/host) while working on providing shared qualifiers for core.sync primitives. At first I thought it had to do with my changes to druntime, but after testing on a clean druntime I encountered it as well.

The deadlock doesn't happen on every run though, so it may be tricky to track down. It's in this piece of test/shared/src/plugin.d:

    launchThread();
    GC.collect();
    joinThread();

GC.collect() simply doesn't return. I haven't investigated deeper yet. Maybe it has something to do with the GC trying to pause/resume an exiting/finished thread?

This is on 64-bit Linux.
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://d.puremagic.com/issues/show_bug.cgi?id=4890

safety0ff.bugz safety0ff.b...@gmail.com changed:

           What    |Removed                     |Added
                 CC|                            |safety0ff.b...@gmail.com

--- Comment #8 from safety0ff.bugz safety0ff.b...@gmail.com 2014-02-08 15:04:51 PST ---
(In reply to comment #7)
> A quick search lead me to this issue. It would appear the deadlock still occurs.
> [...SNIP...]
> The deadlock doesn't happen on every run though, so may be tricky to track down. It's in this piece of test/shared/src/plugin.d:
>
>     launchThread();
>     GC.collect();
>     joinThread();
>
> GC.collect() simply doesn't return. I haven't investigated deeper yet. Maybe it has something to do with GC trying to pause/resume an exiting/finished thread?
>
> This is on 64-bit Linux.

Does the code in the first post deadlock for you? If not, then issues #11981 / #10351 also look relevant.

For #11981 / #10351, we REALLY need a way to reproduce the deadlock, along with information about the system it is running on (glibc version, Linux kernel version, etc., as much as we can get).
[Issue 4890] GC.collect() deadlocks multithreaded program.
https://d.puremagic.com/issues/show_bug.cgi?id=4890

--- Comment #9 from Stanislav Blinov stanislav.bli...@gmail.com 2014-02-08 15:09:32 PST ---
(In reply to comment #8)
> Does the code in the first post deadlock for you?

No, it doesn't.

> If not then issues #11981 / #10351 also look relevant.
>
> For #11981 / #10351, we REALLY need a way to reproduce the deadlock, along with information about the system it is running on (glibc version, linux kernel version, etc, as much as we can get.)

I'll go look at those issues then.
[Issue 4890] GC.collect() deadlocks multithreaded program.
http://d.puremagic.com/issues/show_bug.cgi?id=4890

Sean Kelly s...@invisibleduck.org changed:

           What    |Removed                     |Added
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #6 from Sean Kelly s...@invisibleduck.org 2011-09-06 11:39:15 PDT ---
A thread will be added to the global thread list before its TLS range is set, but the range will be set before the thread ever actually uses TLS data. I think this one can be closed.
[Issue 4890] GC.collect() deadlocks multithreaded program.
http://d.puremagic.com/issues/show_bug.cgi?id=4890

Jakob Bornecrantz wallbra...@gmail.com changed:

           What    |Removed                     |Added
                 CC|                            |wallbra...@gmail.com

--- Comment #5 from Jakob Bornecrantz wallbra...@gmail.com 2011-07-11 18:01:31 PDT ---
This looks fixed with 2.054 on MacOSX; at least I can't repro this.
[Issue 4890] GC.collect() deadlocks multithreaded program.
http://d.puremagic.com/issues/show_bug.cgi?id=4890

Steven Schveighoffer schvei...@yahoo.com changed:

           What    |Removed                     |Added
                 CC|                            |schvei...@yahoo.com

--- Comment #4 from Steven Schveighoffer schvei...@yahoo.com 2011-01-24 06:03:14 PST ---
(In reply to comment #3)
> I've also stumbled over the racing condition in thread_processGCMarks() where a thread was already added to the global thread list but didn't had it's m_tls set yet. It seems fine to test for m_tls being null at that specific place.

That's something that I recently added. Sean, can you confirm that if a thread's m_tls is not yet set, then its actual TLS cannot have been used yet? It seems reasonable to check the TLS block for null at that point.

(will have to start using github soon...)
[Issue 4890] GC.collect() deadlocks multithreaded program.
http://d.puremagic.com/issues/show_bug.cgi?id=4890

d...@dawgfoto.de changed:

           What    |Removed                     |Added
                 CC|                            |d...@dawgfoto.de

--- Comment #3 from d...@dawgfoto.de 2011-01-21 15:12:13 PST ---
I've also stumbled over the race condition in thread_processGCMarks() where a thread was already added to the global thread list but didn't have its m_tls set yet. It seems fine to test for m_tls being null at that specific place.
[Issue 4890] GC.collect() deadlocks multithreaded program.
http://d.puremagic.com/issues/show_bug.cgi?id=4890

--- Comment #2 from Sean Kelly s...@invisibleduck.org 2011-01-04 13:41:41 PST ---
It turns out that the fix I applied produces a race condition with the GC. I'll have to re-wrap Thread.start() in a synchronized block as per the code prior to rev 392. This may re-introduce the deadlock, in which case it will be necessary to replace the isRunning flag with a state field that distinguishes starting from running. A starting thread should be suspended/resumed but not scanned. Or perhaps something else can be sorted out to deal with a thread being in the list that doesn't have its TLS section set, getThis() doesn't work, etc.
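The "state field instead of isRunning" idea from this comment boils down to two different questions being asked of each thread. A tiny C sketch of how that could look follows; the enum and both predicates are hypothetical and only restate the rule given above (a starting thread is suspended and resumed but not scanned).

```c
#include <stdbool.h>

/* Hypothetical lifecycle states replacing a boolean isRunning flag. */
enum thread_state {
    THREAD_STARTING,   /* in the thread list, TLS range not set yet */
    THREAD_RUNNING,    /* fully initialised */
    THREAD_DONE        /* finished, about to be removed from the list */
};

/* A starting thread already exists at OS level, so the collector has
 * to stop it just like a running one. */
bool must_suspend(enum thread_state s)
{
    return s == THREAD_STARTING || s == THREAD_RUNNING;
}

/* Only a fully started thread has stack and TLS ranges worth scanning. */
bool must_scan(enum thread_state s)
{
    return s == THREAD_RUNNING;
}
```

With a single boolean the two predicates collapse into one, which is what forces the choice between a race and a lock in Thread.start().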
[Issue 4890] GC.collect() deadlocks multithreaded program.
http://d.puremagic.com/issues/show_bug.cgi?id=4890

Sean Kelly s...@invisibleduck.org changed:

           What    |Removed                     |Added
             Status|NEW                         |ASSIGNED
                 CC|                            |s...@invisibleduck.org

--- Comment #1 from Sean Kelly s...@invisibleduck.org 2010-09-21 11:35:29 PDT ---
Fixed in druntime changeset 392. Will be in DMD-2.050.