Hey, In a Mono bug report, we noticed a very rare race in the GC when restarting the world. GC_restart_handler states:
/* Let the GC_suspend_handler() know that we got a SIG_THR_RESTART. */ /* The lookup here is safe, since I'm doing this on behalf */ /* of a thread which holds the allocation lock in order */ /* to stop the world. Thus concurrent modification of the */ /* data structure is impossible. */ However, this comment is not always true. When starting the world, the thread that does the restarting does *not* wait for all threads to get past the point where they need the structures used by the lookup for it to release the GC_lock. So the sequence of events looked something like: * T1 signals T2 to restart the world * T1 releases the GC_lock * T3 is a newborn thread and adds itself to the table * T2 gets the signal and sees a corrupt table because T3 is concurrently modifying it. What would end up happening when we experienced the race was either a deadlock or a SIGSEGV. The race was extremely rare. It took 1-2 hours to reproduce on an SMP machine. With the attached patch, it has not segfaulted or hung for 21 hrs. -- Ben
Index: pthread_stop_world.c =================================================================== --- pthread_stop_world.c (revision 46881) +++ pthread_stop_world.c (working copy) @@ -163,6 +163,12 @@ /* to accidentally leave a RESTART signal pending, thus causing us to */ /* continue prematurely in a future round. */ + /* Tell the thread that wants to start the world that this */ + /* thread has been started. Note that sem_post() is */ + /* the only async-signal-safe primitive in LinuxThreads. */ + sem_post(&GC_suspend_ack_sem); + + #if DEBUG_THREADS GC_printf1("Continuing 0x%lx\n", my_thread); #endif @@ -421,6 +427,7 @@ register GC_thread p; register int n_live_threads = 0; register int result; + int code; # if DEBUG_THREADS GC_printf0("World starting\n"); @@ -450,7 +457,21 @@ } } } + #if DEBUG_THREADS + GC_printf0 ("All threads signaled"); + #endif + + for (i = 0; i < n_live_threads; i++) { + while (0 != (code = sem_wait(&GC_suspend_ack_sem))) { + if (errno != EINTR) { + GC_err_printf1("Sem_wait returned %ld\n", (unsigned long)code); + ABORT("sem_wait for handler failed"); + } + } + } + + #if DEBUG_THREADS GC_printf0("World started\n"); #endif }
_______________________________________________ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list