Hey,

In a Mono bug report, we noticed a very rare race in the GC when
restarting the world. GC_restart_handler states:

    /* Let the GC_suspend_handler() know that we got a SIG_THR_RESTART. */
    /* The lookup here is safe, since I'm doing this on behalf  */
    /* of a thread which holds the allocation lock in order     */
    /* to stop the world.  Thus concurrent modification of the  */
    /* data structure is impossible.                            */

However, this comment is not always true. When starting the world, the
thread that does the restarting does *not* wait for all threads to get
past the point where they need the structures used by the lookup for it
to release the GC_lock.

So the sequence of events looked something like:

      * T1 signals T2 to restart the world
      * T1 releases the GC_lock
      * T3 is a newborn thread and adds itself to the table
      * T2 gets the signal and sees a corrupt table because T3 is
        concurrently modifying it.

What would end up happening when we experienced the race was either a
deadlock or a SIGSEGV.

The race was extremely rare. It took 1-2 hours to reproduce on an SMP
machine. With the attached patch, it has not segfaulted or hung for 21
hrs.

-- Ben
Index: pthread_stop_world.c
===================================================================
--- pthread_stop_world.c	(revision 46881)
+++ pthread_stop_world.c	(working copy)
@@ -163,6 +163,12 @@
     /* to accidentally leave a RESTART signal pending, thus causing us to   */
     /* continue prematurely in a future round.				    */ 
 
+    /* Tell the thread that wants to start the world that this  */
+    /* thread has been started.  Note that sem_post() is  	*/
+    /* the only async-signal-safe primitive in LinuxThreads.    */
+    sem_post(&GC_suspend_ack_sem);
+
+
 #if DEBUG_THREADS
     GC_printf1("Continuing 0x%lx\n", my_thread);
 #endif
@@ -421,6 +427,7 @@
     register GC_thread p;
     register int n_live_threads = 0;
     register int result;
+    int code;
 
 #   if DEBUG_THREADS
       GC_printf0("World starting\n");
@@ -450,7 +457,21 @@
         }
       }
     }
+
     #if DEBUG_THREADS
+    GC_printf0 ("All threads signaled");
+    #endif
+
+    for (i = 0; i < n_live_threads; i++) {
+	while (0 != (code = sem_wait(&GC_suspend_ack_sem))) {
+	    if (errno != EINTR) {
+		GC_err_printf1("Sem_wait returned %ld\n", (unsigned long)code);
+		ABORT("sem_wait for handler failed");
+	    }
+	}
+    }
+  
+    #if DEBUG_THREADS
       GC_printf0("World started\n");
     #endif
 }
_______________________________________________
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list

Reply via email to