In December, I found the mono port in a state where it only built with the
(deprecated) mcs compiler, and it would abort with SIGABRT, mostly when
applications exited (including any call to the newer csc/roslyn compiler).

I've since made what seems like partial progress with mono's functionality and
reliability, but hit a dead end. It appears to be related to the thread local
storage. Here's a brief summary, hoping for some additional input from those
with more knowledge about these matters:

- I found that the SIGABRT was consistently associated with SIGTTIN, the
  signal that mono sets in mono-threads-posix-signals.c as the abort signal
  for threads. Changing it to SIGUSR1 resolved this and lets applications
  exit without aborting. (I found a reference saying that Linux uses
  SIGUSR1/2, and that on platforms with SIGRTMIN the first available signal
  from that range is chosen.)

- After this, trying to build with csc/roslyn still fails: it deadlocks when
  building the first DLL (mscorlib). Attaching gdb shows ~16-20 parallel
  threads with no clear pattern (to me) that would lead anywhere.

- However, setting build option '--with-cooperative-gc=yes' allows the build
  process to run further. This is also what upstream is planning to make the
  default (https://github.com/mono/mono/issues/6921).
  Now it builds several dozen DLLs, if not more, with csc/roslyn, but eventually
  segfaults. This occurs at different times and with different DLLs each time.
  I've tried to troubleshoot it, but this seems to be beyond my capabilities.
  Every time, the backtrace of the core file looks like this:

[...]
Core was generated by `mono-sgen'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  mono_get_lmf () at mini-runtime.c:741
741                     return jit_tls->lmf;
(gdb) where
#0  mono_get_lmf () at mini-runtime.c:741
#1  0x00001f5d55296988 in mono_thread_state_init (ctx=0x1f5f82f08070) at mini-exceptions.c:3127
#2  0x00001f5d554d9b19 in mono_threads_enter_gc_safe_region_unbalanced_with_info (info=0x1f5f82f08000, stackdata=<optimized out>) at mono-threads-coop.c:245
#3  0x00001f5d554d69b3 in mono_thread_info_suspend_lock_with_info (info=0x1f5f82f08000) at mono-threads.c:1093
#4  0x00001f5d55479a79 in acquire_gc_locks () at sgen-stw.c:87
#5  sgen_client_stop_world (generation=0) at sgen-stw.c:111
#6  0x00001f5d5548869e in sgen_stop_world (generation=0) at sgen-gc.c:3790
#7  0x00001f5d5545d5a5 in mono_gc_clear_domain (domain=0x1f5fc211d400) at sgen-mono.c:865
#8  0x00001f5d55362a88 in mono_domain_free (domain=0x1f5fc211d400, force=<optimized out>) at domain.c:1088
#9  0x00001f5d55209556 in mini_cleanup (domain=0x1f5fc211d400) at mini-runtime.c:4445
#10 0x00001f5d5525c5eb in mono_main (argc=26, argv=<optimized out>) at driver.c:2374
#11 0x00001f5d552016f0 in mono_main_with_options (argc=2, argv=0x7f7ffffd4e78) at ./main.c:47
#12 main (argc=26, argv=<optimized out>) at ./main.c:344
(gdb) info locals
jit_tls = 0x1f5f8df48000
(gdb) p *jit_tls
Cannot access memory at address 0x1f5f8df48000
(gdb)

I've checked with a printf and mono_get_lmf gets called countless times with no
apparent errors/bugs before the segfault. I tried running the build command
that failed in gdb, but that led to very inconsistent results. Sometimes
the program would complete normally, sometimes catch SIGSEGV (in a different
place), sometimes catch SIGABRT elsewhere.

In mini-runtime.c, jit_tls is obtained via mono_tls_get_jit_tls which in the
end calls pthread_getspecific(3) (in mono-tls.h). It is unclear to me why it
would return an inaccessible memory address. The key is created by
pthread_key_create(3), also in mono-tls.h.
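
To illustrate one way this could happen (only a guess on my part, not
something I have confirmed in mono): pthread_getspecific(3) does not validate
what it returns, it simply hands back whatever pointer was last stored with
pthread_setspecific(3). If the record behind the pointer is released while the
slot is left untouched, the caller gets a non-NULL but stale address. A minimal
sketch, where struct fake_jit_tls, pool and slot_outlives_record are
hypothetical names and not mono code:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for mono's per-thread JIT data (not the real struct). */
struct fake_jit_tls {
    int   in_use;   /* 1 while the record is considered allocated */
    void *lmf;
};

/* Static storage: the address stays valid, so the check below is well defined.
 * In the real failure the memory would be freed or unmapped instead. */
static struct fake_jit_tls pool[1];
static pthread_key_t jit_key;

/* Returns 1 if the TLS slot still hands back the record after its release. */
static int slot_outlives_record(void)
{
    int rc = pthread_key_create(&jit_key, NULL);   /* no destructor */
    assert(rc == 0);

    pool[0].in_use = 1;
    rc = pthread_setspecific(jit_key, &pool[0]);
    assert(rc == 0);

    /* "Release" the record but forget to clear the TLS slot. */
    pool[0].in_use = 0;

    /* The slot knows nothing about the release and returns the old address. */
    struct fake_jit_tls *p = pthread_getspecific(jit_key);
    int stale = (p == &pool[0]) && !p->in_use;

    pthread_key_delete(jit_key);
    return stale;
}
```

If something like this is what happens, the open question becomes which code
path releases the jit_tls data without resetting the slot, or whether another
thread's data ends up being referenced during shutdown.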

It appears that finding out why mono_get_lmf segfaults might be a way to
restore mono to working condition. It otherwise has been running mostly stable
with these changes (with SIGUSR1 and cooperative gc via MONO_ENABLE_GC env var),
including hour-long application test sessions and building C# projects with
xbuild. Not sure if this behavior is indicative of a race condition, or an edge
case that is regularly triggered.
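
For reference, the abort signal selection I described above amounts to
something like the following. This is a simplified sketch of the idea, not the
actual code from mono-threads-posix-signals.c; pick_abort_signal is a
hypothetical name, and using SIGRTMIN itself as "the first one available" is my
assumption:

```c
#include <signal.h>

/* Hypothetical helper: choose a signal for aborting threads that is not
 * SIGTTIN. */
static int pick_abort_signal(void)
{
#if defined(SIGRTMIN)
    /* Platforms with realtime signals: take the first one in the range. */
    return SIGRTMIN;
#else
    /* No realtime signals: fall back to SIGUSR1, as in my local change. */
    return SIGUSR1;
#endif
}
```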

Any advice on how to get to the bottom of this?
