On Thu, May 06, 2021 at 09:32:28AM +0200, Sebastien Marie wrote:
> Hi,
> 
> Anindya, did a good analysis of the problem with mpv using gpu video
> output backend (it is using EGL and mesa if I correctly followed).
> 
> 
> For people not reading ports@ here a resume: the destructor function
> used in pthread_key_create() needs to be present in memory until
> _rthread_tls_destructors() is called.
> 
> in the case of mesa, eglInitialize() function could load, via
> dlopen(), code which will use pthread_key_create() with destructor.
> 
> once dlclose() is called, the object is unloaded from memory, but a
> reference to destructor is kept, leading to segfault when
> _rthread_tls_destructors() run and use the destructor (because
> pointing to unloaded code).
> 
> We already take care of such situation with __cxa_thread_atexit_impl
> (in libc/stdlib/thread_atexit.c), by keeping an additionnal reference
> on object loaded (it makes ld.so aware that it is still used and so
> dlclose() doesn't unload it).
> 
> I used the same idiom for pthread_key_create() and used dlctl(3) in
> the same way with the destructor address.
> 
> With the following diff, I am not able to reproduce the segfault
> anymore with mpv.
> 
> diff 393e7b397988bb6abe46729de1794883d2b9d5cf /home/semarie/repos/openbsd/src
> blob - 58b39bb7df70f54c708d2e2b11a3978806a86005
> file + lib/libc/thread/rthread_tls.c
> --- lib/libc/thread/rthread_tls.c
> +++ lib/libc/thread/rthread_tls.c
> @@ -22,10 +22,8 @@
>  #include <errno.h>
>  #include <pthread.h>
>  #include <stdlib.h>
> +#include <dlfcn.h>
>  
> -#include <pthread.h>
> -#include <stdlib.h>
> -
>  #include "rthread.h"
>  
>  
> @@ -58,6 +56,9 @@ pthread_key_create(pthread_key_t *key, void (*destruct
>       rkeys[hint].used = 1;
>       rkeys[hint].destructor = destructor;
>  
> +     if (destructor != NULL)
> +             dlctl(NULL, DL_REFERENCE, destructor);
> +     
>       *key = hint++;
>       if (hint >= PTHREAD_KEYS_MAX)
>               hint = 0;
> 
> 
> I am still unsure if we want to solve this problem this way, and if
> this diff would introduce more problems than it tries to solve.
> 
> At first place, it might be better to not register a destructor when
> using dlopen()...
> 
> Comments would be appreciated.
> 
> Thanks.
> -- 
> Sebastien Marie
> 
> 
> On Wed, May 05, 2021 at 01:20:02AM -0700, Anindya Mukherjee wrote:
> > Hi,
> > 
> > I have been investigating the crash on exit problem with mpv in ports
> > with vo=gpu. I think I made a little bit of progress and thought I'd
> > share my findings.
> > 
> > The crash (SIGSEGV) happens when thread local destructors
> > are called from /usr/src/lib/libc/thread/rthread_tls.c:182 in
> > _rthread_tls_destructors after the gpu thread exits: vo_thread in
> > video/out/vo.c:1067. The crashing call stack looks like this:
> > 
> > #0  0x00000176ffdc9680 in ?? ()
> > #1  0x0000017748d347b5 in _rthread_tls_destructors (thread=0x17798917840) 
> > at /usr/src/lib/libc/thread/rthread_tls.c:182
> > #2  0x0000017748d98623 in _libc_pthread_exit (retval=<error reading 
> > variable: Unhandled dwarf expression opcode 0xa3>) at 
> > /usr/src/lib/libc/thread/rthread.c:150
> > #3  0x0000017795b22189 in _rthread_start (v=<error reading variable: 
> > Unhandled dwarf expression opcode 0xa3>) at 
> > /usr/src/lib/librthread/rthread.c:97
> > #4  0x0000017748d0c5ba in __tfork_thread () at 
> > /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:84
> > 
> > Note that some of the traces were taken from different runs so there
> > might be some mismatch between the handles/addresses.
> > 
> > It crashes because the destructor is dangling. This mystified me because
> > if I look at the mpv source, there is no thread local data for the gpu
> > thread. Indeed, right after the gpu thread starts running if we look
> > inside the thread structure, local_storage is null. However, if we look
> > at the same thread at the point of the crash, its local_storage is
> > populated:
> > 
> > (gdb) p *(*thread).local_storage
> > $3 = {
> >   keyid = 7,
> >   next = 0x177353442e0,
> >   data = 0x1770c276000
> > }
> > 
> > The keys are indexed by the keyid in the rkeys array, from where the
> > destructor is fetched in _rthread_tls_destructors:
> > 
> > (gdb) p rkeys[7]
> > $6 = {
> >   used = 1,
> >   destructor = 0x43dd33e2680
> > }
> > 
> > This destructor now points to invalid memory. It turns out the thread
> > local storage is being initialised here:
> > 
> > #0  _libc_pthread_key_create (key=0x43dd380da08, destructor=0x43dd33e2680) 
> > at /usr/src/lib/libc/thread/rthread_tls.c:42
> > #1  0x0000043dd33e2667 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #2  0x0000043e793f82f7 in pthread_once (once_control=0x43dd380d9f8, 
> > init_routine=0x43e793db3c0 <_libc_pthread_key_create>) at 
> > /usr/src/lib/libc/thread/rthread_once.c:26
> > #3  0x0000043dd33e24bd in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #4  0x0000043dd305475f in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #5  0x0000043dd3036c70 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #6  0x0000043dd30e3ca3 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #7  0x0000043dd30e4b96 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #8  0x0000043dd302d89a in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #9  0x0000043dd3031162 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #10 0x0000043dd3031ec6 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #11 0x0000043dd30f7a8b in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #12 0x0000043dd311a94e in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #13 0x0000043dd311addf in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #14 0x0000043dd33ae4a6 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #15 0x0000043dd33ad6a2 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #16 0x0000043dd276e1d3 in ?? () from /usr/X11R6/lib/modules/dri/iris_dri.so
> > #17 0x0000043b8f0816c6 in gl_clear (ra=0x43e50a61a50, dst=0x43e399aa950, 
> > color=0x43e0e382370, scissor=0x43e0e382390) at 
> > ../mpv-0.33.1/video/out/opengl/ra_gl.c:684
> > #18 0x0000043b8f061db8 in gl_video_render_frame (p=0x43db938c050, 
> > frame=0x43e399bb350, fbo=..., flags=3) at 
> > ../mpv-0.33.1/video/out/gpu/video.c:3251
> > #19 0x0000043b8f089a8f in draw_frame (vo=0x43e32b9f450, 
> > frame=0x43e399bb350) at ../mpv-0.33.1/video/out/vo_gpu.c:87
> > #20 0x0000043b8f0882a4 in render_frame (vo=0x43e32b9f450) at 
> > ../mpv-0.33.1/video/out/vo.c:957
> > #21 0x0000043b8f087735 in vo_thread (ptr=0x43e32b9f450) at 
> > ../mpv-0.33.1/video/out/vo.c:1095
> > #22 0x0000043da1682181 in _rthread_start (v=<error reading variable: 
> > Unhandled dwarf expression opcode 0xa3>) at 
> > /usr/src/lib/librthread/rthread.c:96
> > #23 0x0000043e793b35ba in __tfork_thread () at 
> > /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:84
> > 
> > So the mpv code is not directly aware of the TLS, and it is being
> > allocated in iris_dri.so. This trace was taken at the start of video
> > playback, and at this point iris_dri.so is loaded (using dlopen) and the
> > destructor is valid.
> > 
> > #0  dlopen (libname=0xbbddfb93320 "/usr/X11R6/lib/modules/dri/iris_dri.so", 
> > flags=258) at /usr/src/libexec/ld.so/dlfcn.c:51
> > #1  0x00000bbd76e126e6 in loader_open_driver (driver_name=0xbbd52b277e0 
> > "iris", out_driver_handle=0xbbd417eda28, search_path_vars=<optimized out>) 
> > at /usr/xenocara/lib/mesa/mk/libloader/../../src/loader/loader.c:579
> > #2  0x00000bbd76e0a7a8 in dri2_open_driver (disp=<optimized out>) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/egl_dri2.c:771
> > #3  dri2_load_driver_common (disp=<optimized out>, 
> > driver_extensions=<optimized out>) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/egl_dri2.c:783
> > #4  dri2_load_driver_dri3 (disp=<error reading variable: Unhandled dwarf 
> > expression opcode 0xa3>) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/egl_dri2.c:808
> > #5  0x00000bbd76e0331c in dri2_initialize_x11_dri3 (drv=<optimized out>, 
> > disp=0xbbd4180c000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/platform_x11.c:1393
> > #6  dri2_initialize_x11 (drv=<error reading variable: Unhandled dwarf 
> > expression opcode 0xa3>, disp=0xbbd4180c000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/platform_x11.c:1554
> > #7  0x00000bbd76e0c352 in dri2_initialize (drv=0xbbd417fd200, 
> > disp=0xbbd4180c000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/egl_dri2.c:1143
> > #8  0x00000bbd76e0649e in _eglMatchAndInitialize (disp=0xbbd4180c000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/main/egldriver.c:75
> > #9  _eglMatchDriver (disp=0xbbd4180c000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/main/egldriver.c:98
> > #10 0x00000bbd76dfa1c1 in eglInitialize (dpy=<optimized out>, major=0x0, 
> > minor=0x0) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/main/eglapi.c:617
> > #11 0x00000bbb39afbca0 in mpegl_init (ctx=0xbbd41809350) at 
> > ../mpv-0.33.1/video/out/opengl/context_x11egl.c:109
> > #12 0x00000bbb39ad0c96 in ra_ctx_create (vo=0xbbd5b2df050, 
> > context_type=0x0, context_name=0x0, opts=...) at 
> > ../mpv-0.33.1/video/out/gpu/context.c:185
> > #13 0x00000bbb39b083a7 in preinit (vo=0xbbd5b2df050) at 
> > ../mpv-0.33.1/video/out/vo_gpu.c:298
> > #14 0x00000bbb39b06679 in vo_thread (ptr=0xbbd5b2df050) at 
> > ../mpv-0.33.1/video/out/vo.c:1080
> > #15 0x00000bbe30c09181 in _rthread_start (v=<error reading variable: 
> > Unhandled dwarf expression opcode 0xa3>) at 
> > /usr/src/lib/librthread/rthread.c:96
> > #16 0x00000bbdb38aa5ba in __tfork_thread () at 
> > /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:84
> > 
> > However, when mpv is shutting down, iris_dri.so is
> > unloaded here:
> > 
> > #0  dlclose (handle=0xb79c1fa5800) at /usr/src/libexec/ld.so/dlfcn.c:274
> > #1  0x00000b7a51b541e0 in dri2_display_destroy (disp=0xb7969481000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/egl_dri2.c:1204
> > #2  0x00000b7a51b55407 in dri2_display_release (disp=0xb7969481000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/egl_dri2.c:1188
> > #3  dri2_terminate (drv=<error reading variable: Unhandled dwarf expression 
> > opcode 0xa3>, disp=0xb7969481000) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/drivers/dri2/egl_dri2.c:1285
> > #4  0x00000b7a51b43db7 in eglTerminate (dpy=<optimized out>) at 
> > /usr/xenocara/lib/mesa/mk/libEGL/../../src/egl/main/eglapi.c:675
> > #5  0x00000b775b50a174 in mpegl_uninit (ctx=0xb796948f550) at 
> > ../mpv-0.33.1/video/out/opengl/context_x11egl.c:51
> > #6  0x00000b775b4dedc8 in ra_ctx_destroy (ctx_ptr=0xb796faf1158) at 
> > ../mpv-0.33.1/video/out/gpu/context.c:211
> > #7  0x00000b775b516d8d in uninit (vo=0xb796fabb650) at 
> > ../mpv-0.33.1/video/out/vo_gpu.c:286
> > #8  0x00000b775b514994 in vo_thread (ptr=0xb796fabb650) at 
> > ../mpv-0.33.1/video/out/vo.c:1136
> > #9  0x00000b796775c181 in _rthread_start (v=<error reading variable: 
> > Unhandled dwarf expression opcode 0xa3>) at 
> > /usr/src/lib/librthread/rthread.c:96
> > #10 0x00000b7a3e8ef5ba in __tfork_thread () at 
> > /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:84
> > 
> > So iris_dri.so is unloaded before the _rthread_tls_destructors function
> > gets to the destructor. This is the cause of the crash. I did a quick
> > and dirty test by doing this:
> > LD_PRELOAD=/usr/X11R6/lib/modules/dri/iris_dri.so ./mpv -v file.mp4
> > and indeed now mpv does not crash on exit (vo=gpu is being used by
> > default), because the destructor is being resolved from LD_PRELOAD.
> > 
> > I intend to look at this on Linux to see why it does not crash there,
> > but haven't gotten to it yet. In the meanwhile, I wonder if we can
> > patch the OpenBSD port in some way to prevent the dangling TLS
> > destructor. If anyone has a clean solution based on the above
> > information please feel free to chime in. I'd love to get this fixed.
> > 
> > Regards,
> > Anindya
> > 
> 

Hi Sebastien,

Thanks for the patch! Just briefly tested this and it works (no more crash).

I'll run this for a while and report if there are issues.

-- 
Kind regards,
Hiltjo

Reply via email to