Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
Le 31/03/2019 à 22:53, Alexis Murzeau a écrit : > Le 31/03/2019 à 15:19, Aurelien Jarno a écrit : >> This bug is very likely a bug present in old glibc versions. It has been >> brought to light when enabling TLS support in openblas and not by a new >> glibc version. >> >> Right now the bug has been workarounded by disabling TLS support in >> openblas. The way to handle this bug is to write a small testcase that >> can be forwarded upstream. It's not an easy task though. >> > > Hi, > > I've made a test case here [0]. > I've not tested it against latest glibc commit. > But it does reproduce the deadlock with glibc 2.28 on Linux. > > To run the test case, do this: > ``` > gcc test_compiler_tls.c -o test_compiler_tls -ldl -g -pthread > gcc test_compiler_tls_lib.c -shared -o test_compiler_tls_lib.so \ > -g -pthread -fPIC > ./test_compiler_tls ./test_compiler_tls_lib & > gdb --pid $! -ex 'thr a a bt' > ``` > > This reproduce the deadlock that I've found in openblas: > 1- The test_thread open the library which call its constructor > 2- The library's constructor create a thread >`thread_that_use_tls_after_sleep` > 3- The thread `thread_that_use_tls_after_sleep` sleep for 100ms (this >needs to be enough so dl_close is called before the sleep ends) > 3- The test_thread close the library with dl_close > 4- dl_close lock `dl_load_lock` and call the library's destructor > 5- The library's destructor wait `thread_that_use_tls_after_sleep` to >finish > 6- The `thread_that_use_tls_after_sleep` thread try to read the TLS >variable which cause a call to `__tls_get_addr` > 7- `__tls_get_addr` cause a deadlock in `tls_get_addr_tail` trying to >lock the same `dl_load_lock` as dl_close does > 8- Nothing happen because dl_close thread is waiting for the >`thread_that_use_tls_after_sleep` thread to finish which having the >lock and the latter thread try to lock the same lock as dl_close and >so never exit. > > See [1] for the stacktrace. > > Thread 3 is the library's thread created in its constructor and joined > in its destructor. > Thread 2 is the thread that does dl_open and dl_close. > Thread 1 is a "monitoring" thread to implement a timeout of 10s (useful > if this tests need to run on a CI system) > > Where dl_close lock the `dl_load_lock`: [2] > Where tls_get_addr_tail lock the `dl_load_lock`: [3] > > [0]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7 > [1]: > https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7#file-gdb_stacktrace-txt > [2]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-close.c#L812 > [3]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-tls.c#L761 > Related links: https://bugzilla.redhat.com/show_bug.cgi?id=1409899 https://sourceware.org/bugzilla/show_bug.cgi?id=2377 Actually, the hang is caused by a C++ here, but that's the same deadlock (the C++ exception require the `dl_load_lock´ lock). It seems from the first link that using thread stuff in constructor and destructor is risky and not well supported and that applications should just avoid doing this. I didn't find a really related bug in sourceware bugzilla, maybe we should forward our bug to them ? -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
Le 31/03/2019 à 15:19, Aurelien Jarno a écrit : > This bug is very likely a bug present in old glibc versions. It has been > brought to light when enabling TLS support in openblas and not by a new > glibc version. > > Right now the bug has been workarounded by disabling TLS support in > openblas. The way to handle this bug is to write a small testcase that > can be forwarded upstream. It's not an easy task though. > Hi, I've made a test case here [0]. I've not tested it against latest glibc commit. But it does reproduce the deadlock with glibc 2.28 on Linux. To run the test case, do this: ``` gcc test_compiler_tls.c -o test_compiler_tls -ldl -g -pthread gcc test_compiler_tls_lib.c -shared -o test_compiler_tls_lib.so \ -g -pthread -fPIC ./test_compiler_tls ./test_compiler_tls_lib & gdb --pid $! -ex 'thr a a bt' ``` This reproduce the deadlock that I've found in openblas: 1- The test_thread open the library which call its constructor 2- The library's constructor create a thread `thread_that_use_tls_after_sleep` 3- The thread `thread_that_use_tls_after_sleep` sleep for 100ms (this needs to be enough so dl_close is called before the sleep ends) 3- The test_thread close the library with dl_close 4- dl_close lock `dl_load_lock` and call the library's destructor 5- The library's destructor wait `thread_that_use_tls_after_sleep` to finish 6- The `thread_that_use_tls_after_sleep` thread try to read the TLS variable which cause a call to `__tls_get_addr` 7- `__tls_get_addr` cause a deadlock in `tls_get_addr_tail` trying to lock the same `dl_load_lock` as dl_close does 8- Nothing happen because dl_close thread is waiting for the `thread_that_use_tls_after_sleep` thread to finish which having the lock and the latter thread try to lock the same lock as dl_close and so never exit. See [1] for the stacktrace. Thread 3 is the library's thread created in its constructor and joined in its destructor. Thread 2 is the thread that does dl_open and dl_close. Thread 1 is a "monitoring" thread to implement a timeout of 10s (useful if this tests need to run on a CI system) Where dl_close lock the `dl_load_lock`: [2] Where tls_get_addr_tail lock the `dl_load_lock`: [3] [0]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7 [1]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7#file-gdb_stacktrace-txt [2]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-close.c#L812 [3]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-tls.c#L761 -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
This bug is very likely a bug present in old glibc versions. It has been brought to light when enabling TLS support in openblas and not by a new glibc version. Right now the bug has been workarounded by disabling TLS support in openblas. The way to handle this bug is to write a small testcase that can be forwarded upstream. It's not an easy task though. -- Aurelien Jarno GPG: 4096R/1DDD8C9B aurel...@aurel32.net http://www.aurel32.net signature.asc Description: PGP signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
Hi, On 07/09/2018 16:57, Sébastien Villemot wrote: > > I have just uploaded openblas 0.3.3+ds-1, which has TLS disabled. > > I think this should fix the original issue, i.e. the gimp+openblas > deadlock. Please let me know if this is not the case. > > Best, > Thanks for your update. I tried to start gimp with this openblas version installed and it did not crashed or hanged. But there's still a possible crash that can occur, when I do a test that does dl_open followed by dl_close of libopenblas, I get a segfault when stopping the thread that does the dl_open/dl_close. This crash doesn't seem to cause issues to gimp but might on some machines (maybe no threads are used by gimp when indirectly loading openblas and so the crash doesn't occur, but not sure at all). More extensive information here: [0]. If no one object that gimp doesn't crash anymore with that 3.3 version, maybe this bug can be closed (letting the crash of the dl_open/dl_close test be handled by upstream only [0]). [0] https://github.com/xianyi/OpenBLAS/issues/1720#issuecomment-418538099 -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
Le vendredi 10 août 2018 à 23:27 +0200, Alexis Murzeau a écrit : > On 09/08/2018 00:22, Alexis Murzeau wrote: > > On 08/08/2018 00:59, Alexis Murzeau wrote: > > > severity 903514 important > > > thanks > > > > > > > Reassigning to glibc with affects on openblas and gimp as this is caused > > > > by a deadlock inside glibc. > > I've posted a issue on openblas upstream project [0] and they suggested > > some solutions. > > One of them is to disable the use of compiler supported TLS and instead > > use pthreads. I have just uploaded openblas 0.3.3+ds-1, which has TLS disabled. I think this should fix the original issue, i.e. the gimp+openblas deadlock. Please let me know if this is not the case. Best, -- ⢀⣴⠾⠻⢶⣦⠀ Sébastien Villemot ⣾⠁⢠⠒⠀⣿⡁ Debian Developer ⢿⡄⠘⠷⠚⠋⠀ http://sebastien.villemot.name ⠈⠳⣄ http://www.debian.org signature.asc Description: This is a digitally signed message part
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
On 12/08/2018 15:55, Jackie wrote: > Great. It works. > To make it clear, this is what I did to test: > > 1. Install libopenblas-base:amd64 (0.3.2+ds-1) and libopenblas-dev:amd64 > (0.3.2+ds-1) from official repo. After installation complete, open a > terminal and type gimp + ENTER. The cmd hangs forever and Ctrl-C gets a > segment fault. > 2. sudo apt purge libopenblas-base libopenblas-dev. Upon finish, open a > terminal and type gimp + ENTER. Gimp lanuches without problem. > 3. Add the repo you offered and install libopenblas-base:amd64 > (0.3.2+ds-1.1~1patchTLS) and libopenblas-dev:amd64 > (0.3.2+ds-1.1~1patchTLS). The installations go well, after that, open a > terminal and type gimp + ENTER. Gimp again lanuches without any problem. > > I think this validates both the existence of the bug and your solution. > Thank you very much. Thank you too for testing that patch :) Good to see it works for others having the gimp hang too. > > Jiang Jun -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
Great. It works. To make it clear, this is what I did to test: 1. Install libopenblas-base:amd64 (0.3.2+ds-1) and libopenblas-dev:amd64 (0.3.2+ds-1) from official repo. After installation complete, open a terminal and type gimp + ENTER. The cmd hangs forever and Ctrl-C gets a segment fault. 2. sudo apt purge libopenblas-base libopenblas-dev. Upon finish, open a terminal and type gimp + ENTER. Gimp lanuches without problem. 3. Add the repo you offered and install libopenblas-base:amd64 (0.3.2+ds-1.1~1patchTLS) and libopenblas-dev:amd64 (0.3.2+ds-1.1~1patchTLS). The installations go well, after that, open a terminal and type gimp + ENTER. Gimp again lanuches without any problem. I think this validates both the existence of the bug and your solution. Thank you very much. Jiang Jun On Sun, Aug 12, 2018 at 9:31 PM, Alexis Murzeau wrote: On 10/08/2018 23:27, Alexis Murzeau wrote: I can provide a binary package that include this patch, but I'm not sure this is the best thing to do (I'm not the official maintainer, nor know a good place to upload it). As I was requested to provide binary package to test the patch, I'm providing this as version `0.3.2+ds-1.1~1patchTLS` built with sbuild at [0]. Instructions are on the same page [0]: execute: `apt-key adv --keyserver hkps://hkps.pool.sks-keyservers.net --recv-keys 0x3F7A2FA142E434FE06622560B05266B2EB68F001` and add `deb https://amurzeau.github.io/apt-repository unstable main` to sources.list. This APT repository contains both the source and amd64 binary packages. If you try it, please tell if it works. [0] https://amurzeau.github.io/apt-repository/ -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
On 10/08/2018 23:27, Alexis Murzeau wrote: > I can provide a binary package that include this patch, but I'm not sure > this is the best thing to do (I'm not the official maintainer, nor know > a good place to upload it). > As I was requested to provide binary package to test the patch, I'm providing this as version `0.3.2+ds-1.1~1patchTLS` built with sbuild at [0]. Instructions are on the same page [0]: execute: `apt-key adv --keyserver hkps://hkps.pool.sks-keyservers.net --recv-keys 0x3F7A2FA142E434FE06622560B05266B2EB68F001` and add `deb https://amurzeau.github.io/apt-repository unstable main` to sources.list. This APT repository contains both the source and amd64 binary packages. If you try it, please tell if it works. [0] https://amurzeau.github.io/apt-repository/ -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
On 09/08/2018 00:22, Alexis Murzeau wrote: > On 08/08/2018 00:59, Alexis Murzeau wrote: >> severity 903514 important >> thanks >> >>> Reassigning to glibc with affects on openblas and gimp as this is caused >>> by a deadlock inside glibc. >> >> Done. >> >> Lowering severity as this does not render any package unusable by >> themselves, but only a combination of them (GIMP + OpenBLAS). >> >> I think a workaround solution against GIMP OpenBLAS should be done as >> I'm not sure a good solution will emerge in glibc given attempts done in >> the past. The work to be done seems non trivial. >> >> My though on possible solutions: >> * Add a breaks between GIMP and OpenBLAS >> * Disable TLS in OpenBLAS build (if possible, but this would cause a >> performance loss for users that use OpenBLAS without gimp) >> * Add a delay in GIMP to not load then close libraries too fast (so >> OpenBLAS threads are fully initialized when dl_close is called on it) >> > > Hi, > > I've posted a issue on openblas upstream project [0] and they suggested > some solutions. > One of them is to disable the use of compiler supported TLS and instead > use pthreads. > > I tested this and it seems to fix deadlocks while starting gimp (I tried > without arguments, with a non existing file and with an existing file). > > I've pushed a merge request with the patch at [1]. > I've also asked openblas upstream if this patch could be a good solution. > > In that case would it be possible to have this patch tested for ones who > have major instabilities with gimp + openblas ? > > Thanks :) > Hi, I've updated the merge request [0] with the upstream proposed patch [1]. @openblas maintainers, maybe someone can build a package with this patch and upload to experimental so others can check if gimp works fine with it ? I've myself tested it and gimp does not deadlock. I can provide a binary package that include this patch, but I'm not sure this is the best thing to do (I'm not the official maintainer, nor know a good place to upload it). [0] https://salsa.debian.org/science-team/openblas/merge_requests/1 [1] https://github.com/xianyi/OpenBLAS/pull/1726 -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
On 08/08/2018 00:59, Alexis Murzeau wrote: > severity 903514 important > thanks > >> Reassigning to glibc with affects on openblas and gimp as this is caused >> by a deadlock inside glibc. > > Done. > > Lowering severity as this does not render any package unusable by > themselves, but only a combination of them (GIMP + OpenBLAS). > > I think a workaround solution against GIMP OpenBLAS should be done as > I'm not sure a good solution will emerge in glibc given attempts done in > the past. The work to be done seems non trivial. > > My though on possible solutions: > * Add a breaks between GIMP and OpenBLAS > * Disable TLS in OpenBLAS build (if possible, but this would cause a > performance loss for users that use OpenBLAS without gimp) > * Add a delay in GIMP to not load then close libraries too fast (so > OpenBLAS threads are fully initialized when dl_close is called on it) > Hi, I've posted a issue on openblas upstream project [0] and they suggested some solutions. One of them is to disable the use of compiler supported TLS and instead use pthreads. I tested this and it seems to fix deadlocks while starting gimp (I tried without arguments, with a non existing file and with an existing file). I've pushed a merge request with the patch at [1]. I've also asked openblas upstream if this patch could be a good solution. In that case would it be possible to have this patch tested for ones who have major instabilities with gimp + openblas ? Thanks :) -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature
Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)
severity 903514 important thanks > Reassigning to glibc with affects on openblas and gimp as this is caused > by a deadlock inside glibc. Done. Lowering severity as this does not render any package unusable by themselves, but only a combination of them (GIMP + OpenBLAS). I think a workaround solution against GIMP OpenBLAS should be done as I'm not sure a good solution will emerge in glibc given attempts done in the past. The work to be done seems non trivial. My though on possible solutions: * Add a breaks between GIMP and OpenBLAS * Disable TLS in OpenBLAS build (if possible, but this would cause a performance loss for users that use OpenBLAS without gimp) * Add a delay in GIMP to not load then close libraries too fast (so OpenBLAS threads are fully initialized when dl_close is called on it) -- Alexis Murzeau PGP: B7E6 0EBB 9293 7B06 BDBC 2787 E7BD 1904 F480 937F signature.asc Description: OpenPGP digital signature