Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2019-03-31 Thread Alexis Murzeau
Le 31/03/2019 à 22:53, Alexis Murzeau a écrit :
> Le 31/03/2019 à 15:19, Aurelien Jarno a écrit :
>> This bug is very likely a bug present in old glibc versions. It has been
>> brought to light when enabling TLS support in openblas and not by a new
>> glibc version.
>>
>> Right now the bug has been workarounded by disabling TLS support in
>> openblas. The way to handle this bug is to write a small testcase that
>> can be forwarded upstream. It's not an easy task though.
>>
> 
> Hi,
> 
> I've made a test case here [0].
> I've not tested it against latest glibc commit.
> But it does reproduce the deadlock with glibc 2.28 on Linux.
> 
> To run the test case, do this:
> ```
> gcc test_compiler_tls.c -o test_compiler_tls -ldl -g -pthread
> gcc test_compiler_tls_lib.c -shared -o test_compiler_tls_lib.so \
>  -g -pthread -fPIC
> ./test_compiler_tls ./test_compiler_tls_lib &
> gdb --pid $! -ex 'thr a a bt'
> ```
> 
> This reproduce the deadlock that I've found in openblas:
> 1- The test_thread open the library which call its constructor
> 2- The library's constructor create a thread
>`thread_that_use_tls_after_sleep`
> 3- The thread `thread_that_use_tls_after_sleep` sleep for 100ms (this
>needs to be enough so dl_close is called before the sleep ends)
> 3- The test_thread close the library with dl_close
> 4- dl_close lock `dl_load_lock` and call the library's destructor
> 5- The library's destructor wait `thread_that_use_tls_after_sleep` to
>finish
> 6- The `thread_that_use_tls_after_sleep` thread try to read the TLS
>variable which cause a call to `__tls_get_addr`
> 7- `__tls_get_addr` cause a deadlock in `tls_get_addr_tail` trying to
>lock the same `dl_load_lock` as dl_close does
> 8- Nothing happen because dl_close thread is waiting for the
>`thread_that_use_tls_after_sleep` thread to finish which having the
>lock and the latter thread try to lock the same lock as dl_close and
>so never exit.
> 
> See [1] for the stacktrace.
> 
> Thread 3 is the library's thread created in its constructor and joined
> in its destructor.
> Thread 2 is the thread that does dl_open and dl_close.
> Thread 1 is a "monitoring" thread to implement a timeout of 10s (useful
> if this tests need to run on a CI system)
> 
> Where dl_close lock the `dl_load_lock`: [2]
> Where tls_get_addr_tail lock the `dl_load_lock`: [3]
> 
> [0]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7
> [1]:
> https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7#file-gdb_stacktrace-txt
> [2]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-close.c#L812
> [3]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-tls.c#L761
> 

Related links:
https://bugzilla.redhat.com/show_bug.cgi?id=1409899
https://sourceware.org/bugzilla/show_bug.cgi?id=2377


Actually, the hang is caused by a C++ here, but that's the same deadlock
(the C++ exception require the `dl_load_lock´ lock).

It seems from the first link that using thread stuff in constructor and
destructor is risky and not well supported and that applications should
just avoid doing this.

I didn't find a really related bug in sourceware bugzilla, maybe we
should forward our bug to them ?

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2019-03-31 Thread Alexis Murzeau
Le 31/03/2019 à 15:19, Aurelien Jarno a écrit :
> This bug is very likely a bug present in old glibc versions. It has been
> brought to light when enabling TLS support in openblas and not by a new
> glibc version.
> 
> Right now the bug has been workarounded by disabling TLS support in
> openblas. The way to handle this bug is to write a small testcase that
> can be forwarded upstream. It's not an easy task though.
> 

Hi,

I've made a test case here [0].
I've not tested it against latest glibc commit.
But it does reproduce the deadlock with glibc 2.28 on Linux.

To run the test case, do this:
```
gcc test_compiler_tls.c -o test_compiler_tls -ldl -g -pthread
gcc test_compiler_tls_lib.c -shared -o test_compiler_tls_lib.so \
 -g -pthread -fPIC
./test_compiler_tls ./test_compiler_tls_lib &
gdb --pid $! -ex 'thr a a bt'
```

This reproduce the deadlock that I've found in openblas:
1- The test_thread open the library which call its constructor
2- The library's constructor create a thread
   `thread_that_use_tls_after_sleep`
3- The thread `thread_that_use_tls_after_sleep` sleep for 100ms (this
   needs to be enough so dl_close is called before the sleep ends)
3- The test_thread close the library with dl_close
4- dl_close lock `dl_load_lock` and call the library's destructor
5- The library's destructor wait `thread_that_use_tls_after_sleep` to
   finish
6- The `thread_that_use_tls_after_sleep` thread try to read the TLS
   variable which cause a call to `__tls_get_addr`
7- `__tls_get_addr` cause a deadlock in `tls_get_addr_tail` trying to
   lock the same `dl_load_lock` as dl_close does
8- Nothing happen because dl_close thread is waiting for the
   `thread_that_use_tls_after_sleep` thread to finish which having the
   lock and the latter thread try to lock the same lock as dl_close and
   so never exit.

See [1] for the stacktrace.

Thread 3 is the library's thread created in its constructor and joined
in its destructor.
Thread 2 is the thread that does dl_open and dl_close.
Thread 1 is a "monitoring" thread to implement a timeout of 10s (useful
if this tests need to run on a CI system)

Where dl_close lock the `dl_load_lock`: [2]
Where tls_get_addr_tail lock the `dl_load_lock`: [3]

[0]: https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7
[1]:
https://gist.github.com/amurzeau/26f045bdfea407528dd7de3102fb4be7#file-gdb_stacktrace-txt
[2]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-close.c#L812
[3]: https://github.com/bminor/glibc/blob/glibc-2.28/elf/dl-tls.c#L761

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2019-03-31 Thread Aurelien Jarno
This bug is very likely a bug present in old glibc versions. It has been
brought to light when enabling TLS support in openblas and not by a new
glibc version.

Right now the bug has been workarounded by disabling TLS support in
openblas. The way to handle this bug is to write a small testcase that
can be forwarded upstream. It's not an easy task though.

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net


signature.asc
Description: PGP signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-09-07 Thread Alexis Murzeau
Hi,

On 07/09/2018 16:57, Sébastien Villemot wrote:
> 
> I have just uploaded openblas 0.3.3+ds-1, which has TLS disabled.
> 
> I think this should fix the original issue, i.e. the gimp+openblas
> deadlock. Please let me know if this is not the case.
> 
> Best,
> 

Thanks for your update.

I tried to start gimp with this openblas version installed and it did
not crashed or hanged.

But there's still a possible crash that can occur, when I do a test that
does dl_open followed by dl_close of libopenblas, I get a segfault when
stopping the thread that does the dl_open/dl_close.

This crash doesn't seem to cause issues to gimp but might on some
machines (maybe no threads are used by gimp when indirectly loading
openblas and so the crash doesn't occur, but not sure at all).

More extensive information here: [0].

If no one object that gimp doesn't crash anymore with that 3.3 version,
maybe this bug can be closed (letting the crash of the dl_open/dl_close
test be handled by upstream only [0]).

[0] https://github.com/xianyi/OpenBLAS/issues/1720#issuecomment-418538099

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-09-07 Thread Sébastien Villemot
Le vendredi 10 août 2018 à 23:27 +0200, Alexis Murzeau a écrit :
> On 09/08/2018 00:22, Alexis Murzeau wrote:
> > On 08/08/2018 00:59, Alexis Murzeau wrote:
> > > severity 903514 important
> > > thanks
> > > 
> > > > Reassigning to glibc with affects on openblas and gimp as this is caused
> > > > by a deadlock inside glibc.

> > I've posted a issue on openblas upstream project [0] and they suggested
> > some solutions.
> > One of them is to disable the use of compiler supported TLS and instead
> > use pthreads.

I have just uploaded openblas 0.3.3+ds-1, which has TLS disabled.

I think this should fix the original issue, i.e. the gimp+openblas
deadlock. Please let me know if this is not the case.

Best,

-- 
⢀⣴⠾⠻⢶⣦⠀  Sébastien Villemot
⣾⠁⢠⠒⠀⣿⡁  Debian Developer
⢿⡄⠘⠷⠚⠋⠀  http://sebastien.villemot.name
⠈⠳⣄  http://www.debian.org


signature.asc
Description: This is a digitally signed message part


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-08-12 Thread Alexis Murzeau
On 12/08/2018 15:55, Jackie wrote:
> Great. It works. 
> To make it clear, this is what I did to test:
> 
> 1. Install libopenblas-base:amd64 (0.3.2+ds-1) and libopenblas-dev:amd64
> (0.3.2+ds-1) from official repo. After installation complete, open a
> terminal and type gimp + ENTER. The cmd hangs forever and Ctrl-C gets a
> segment fault.
> 2. sudo apt purge libopenblas-base libopenblas-dev. Upon finish, open a
> terminal and type gimp + ENTER. Gimp lanuches without problem.
> 3. Add the repo you offered and install libopenblas-base:amd64
> (0.3.2+ds-1.1~1patchTLS) and libopenblas-dev:amd64
> (0.3.2+ds-1.1~1patchTLS). The installations go well, after that, open a
> terminal and type gimp + ENTER.  Gimp again lanuches without any problem.
> 
> I think this validates both the existence of the bug and your solution.
> Thank you very much.

Thank you too for testing that patch :)
Good to see it works for others having the gimp hang too.

> 
> Jiang Jun


-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-08-12 Thread Jackie

Great. It works.
To make it clear, this is what I did to test:

1. Install libopenblas-base:amd64 (0.3.2+ds-1) and 
libopenblas-dev:amd64 (0.3.2+ds-1) from official repo. After 
installation complete, open a terminal and type gimp + ENTER. The cmd 
hangs forever and Ctrl-C gets a segment fault.
2. sudo apt purge libopenblas-base libopenblas-dev. Upon finish, open a 
terminal and type gimp + ENTER. Gimp lanuches without problem.
3. Add the repo you offered and install libopenblas-base:amd64 
(0.3.2+ds-1.1~1patchTLS) and libopenblas-dev:amd64 
(0.3.2+ds-1.1~1patchTLS). The installations go well, after that, open a 
terminal and type gimp + ENTER.  Gimp again lanuches without any 
problem.


I think this validates both the existence of the bug and your solution. 
Thank you very much.


Jiang Jun

On Sun, Aug 12, 2018 at 9:31 PM, Alexis Murzeau  
wrote:

On 10/08/2018 23:27, Alexis Murzeau wrote:
 I can provide a binary package that include this patch, but I'm not 
sure
 this is the best thing to do (I'm not the official maintainer, nor 
know

 a good place to upload it).



As I was requested to provide binary package to test the patch, I'm
providing this as version `0.3.2+ds-1.1~1patchTLS` built with sbuild 
at [0].


Instructions are on the same page [0]:
execute: `apt-key adv --keyserver hkps://hkps.pool.sks-keyservers.net
--recv-keys 0x3F7A2FA142E434FE06622560B05266B2EB68F001`
and add `deb https://amurzeau.github.io/apt-repository unstable main` 
to

sources.list.

This APT repository contains both the source and amd64 binary 
packages.

If you try it, please tell if it works.

[0] https://amurzeau.github.io/apt-repository/

--
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-08-12 Thread Alexis Murzeau
On 10/08/2018 23:27, Alexis Murzeau wrote:
> I can provide a binary package that include this patch, but I'm not sure
> this is the best thing to do (I'm not the official maintainer, nor know
> a good place to upload it).
> 

As I was requested to provide binary package to test the patch, I'm
providing this as version `0.3.2+ds-1.1~1patchTLS` built with sbuild at [0].

Instructions are on the same page [0]:
execute: `apt-key adv --keyserver hkps://hkps.pool.sks-keyservers.net
--recv-keys 0x3F7A2FA142E434FE06622560B05266B2EB68F001`
and add `deb https://amurzeau.github.io/apt-repository unstable main` to
sources.list.

This APT repository contains both the source and amd64 binary packages.
If you try it, please tell if it works.

[0] https://amurzeau.github.io/apt-repository/

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-08-10 Thread Alexis Murzeau
On 09/08/2018 00:22, Alexis Murzeau wrote:
> On 08/08/2018 00:59, Alexis Murzeau wrote:
>> severity 903514 important
>> thanks
>>
>>> Reassigning to glibc with affects on openblas and gimp as this is caused
>>> by a deadlock inside glibc.
>>
>> Done.
>>
>> Lowering severity as this does not render any package unusable by
>> themselves, but only a combination of them (GIMP + OpenBLAS).
>>
>> I think a workaround solution against GIMP OpenBLAS should be done as
>> I'm not sure a good solution will emerge in glibc given attempts done in
>> the past. The work to be done seems non trivial.
>>
>> My though on possible solutions:
>>  * Add a breaks between GIMP and OpenBLAS
>>  * Disable TLS in OpenBLAS build (if possible, but this would cause a
>> performance loss for users that use OpenBLAS without gimp)
>>  * Add a delay in GIMP to not load then close libraries too fast (so
>> OpenBLAS threads are fully initialized when dl_close is called on it)
>>
> 
> Hi,
> 
> I've posted a issue on openblas upstream project [0] and they suggested
> some solutions.
> One of them is to disable the use of compiler supported TLS and instead
> use pthreads.
> 
> I tested this and it seems to fix deadlocks while starting gimp (I tried
> without arguments, with a non existing file and with an existing file).
> 
> I've pushed a merge request with the patch at [1].
> I've also asked openblas upstream if this patch could be a good solution.
> 
> In that case would it be possible to have this patch tested for ones who
> have major instabilities with gimp + openblas ?
> 
> Thanks :)
> 

Hi,

I've updated the merge request [0] with the upstream proposed patch [1].

@openblas maintainers, maybe someone can build a package with this patch
and upload to experimental so others can check if gimp works fine with it ?

I've myself tested it and gimp does not deadlock.

I can provide a binary package that include this patch, but I'm not sure
this is the best thing to do (I'm not the official maintainer, nor know
a good place to upload it).

[0] https://salsa.debian.org/science-team/openblas/merge_requests/1
[1] https://github.com/xianyi/OpenBLAS/pull/1726

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-08-08 Thread Alexis Murzeau
On 08/08/2018 00:59, Alexis Murzeau wrote:
> severity 903514 important
> thanks
> 
>> Reassigning to glibc with affects on openblas and gimp as this is caused
>> by a deadlock inside glibc.
> 
> Done.
> 
> Lowering severity as this does not render any package unusable by
> themselves, but only a combination of them (GIMP + OpenBLAS).
> 
> I think a workaround solution against GIMP OpenBLAS should be done as
> I'm not sure a good solution will emerge in glibc given attempts done in
> the past. The work to be done seems non trivial.
> 
> My though on possible solutions:
>  * Add a breaks between GIMP and OpenBLAS
>  * Disable TLS in OpenBLAS build (if possible, but this would cause a
> performance loss for users that use OpenBLAS without gimp)
>  * Add a delay in GIMP to not load then close libraries too fast (so
> OpenBLAS threads are fully initialized when dl_close is called on it)
> 

Hi,

I've posted a issue on openblas upstream project [0] and they suggested
some solutions.
One of them is to disable the use of compiler supported TLS and instead
use pthreads.

I tested this and it seems to fix deadlocks while starting gimp (I tried
without arguments, with a non existing file and with an existing file).

I've pushed a merge request with the patch at [1].
I've also asked openblas upstream if this patch could be a good solution.

In that case would it be possible to have this patch tested for ones who
have major instabilities with gimp + openblas ?

Thanks :)

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature


Bug#903514: Deadlock in _dl_close join-ing threads accessing TLS (was Re: gimp won't launch)

2018-08-07 Thread Alexis Murzeau
severity 903514 important
thanks

> Reassigning to glibc with affects on openblas and gimp as this is caused
> by a deadlock inside glibc.

Done.

Lowering severity as this does not render any package unusable by
themselves, but only a combination of them (GIMP + OpenBLAS).

I think a workaround solution against GIMP OpenBLAS should be done as
I'm not sure a good solution will emerge in glibc given attempts done in
the past. The work to be done seems non trivial.

My though on possible solutions:
 * Add a breaks between GIMP and OpenBLAS
 * Disable TLS in OpenBLAS build (if possible, but this would cause a
performance loss for users that use OpenBLAS without gimp)
 * Add a delay in GIMP to not load then close libraries too fast (so
OpenBLAS threads are fully initialized when dl_close is called on it)

-- 
Alexis Murzeau
PGP: B7E6 0EBB 9293 7B06 BDBC  2787 E7BD 1904 F480 937F



signature.asc
Description: OpenPGP digital signature