Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On 18/07/2023 13:09, Corinna Vinschen wrote: On Jul 18 12:20, Jon Turney wrote: On 17/07/2023 16:41, Corinna Vinschen wrote: Looking into pthread::cancel we have this order of things: // cancel deferred mutex.unlock (); canceled = true; SetEvent (cancel_event); return 0; The canceled var is set before the SetEvent call. What if the thread is terminated after canceled is set to true but before SetEvent is called? pthread::testcancel claims: We check for the canceled flag first. [...] Only if the thread is marked as canceled, we wait for cancel_event being really set, on the off-chance that pthread_cancel gets interrupted before calling SetEvent. Neat idea to speed up the code, but doesn't that mean we have a potential deadlock, especially given that pthread::testcancel calls WFSO with an INFINITE timeout? I'm not sure I follow: another thread sets cancelled = true, just before we hit pthread::testcancel(), so we go into the WFSO, but then the other thread continues, signals cancel_event and everything's fine. What meaning are you assigning to "interrupted" here? Are we worried about the thread calling pthread_cancel being cancelled itself? Yes. My concern is if the thread gets terminated between setting canceled and setting the event object. Prior to commit 42faed412857, we didn't wait infinitely, just tested the event object. Only with adding the canceled variable, we (better: I) added the the infinite timeout. I don't see a real reason to do that. I think this should be changed to just checking the event object, see the below patch. I see now. Yes, this makes perfect sense. And if so, how do we fix this? Theoretically, the most simple solution might be to call SetEvent before setting the canceled variable, but in fact we would have to make setting canceld and cancel_event an atomic operation. Well, yeah, that is required for them to be coherent. But we have a mutex on the thread object for that purpose, and I don't quite see why it's released so early here. The mutex is not guarding canceled or the event object. Thus it's not used in testcancel either, otherwise introducing the canceled var to speed up stuff wouldn't have made any sense. Corinna commit 518e5e46f064de41d3ef6d6ef743e2e760a46282 Author: Corinna Vinschen AuthorDate: Mon Jul 17 18:02:04 2023 +0200 Commit: Corinna Vinschen CommitDate: Tue Jul 18 10:11:30 2023 +0200 Cygwin: don't wait infinitely on a pthread cancel event Starting with commit 42faed412857 ("* thread.h (class pthread): Add bool member canceled."), pthread::testcancel waits infinitely on cancel_event after it checked if the canceled variable is set. However, this might introduce a deadlock, if the thread calling pthread_cancel is terminated after setting canceled to true, but before calling SetEvent on cancel_event. In fact, it's not at all necessary to wait infinitely. By definition, the thread is only canceled if cancel_event is set. The canceled variable is just a helper to speed up code. We can safely assume that the thread hasn't been canceled yet, if canceled is set, but cancel_event isn't. Fixes: 42faed412857 ("* thread.h (class pthread): Add bool member canceled.") Signed-off-by: Corinna Vinschen diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc index f614e01c42f6..21e89e146e0a 100644 --- a/winsup/cygwin/thread.cc +++ b/winsup/cygwin/thread.cc @@ -961,12 +961,9 @@ pthread::testcancel () pthread_testcancel function a lot without adding the overhead of an OS call. Only if the thread is marked as canceled, we wait for cancel_event being really set, on the off-chance that pthread_cancel - gets interrupted before calling SetEvent. */ - if (canceled) -{ - WaitForSingleObject (cancel_event, INFINITE); - cancel_self (); -} + gets interrupted or terminated before calling SetEvent. */ + if (canceled && IsEventSignalled (cancel_event)) +cancel_self (); } /* Return cancel event handle if it exists *and* cancel is not disabled.
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 18 12:20, Jon Turney wrote: > On 17/07/2023 16:41, Corinna Vinschen wrote: > > > Looking into pthread::cancel we have this order of things: > > > > > > // cancel deferred > > > mutex.unlock (); > > > canceled = true; > > > SetEvent (cancel_event); > > > return 0; > > > > > > The canceled var is set before the SetEvent call. > > > What if the thread is terminated after canceled is set to true but > > > before SetEvent is called? > > > > > > pthread::testcancel claims: > > > > > >We check for the canceled flag first. [...] > > >Only if the thread is marked as canceled, we wait for cancel_event > > >being really set, on the off-chance that pthread_cancel gets > > >interrupted before calling SetEvent. > > > > > > Neat idea to speed up the code, but doesn't that mean we have a > > > potential deadlock, especially given that pthread::testcancel calls WFSO > > > with an INFINITE timeout? > > I'm not sure I follow: another thread sets cancelled = true, just before we > hit pthread::testcancel(), so we go into the WFSO, but then the other thread > continues, signals cancel_event and everything's fine. > > What meaning are you assigning to "interrupted" here? > > Are we worried about the thread calling pthread_cancel being cancelled > itself? Yes. My concern is if the thread gets terminated between setting canceled and setting the event object. Prior to commit 42faed412857, we didn't wait infinitely, just tested the event object. Only with adding the canceled variable, we (better: I) added the the infinite timeout. I don't see a real reason to do that. I think this should be changed to just checking the event object, see the below patch. > > > And if so, how do we fix this? Theoretically, the most simple > > > solution might be to call SetEvent before setting the canceled > > > variable, but in fact we would have to make setting canceld > > > and cancel_event an atomic operation. > > Well, yeah, that is required for them to be coherent. But we have a mutex on > the thread object for that purpose, and I don't quite see why it's released > so early here. The mutex is not guarding canceled or the event object. Thus it's not used in testcancel either, otherwise introducing the canceled var to speed up stuff wouldn't have made any sense. Corinna commit 518e5e46f064de41d3ef6d6ef743e2e760a46282 Author: Corinna Vinschen AuthorDate: Mon Jul 17 18:02:04 2023 +0200 Commit: Corinna Vinschen CommitDate: Tue Jul 18 10:11:30 2023 +0200 Cygwin: don't wait infinitely on a pthread cancel event Starting with commit 42faed412857 ("* thread.h (class pthread): Add bool member canceled."), pthread::testcancel waits infinitely on cancel_event after it checked if the canceled variable is set. However, this might introduce a deadlock, if the thread calling pthread_cancel is terminated after setting canceled to true, but before calling SetEvent on cancel_event. In fact, it's not at all necessary to wait infinitely. By definition, the thread is only canceled if cancel_event is set. The canceled variable is just a helper to speed up code. We can safely assume that the thread hasn't been canceled yet, if canceled is set, but cancel_event isn't. Fixes: 42faed412857 ("* thread.h (class pthread): Add bool member canceled.") Signed-off-by: Corinna Vinschen diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc index f614e01c42f6..21e89e146e0a 100644 --- a/winsup/cygwin/thread.cc +++ b/winsup/cygwin/thread.cc @@ -961,12 +961,9 @@ pthread::testcancel () pthread_testcancel function a lot without adding the overhead of an OS call. Only if the thread is marked as canceled, we wait for cancel_event being really set, on the off-chance that pthread_cancel - gets interrupted before calling SetEvent. */ - if (canceled) -{ - WaitForSingleObject (cancel_event, INFINITE); - cancel_self (); -} + gets interrupted or terminated before calling SetEvent. */ + if (canceled && IsEventSignalled (cancel_event)) +cancel_self (); } /* Return cancel event handle if it exists *and* cancel is not disabled.
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On 17/07/2023 16:41, Corinna Vinschen wrote: On Jul 17 16:21, Corinna Vinschen wrote: On Jul 17 12:51, Jon Turney wrote: On 17/07/2023 12:05, Corinna Vinschen wrote: diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc index f614e01c42f6..fceb9bda1806 100644 --- a/winsup/cygwin/thread.cc +++ b/winsup/cygwin/thread.cc @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr) class pthread *thread = this; _cygtls *tls = cygtls; /* Save cygtls before deleting this. */ + /* Deferred cancellation still pending? */ + if (canceled) +{ + WaitForSingleObject (cancel_event, INFINITE); + value_ptr = PTHREAD_CANCELED; +} + // run cleanup handlers pop_all_cleanup_handlers (); What do you think? I mean, by your own interpretation of the standard, this isn't required, because we're allowed to take arbitrarily long to deliver the async cancellation, and in this case, we took so long that the thread exited before it happened, too bad... True enough! It doesn't seem a bad addition, Actually, it seems we actually *have* to do this. I just searched for more info on that problem and, to my surprise, I found this in the most obvious piece of documentation: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_exit.html Quote: As the meaning of the status is determined by the application (except when the thread has been canceled, in which case it is PTHREAD_CANCELED), [...] On second thought... One thing bugging me is this: This is still a bit fuzzy, though. I'd appreciate any input. Looking into pthread::cancel we have this order of things: // cancel deferred mutex.unlock (); canceled = true; SetEvent (cancel_event); return 0; The canceled var is set before the SetEvent call. What if the thread is terminated after canceled is set to true but before SetEvent is called? pthread::testcancel claims: We check for the canceled flag first. [...] Only if the thread is marked as canceled, we wait for cancel_event being really set, on the off-chance that pthread_cancel gets interrupted before calling SetEvent. Neat idea to speed up the code, but doesn't that mean we have a potential deadlock, especially given that pthread::testcancel calls WFSO with an INFINITE timeout? I'm not sure I follow: another thread sets cancelled = true, just before we hit pthread::testcancel(), so we go into the WFSO, but then the other thread continues, signals cancel_event and everything's fine. What meaning are you assigning to "interrupted" here? Are we worried about the thread calling pthread_cancel being cancelled itself? And if so, how do we fix this? Theoretically, the most simple solution might be to call SetEvent before setting the canceled variable, but in fact we would have to make setting canceld and cancel_event an atomic operation. Well, yeah, that is required for them to be coherent. But we have a mutex on the thread object for that purpose, and I don't quite see why it's released so early here.
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 17 17:41, Corinna Vinschen wrote: > On Jul 17 16:21, Corinna Vinschen wrote: > > On Jul 17 12:51, Jon Turney wrote: > > > On 17/07/2023 12:05, Corinna Vinschen wrote: > > > > diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc > > > > index f614e01c42f6..fceb9bda1806 100644 > > > > --- a/winsup/cygwin/thread.cc > > > > +++ b/winsup/cygwin/thread.cc > > > > @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr) > > > > class pthread *thread = this; > > > > _cygtls *tls = cygtls; /* Save cygtls before deleting this. */ > > > > + /* Deferred cancellation still pending? */ > > > > + if (canceled) > > > > +{ > > > > + WaitForSingleObject (cancel_event, INFINITE); > > > > + value_ptr = PTHREAD_CANCELED; > > > > +} > > > > + > > > > // run cleanup handlers > > > > pop_all_cleanup_handlers (); > > > > What do you think? > > > > > > I mean, by your own interpretation of the standard, this isn't required, > > > because we're allowed to take arbitrarily long to deliver the async > > > cancellation, and in this case, we took so long that the thread exited > > > before it happened, too bad... > > > > True enough! > > > > > It doesn't seem a bad addition, > > > Actually, it seems we actually *have* to do this. I just searched > for more info on that problem and, to my surprise, I found this in the > most obvious piece of documentation: > > https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_exit.html > > Quote: > > As the meaning of the status is determined by the application (except > when the thread has been canceled, in which case it is > PTHREAD_CANCELED), [...] FTR, apparently I have overinterpreted this sentence. I performed the following crude test on Linux,, the idea being to call pthread_cancel and then pthread_exit without hitting a cancallation point in between. cat > pt.c < #include #include #include #include #include int marker = 0; void * thread (void *arg) { int oldval; pthread_setcanceltype (PTHREAD_CANCEL_DEFERRED, ); pthread_setcancelstate (PTHREAD_CANCEL_ENABLE, ); marker = 1; while (marker < 2) ; pthread_exit ((void *) 42); } int main () { int error; pthread_t pt; void *retval; if ((error = pthread_create (, NULL, thread, NULL))) { printf ("pthread_create: %d %s\n", error, strerror (error)); return 1; } while (marker < 1) ; if ((error = pthread_cancel (pt))) { marker = 2; printf ("pthread_cancel: %d %s\n", error, strerror (error)); pthread_detach (pt); return 1; } marker = 2; if ((error = pthread_join (pt, ))) { printf ("pthread_join: %d %s\n", error, strerror (error)); pthread_detach (pt); return 1; } printf ("retval = %ld (%d)\n", (uintptr_t) retval, retval == PTHREAD_CANCELED); return 0; } EOF $ gcc -g -o pt pt.c -lpthread $ ./pt retval = 42 (0) So retval is the one set by the application, not PTHREAD_CANCELED, despite the pthread_cancel call. Looks like handling cancellation inside pthread_exit is really not the right thing to do... Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 17 16:21, Corinna Vinschen wrote: > On Jul 17 12:51, Jon Turney wrote: > > On 17/07/2023 12:05, Corinna Vinschen wrote: > > > diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc > > > index f614e01c42f6..fceb9bda1806 100644 > > > --- a/winsup/cygwin/thread.cc > > > +++ b/winsup/cygwin/thread.cc > > > @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr) > > > class pthread *thread = this; > > > _cygtls *tls = cygtls;/* Save cygtls before deleting this. */ > > > + /* Deferred cancellation still pending? */ > > > + if (canceled) > > > +{ > > > + WaitForSingleObject (cancel_event, INFINITE); > > > + value_ptr = PTHREAD_CANCELED; > > > +} > > > + > > > // run cleanup handlers > > > pop_all_cleanup_handlers (); > > > What do you think? > > > > I mean, by your own interpretation of the standard, this isn't required, > > because we're allowed to take arbitrarily long to deliver the async > > cancellation, and in this case, we took so long that the thread exited > > before it happened, too bad... > > True enough! > > > It doesn't seem a bad addition, > Actually, it seems we actually *have* to do this. I just searched for more info on that problem and, to my surprise, I found this in the most obvious piece of documentation: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_exit.html Quote: As the meaning of the status is determined by the application (except when the thread has been canceled, in which case it is PTHREAD_CANCELED), [...] > On second thought... > > One thing bugging me is this: This is still a bit fuzzy, though. I'd appreciate any input. > Looking into pthread::cancel we have this order of things: > > // cancel deferred > mutex.unlock (); > canceled = true; > SetEvent (cancel_event); > return 0; > > The canceled var is set before the SetEvent call. > What if the thread is terminated after canceled is set to true but > before SetEvent is called? > > pthread::testcancel claims: > > We check for the canceled flag first. [...] > Only if the thread is marked as canceled, we wait for cancel_event > being really set, on the off-chance that pthread_cancel gets > interrupted before calling SetEvent. > > Neat idea to speed up the code, but doesn't that mean we have a > potential deadlock, especially given that pthread::testcancel calls WFSO > with an INFINITE timeout? > > And if so, how do we fix this? Theoretically, the most simple > solution might be to call SetEvent before setting the canceled > variable, but in fact we would have to make setting canceld > and cancel_event an atomic operation. > > Another idea is never to wait for an INFINITE time. Logically, if > canceled is set and cancel_event isn't, the thread just hasn't been > canceled yet. > > Any better idea? > > > but this turns pthread_exit() into a > > deferred cancellation point as well, so it should be added to the list of > > "an implementation may also mark other functions not specified in the > > standard as cancellation points" in our documentation^W the huge comment in > > threads.cc. > > Yes, indeed. Thanks, Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 17 16:04, Corinna Vinschen wrote: > On Jul 17 12:51, Jon Turney wrote: > > Perhaps there is a better way to write a test that async cancellation works > > in the absence of cancellation points, but it eludes me... > > Same here, so just go ahead. Actually, it's not just that. I think this is the right thing to do. Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 17 12:51, Jon Turney wrote: > On 17/07/2023 12:05, Corinna Vinschen wrote: > > On Jul 14 20:57, Corinna Vinschen wrote: > > > What if Cygwin checks for a deferred cancellation in pthread::exit, > > > too? It needs to do this by its own, not calling pthread::testcancel, > > > otherwise we're in an infinite loop. Since cancel is basically like > > > exit, just with a PTHREAD_CANCELED return value, the only additional > > > action would be to set retval to PTHREAD_CANCELED explicitely. > > > > Kind of like this: > > > > diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc > > index f614e01c42f6..fceb9bda1806 100644 > > --- a/winsup/cygwin/thread.cc > > +++ b/winsup/cygwin/thread.cc > > @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr) > > class pthread *thread = this; > > _cygtls *tls = cygtls; /* Save cygtls before deleting this. */ > > + /* Deferred cancellation still pending? */ > > + if (canceled) > > +{ > > + WaitForSingleObject (cancel_event, INFINITE); > > + value_ptr = PTHREAD_CANCELED; > > +} > > + > > // run cleanup handlers > > pop_all_cleanup_handlers (); > > What do you think? > > I mean, by your own interpretation of the standard, this isn't required, > because we're allowed to take arbitrarily long to deliver the async > cancellation, and in this case, we took so long that the thread exited > before it happened, too bad... True enough! > It doesn't seem a bad addition, On second thought... One thing bugging me is this: Looking into pthread::cancel we have this order of things: // cancel deferred mutex.unlock (); canceled = true; SetEvent (cancel_event); return 0; The canceled var is set before the SetEvent call. What if the thread is terminated after canceled is set to true but before SetEvent is called? pthread::testcancel claims: We check for the canceled flag first. [...] Only if the thread is marked as canceled, we wait for cancel_event being really set, on the off-chance that pthread_cancel gets interrupted before calling SetEvent. Neat idea to speed up the code, but doesn't that mean we have a potential deadlock, especially given that pthread::testcancel calls WFSO with an INFINITE timeout? And if so, how do we fix this? Theoretically, the most simple solution might be to call SetEvent before setting the canceled variable, but in fact we would have to make setting canceld and cancel_event an atomic operation. Another idea is never to wait for an INFINITE time. Logically, if canceled is set and cancel_event isn't, the thread just hasn't been canceled yet. Any better idea? > but this turns pthread_exit() into a > deferred cancellation point as well, so it should be added to the list of > "an implementation may also mark other functions not specified in the > standard as cancellation points" in our documentation^W the huge comment in > threads.cc. Yes, indeed. Thanks, Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 17 12:51, Jon Turney wrote: > On 14/07/2023 14:04, Jon Turney wrote: > > On 13/07/2023 19:53, Corinna Vinschen wrote: > > > > > > normally after 10 seconds. (See the commentary in pthread::cancel() > > > > > > in > > > > > > thread.cc, where it checks if the target thread is inside the > > > > > > kernel, > > > > > > and silently converts the cancellation into a deferred one) > > > > > > > > > > Nevertheless, I think this is ok to do. The description of > > > > > pthread_cancel > > > > > contains this: > > > > > > > > > > Asynchronous cancelability means that the thread can be canceled at > > > > > any time (usually immediately, but the system does not > > > > > guarantee this). > > > > > > > > > > And > > > > > > > > > > The above steps happen asynchronously with respect to the > > > > > pthread_cancel() call; the return status of pthread_cancel() merely > > > > > informs the caller whether the cancellation request was > > > > > successfully > > > > > queued. > > > > > > > > > > So any assumption *when* the cancallation takes place is may be wrong. > > > > Yeah. > > > > I think the flakiness is when we happen to try to async cancel while in > > the Windows kernel, which implicitly converts to a deferred > > cancellation, but there are no cancellation points in the the thread, so > > it arrives at pthread_exit() and returns a exit code other than > > PTHREAD_CANCELED. > > > > I did consider making the test non-flaky by adding a final call to > > pthread_testcancel(), to notice any failed async cancellation which has > > been converted to a deferred one. > > > > But then that is just the same as the deferred cancellation tests, and > > confirms the cancellation happens, but not that it's async, which is > > part of the point of the test. > > > > I guess this could also check that not all of the threads ran for all 10 > > seconds, which would indicate that at least some of them were cancelled > > asynchronously. > > I wrote this, attached, which does indeed make this test more reliable. > > You point is well made that this is making assumptions about how quickly > async cancellation works that are not required by the standard > > (It would be a valid, if strange implementation, if async cancellation > always took 10 seconds to take effect, which this test assumes isn't the > case) > > Perhaps there is a better way to write a test that async cancellation works > in the absence of cancellation points, but it eludes me... Same here, so just go ahead. Thanks, Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On 17/07/2023 12:05, Corinna Vinschen wrote: On Jul 14 20:57, Corinna Vinschen wrote: On Jul 14 14:04, Jon Turney wrote: On 13/07/2023 19:53, Corinna Vinschen wrote: Nevertheless, I think this is ok to do. The description of pthread_cancel contains this: Asynchronous cancelability means that the thread can be canceled at any time (usually immediately, but the system does not guarantee this). And The above steps happen asynchronously with respect to the pthread_cancel() call; the return status of pthread_cancel() merely informs the caller whether the cancellation request was successfully queued. So any assumption *when* the cancallation takes place is may be wrong. Yeah. I think the flakiness is when we happen to try to async cancel while in the Windows kernel, which implicitly converts to a deferred cancellation, but there are no cancellation points in the the thread, so it arrives at pthread_exit() and returns a exit code other than PTHREAD_CANCELED. In pthread_join(), right? I did consider making the test non-flaky by adding a final call to pthread_testcancel(), to notice any failed async cancellation which has been converted to a deferred one. But then that is just the same as the deferred cancellation tests, and confirms the cancellation happens, but not that it's async, which is part of the point of the test. What if Cygwin checks for a deferred cancellation in pthread::exit, too? It needs to do this by its own, not calling pthread::testcancel, otherwise we're in an infinite loop. Since cancel is basically like exit, just with a PTHREAD_CANCELED return value, the only additional action would be to set retval to PTHREAD_CANCELED explicitely. Kind of like this: diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc index f614e01c42f6..fceb9bda1806 100644 --- a/winsup/cygwin/thread.cc +++ b/winsup/cygwin/thread.cc @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr) class pthread *thread = this; _cygtls *tls = cygtls; /* Save cygtls before deleting this. */ + /* Deferred cancellation still pending? */ + if (canceled) +{ + WaitForSingleObject (cancel_event, INFINITE); + value_ptr = PTHREAD_CANCELED; +} + // run cleanup handlers pop_all_cleanup_handlers (); What do you think? I mean, by your own interpretation of the standard, this isn't required, because we're allowed to take arbitrarily long to deliver the async cancellation, and in this case, we took so long that the thread exited before it happened, too bad... It doesn't seem a bad addition, but this turns pthread_exit() into a deferred cancellation point as well, so it should be added to the list of "an implementation may also mark other functions not specified in the standard as cancellation points" in our documentation^W the huge comment in threads.cc.
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On 14/07/2023 14:04, Jon Turney wrote: On 13/07/2023 19:53, Corinna Vinschen wrote: normally after 10 seconds. (See the commentary in pthread::cancel() in thread.cc, where it checks if the target thread is inside the kernel, and silently converts the cancellation into a deferred one) Nevertheless, I think this is ok to do. The description of pthread_cancel contains this: Asynchronous cancelability means that the thread can be canceled at any time (usually immediately, but the system does not guarantee this). And The above steps happen asynchronously with respect to the pthread_cancel() call; the return status of pthread_cancel() merely informs the caller whether the cancellation request was successfully queued. So any assumption *when* the cancallation takes place is may be wrong. Yeah. I think the flakiness is when we happen to try to async cancel while in the Windows kernel, which implicitly converts to a deferred cancellation, but there are no cancellation points in the the thread, so it arrives at pthread_exit() and returns a exit code other than PTHREAD_CANCELED. I did consider making the test non-flaky by adding a final call to pthread_testcancel(), to notice any failed async cancellation which has been converted to a deferred one. But then that is just the same as the deferred cancellation tests, and confirms the cancellation happens, but not that it's async, which is part of the point of the test. I guess this could also check that not all of the threads ran for all 10 seconds, which would indicate that at least some of them were cancelled asynchronously. I wrote this, attached, which does indeed make this test more reliable. You point is well made that this is making assumptions about how quickly async cancellation works that are not required by the standard (It would be a valid, if strange implementation, if async cancellation always took 10 seconds to take effect, which this test assumes isn't the case) Perhaps there is a better way to write a test that async cancellation works in the absence of cancellation points, but it eludes me... From d0023fa3ea1e8f29e80d473ab13d8200bdd2dc3a Mon Sep 17 00:00:00 2001 From: Jon Turney Date: Sat, 15 Jul 2023 17:57:43 +0100 Subject: [PATCH] Cygwin: testsuite: Make cancel3 and cancel5 more robust Despite our efforts, sometimes the async cancellation gets deferred. Notice this by calling pthread_testcancel(), and then try to work out if async cancellation was ever successful by checking if all threads ran for the full 10 seconds, or if some were stopped early. --- winsup/testsuite/winsup.api/pthread/cancel3.c | 16 +++- winsup/testsuite/winsup.api/pthread/cancel5.c | 14 ++ 2 files changed, 29 insertions(+), 1 deletion(-) diff --git a/winsup/testsuite/winsup.api/pthread/cancel3.c b/winsup/testsuite/winsup.api/pthread/cancel3.c index 07feb7c9b..075f052cc 100644 --- a/winsup/testsuite/winsup.api/pthread/cancel3.c +++ b/winsup/testsuite/winsup.api/pthread/cancel3.c @@ -92,6 +92,9 @@ mythread(void * arg) } } + /* Notice if asynchronous cancel got deferred */ + pthread_testcancel(); + return result; } @@ -101,6 +104,7 @@ main() int failed = 0; int i; pthread_t t[NUMTHREADS + 1]; + int ran_to_completion = 0; assert((t[0] = pthread_self()) != NULL); @@ -130,7 +134,7 @@ main() * Standard check that all threads started. */ for (i = 1; i <= NUMTHREADS; i++) -{ +{ if (!threadbag[i].started) { failed |= !threadbag[i].started; @@ -166,9 +170,19 @@ main() threadbag[i].count, result); } + + if (threadbag[i].count >= 10) + ran_to_completion++; + failed = (failed || fail); } + if (ran_to_completion >= 10) +{ + fprintf(stderr, "All threads ran to completion, async cancellation never happened\n"); + failed = TRUE; +} + assert(!failed); /* diff --git a/winsup/testsuite/winsup.api/pthread/cancel5.c b/winsup/testsuite/winsup.api/pthread/cancel5.c index 999b3c95c..23c02afe4 100644 --- a/winsup/testsuite/winsup.api/pthread/cancel5.c +++ b/winsup/testsuite/winsup.api/pthread/cancel5.c @@ -93,6 +93,9 @@ mythread(void * arg) } } + /* Notice if asynchronous cancel got deferred */ + pthread_testcancel(); + return result; } @@ -102,6 +105,7 @@ main() int failed = 0; int i; pthread_t t[NUMTHREADS + 1]; + int ran_to_completion = 0; for (i = 1; i <= NUMTHREADS; i++) { @@ -165,9 +169,19 @@ main() threadbag[i].count, result); } + + if (threadbag[i].count >= 10) + ran_to_completion++; + failed = (failed || fail); } + if (ran_to_completion >= 10) +{ + fprintf(stderr, "All threads ran to completion, async cancellation never happened\n"); + failed = TRUE; +} + assert(!failed); /* -- 2.39.0
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 14 20:57, Corinna Vinschen wrote: > On Jul 14 14:04, Jon Turney wrote: > > On 13/07/2023 19:53, Corinna Vinschen wrote: > > > > > Nevertheless, I think this is ok to do. The description of > > > > > pthread_cancel > > > > > contains this: > > > > > > > > > >Asynchronous cancelability means that the thread can be canceled at > > > > >any time (usually immediately, but the system does not guarantee > > > > > this). > > > > > > > > > > And > > > > > > > > > >The above steps happen asynchronously with respect to the > > > > >pthread_cancel() call; the return status of pthread_cancel() merely > > > > >informs the caller whether the cancellation request was > > > > > successfully > > > > >queued. > > > > > > > > > > So any assumption *when* the cancallation takes place is may be wrong. > > > > Yeah. > > > > I think the flakiness is when we happen to try to async cancel while in the > > Windows kernel, which implicitly converts to a deferred cancellation, but > > there are no cancellation points in the the thread, so it arrives at > > pthread_exit() and returns a exit code other than PTHREAD_CANCELED. > > In pthread_join(), right? > > > I did consider making the test non-flaky by adding a final call to > > pthread_testcancel(), to notice any failed async cancellation which has been > > converted to a deferred one. > > > > But then that is just the same as the deferred cancellation tests, and > > confirms the cancellation happens, but not that it's async, which is part of > > the point of the test. > > What if Cygwin checks for a deferred cancellation in pthread::exit, > too? It needs to do this by its own, not calling pthread::testcancel, > otherwise we're in an infinite loop. Since cancel is basically like > exit, just with a PTHREAD_CANCELED return value, the only additional > action would be to set retval to PTHREAD_CANCELED explicitely. Kind of like this: diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc index f614e01c42f6..fceb9bda1806 100644 --- a/winsup/cygwin/thread.cc +++ b/winsup/cygwin/thread.cc @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr) class pthread *thread = this; _cygtls *tls = cygtls; /* Save cygtls before deleting this. */ + /* Deferred cancellation still pending? */ + if (canceled) +{ + WaitForSingleObject (cancel_event, INFINITE); + value_ptr = PTHREAD_CANCELED; +} + // run cleanup handlers pop_all_cleanup_handlers (); What do you think? Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 14 14:04, Jon Turney wrote: > On 13/07/2023 19:53, Corinna Vinschen wrote: > > > > Nevertheless, I think this is ok to do. The description of > > > > pthread_cancel > > > > contains this: > > > > > > > >Asynchronous cancelability means that the thread can be canceled at > > > >any time (usually immediately, but the system does not guarantee > > > > this). > > > > > > > > And > > > > > > > >The above steps happen asynchronously with respect to the > > > >pthread_cancel() call; the return status of pthread_cancel() merely > > > >informs the caller whether the cancellation request was successfully > > > >queued. > > > > > > > > So any assumption *when* the cancallation takes place is may be wrong. > > Yeah. > > I think the flakiness is when we happen to try to async cancel while in the > Windows kernel, which implicitly converts to a deferred cancellation, but > there are no cancellation points in the the thread, so it arrives at > pthread_exit() and returns a exit code other than PTHREAD_CANCELED. In pthread_join(), right? > I did consider making the test non-flaky by adding a final call to > pthread_testcancel(), to notice any failed async cancellation which has been > converted to a deferred one. > > But then that is just the same as the deferred cancellation tests, and > confirms the cancellation happens, but not that it's async, which is part of > the point of the test. What if Cygwin checks for a deferred cancellation in pthread::exit, too? It needs to do this by its own, not calling pthread::testcancel, otherwise we're in an infinite loop. Since cancel is basically like exit, just with a PTHREAD_CANCELED return value, the only additional action would be to set retval to PTHREAD_CANCELED explicitely. Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On 13/07/2023 19:53, Corinna Vinschen wrote: On Jul 13 20:37, Corinna Vinschen wrote: On Jul 13 20:16, Corinna Vinschen wrote: On Jul 13 12:39, Jon Turney wrote: These tests async thread cancellation of a thread that doesn't have any cancellation points. Unfortunately, since 450f557f the async cancellation silently fails when the thread is inside the kernel function Sleep(), so it just exits I'm not sure how this patch should be the actual culprit. It only handles thread priorities, not thread cancellation. Isn't this rather 2b165a453ea7b or some such? Yeah, I messed up there somehow. I will fix the commit id. normally after 10 seconds. (See the commentary in pthread::cancel() in thread.cc, where it checks if the target thread is inside the kernel, and silently converts the cancellation into a deferred one) Nevertheless, I think this is ok to do. The description of pthread_cancel contains this: Asynchronous cancelability means that the thread can be canceled at any time (usually immediately, but the system does not guarantee this). And The above steps happen asynchronously with respect to the pthread_cancel() call; the return status of pthread_cancel() merely informs the caller whether the cancellation request was successfully queued. So any assumption *when* the cancallation takes place is may be wrong. Yeah. I think the flakiness is when we happen to try to async cancel while in the Windows kernel, which implicitly converts to a deferred cancellation, but there are no cancellation points in the the thread, so it arrives at pthread_exit() and returns a exit code other than PTHREAD_CANCELED. I did consider making the test non-flaky by adding a final call to pthread_testcancel(), to notice any failed async cancellation which has been converted to a deferred one. But then that is just the same as the deferred cancellation tests, and confirms the cancellation happens, but not that it's async, which is part of the point of the test. I guess this could also check that not all of the threads ran for all 10 seconds, which would indicate that at least some of them were cancelled asynchronously. I wonder, though, if we can't come up with a better solution than just waiting for the next cancellation point. No solution comes to mind if the user code calls a Win32 function, but maybe _sigbe could check if the thread's cancel_event is set? It's all in assembler, that complicates it a bit, but that would make it at least working for POSIX functions which are no cancellation points. I think you'd need to record the "pending async cancellation" as different to "deferred cancellation" so this doesn't turn everything into a cancellation point. Alternatively, maybe we can utilize the existing signal handler and just send a Cygwin-only signal outside the maskable signal range. wait_sig calls sigpacket::process like for any other standard signal, The signal handler is basically pthread::static_cancel_self(). Something like that. I'm not sure this is worth lots of effort, as thread cancellation seems to be regarded as mis-specified in such as way as to make it unsafe for serious use.
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 13 20:37, Corinna Vinschen wrote: > On Jul 13 20:16, Corinna Vinschen wrote: > > On Jul 13 12:39, Jon Turney wrote: > > > These tests async thread cancellation of a thread that doesn't have any > > > cancellation points. > > > > > > Unfortunately, since 450f557f the async cancellation silently fails when > > > the thread is inside the kernel function Sleep(), so it just exits > > > > I'm not sure how this patch should be the actual culprit. It only > > handles thread priorities, not thread cancellation. Isn't this rather > > 2b165a453ea7b or some such? > > > > > normally after 10 seconds. (See the commentary in pthread::cancel() in > > > thread.cc, where it checks if the target thread is inside the kernel, > > > and silently converts the cancellation into a deferred one) > > > > Nevertheless, I think this is ok to do. The description of pthread_cancel > > contains this: > > > > Asynchronous cancelability means that the thread can be canceled at > > any time (usually immediately, but the system does not guarantee this). > > > > And > > > > The above steps happen asynchronously with respect to the > > pthread_cancel() call; the return status of pthread_cancel() merely > > informs the caller whether the cancellation request was successfully > > queued. > > > > So any assumption *when* the cancallation takes place is may be wrong. > > I wonder, though, if we can't come up with a better solution than just > waiting for the next cancellation point. > > No solution comes to mind if the user code calls a Win32 function, but > maybe _sigbe could check if the thread's cancel_event is set? It's all > in assembler, that complicates it a bit, but that would make it at least > working for POSIX functions which are no cancellation points. Alternatively, maybe we can utilize the existing signal handler and just send a Cygwin-only signal outside the maskable signal range. wait_sig calls sigpacket::process like for any other standard signal, The signal handler is basically pthread::static_cancel_self(). Something like that. Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 13 20:16, Corinna Vinschen wrote: > On Jul 13 12:39, Jon Turney wrote: > > These tests async thread cancellation of a thread that doesn't have any > > cancellation points. > > > > Unfortunately, since 450f557f the async cancellation silently fails when > > the thread is inside the kernel function Sleep(), so it just exits > > I'm not sure how this patch should be the actual culprit. It only > handles thread priorities, not thread cancellation. Isn't this rather > 2b165a453ea7b or some such? > > > normally after 10 seconds. (See the commentary in pthread::cancel() in > > thread.cc, where it checks if the target thread is inside the kernel, > > and silently converts the cancellation into a deferred one) > > Nevertheless, I think this is ok to do. The description of pthread_cancel > contains this: > > Asynchronous cancelability means that the thread can be canceled at > any time (usually immediately, but the system does not guarantee this). > > And > > The above steps happen asynchronously with respect to the > pthread_cancel() call; the return status of pthread_cancel() merely > informs the caller whether the cancellation request was successfully > queued. > > So any assumption *when* the cancallation takes place is may be wrong. I wonder, though, if we can't come up with a better solution than just waiting for the next cancellation point. No solution comes to mind if the user code calls a Win32 function, but maybe _sigbe could check if the thread's cancel_event is set? It's all in assembler, that complicates it a bit, but that would make it at least working for POSIX functions which are no cancellation points. Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On Jul 13 12:39, Jon Turney wrote: > These tests async thread cancellation of a thread that doesn't have any > cancellation points. > > Unfortunately, since 450f557f the async cancellation silently fails when > the thread is inside the kernel function Sleep(), so it just exits I'm not sure how this patch should be the actual culprit. It only handles thread priorities, not thread cancellation. Isn't this rather 2b165a453ea7b or some such? > normally after 10 seconds. (See the commentary in pthread::cancel() in > thread.cc, where it checks if the target thread is inside the kernel, > and silently converts the cancellation into a deferred one) Nevertheless, I think this is ok to do. The description of pthread_cancel contains this: Asynchronous cancelability means that the thread can be canceled at any time (usually immediately, but the system does not guarantee this). And The above steps happen asynchronously with respect to the pthread_cancel() call; the return status of pthread_cancel() merely informs the caller whether the cancellation request was successfully queued. So any assumption *when* the cancallation takes place is may be wrong. Corinna
Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5
On 13/07/2023 12:39, Jon Turney wrote: These tests async thread cancellation of a thread that doesn't have any cancellation points. Unfortunately, since 450f557f the async cancellation silently fails when the thread is inside the kernel function Sleep(), so it just exits normally after 10 seconds. (See the commentary in pthread::cancel() in thread.cc, where it checks if the target thread is inside the kernel, and silently converts the cancellation into a deferred one) Work around this by busy-waiting rather than Sleep()ing for 10 seconds. This is still somewhat fragile: the async cancel could still fail, if it happens to occur while we're inside the kernel function that time() calls. v2: Do nothing more efficiently --- winsup/testsuite/winsup.api/pthread/cancel3.c | 24 ++- winsup/testsuite/winsup.api/pthread/cancel5.c | 24 ++- 2 files changed, 36 insertions(+), 12 deletions(-) This still seems a bit flaky, for the reasons identified. Perhaps there's a better way of doing a pause without a cancellation point or entering the kernel? Or a better way to test that async cancellation actually works?