Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-18 Thread Jon Turney

On 18/07/2023 13:09, Corinna Vinschen wrote:

On Jul 18 12:20, Jon Turney wrote:

On 17/07/2023 16:41, Corinna Vinschen wrote:

Looking into pthread::cancel we have this order of things:

  // cancel deferred
  mutex.unlock ();
  canceled = true;
  SetEvent (cancel_event);
  return 0;

The canceled var is set before the SetEvent call.
What if the thread is terminated after canceled is set to true but
before SetEvent is called?

pthread::testcancel claims:

We check for the canceled flag first. [...]
Only if the thread is marked as canceled, we wait for cancel_event
being really set, on the off-chance that pthread_cancel gets
interrupted before calling SetEvent.

Neat idea to speed up the code, but doesn't that mean we have a
potential deadlock, especially given that pthread::testcancel calls WFSO
with an INFINITE timeout?


I'm not sure I follow: another thread sets cancelled = true, just before we
hit pthread::testcancel(), so we go into the WFSO, but then the other thread
continues, signals cancel_event and everything's fine.

What meaning are you assigning to "interrupted" here?

Are we worried about the thread calling pthread_cancel being cancelled
itself?


Yes.  My concern is if the thread gets terminated between setting
canceled and setting the event object.

Prior to commit 42faed412857, we didn't wait infinitely, just tested the
event object.  Only with adding the canceled variable, we (better: I)
added the the infinite timeout.

I don't see a real reason to do that.  I think this should be changed
to just checking the event object, see the below patch.


I see now.  Yes, this makes perfect sense.


And if so, how do we fix this?  Theoretically, the most simple
solution might be to call SetEvent before setting the canceled
variable, but in fact we would have to make setting canceld
and cancel_event an atomic operation.


Well, yeah, that is required for them to be coherent. But we have a mutex on
the thread object for that purpose, and I don't quite see why it's released
so early here.


The mutex is not guarding canceled or the event object.  Thus it's not
used in testcancel either, otherwise introducing the canceled var to
speed up stuff wouldn't have made any sense.


Corinna


commit 518e5e46f064de41d3ef6d6ef743e2e760a46282
Author: Corinna Vinschen 
AuthorDate: Mon Jul 17 18:02:04 2023 +0200
Commit: Corinna Vinschen 
CommitDate: Tue Jul 18 10:11:30 2023 +0200

 Cygwin: don't wait infinitely on a pthread cancel event
 
 Starting with commit 42faed412857 ("* thread.h (class pthread): Add bool

 member canceled."), pthread::testcancel waits infinitely on cancel_event
 after it checked if the canceled variable is set.  However, this might
 introduce a deadlock, if the thread calling pthread_cancel is terminated
 after setting canceled to true, but before calling SetEvent on 
cancel_event.
 
 In fact, it's not at all necessary to wait infinitely.  By definition,

 the thread is only canceled if cancel_event is set.  The canceled
 variable is just a helper to speed up code.  We can safely assume that
 the thread hasn't been canceled yet, if canceled is set, but cancel_event
 isn't.
 
 Fixes: 42faed412857 ("* thread.h (class pthread): Add bool member canceled.")

 Signed-off-by: Corinna Vinschen 

diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
index f614e01c42f6..21e89e146e0a 100644
--- a/winsup/cygwin/thread.cc
+++ b/winsup/cygwin/thread.cc
@@ -961,12 +961,9 @@ pthread::testcancel ()
   pthread_testcancel function a lot without adding the overhead of
   an OS call.  Only if the thread is marked as canceled, we wait for
   cancel_event being really set, on the off-chance that pthread_cancel
- gets interrupted before calling SetEvent. */
-  if (canceled)
-{
-  WaitForSingleObject (cancel_event, INFINITE);
-  cancel_self ();
-}
+ gets interrupted or terminated before calling SetEvent. */
+  if (canceled && IsEventSignalled (cancel_event))
+cancel_self ();
  }
  
  /* Return cancel event handle if it exists *and* cancel is not disabled.







Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-18 Thread Corinna Vinschen
On Jul 18 12:20, Jon Turney wrote:
> On 17/07/2023 16:41, Corinna Vinschen wrote:
> > > Looking into pthread::cancel we have this order of things:
> > > 
> > >  // cancel deferred
> > >  mutex.unlock ();
> > >  canceled = true;
> > >  SetEvent (cancel_event);
> > >  return 0;
> > > 
> > > The canceled var is set before the SetEvent call.
> > > What if the thread is terminated after canceled is set to true but
> > > before SetEvent is called?
> > > 
> > > pthread::testcancel claims:
> > > 
> > >We check for the canceled flag first. [...]
> > >Only if the thread is marked as canceled, we wait for cancel_event
> > >being really set, on the off-chance that pthread_cancel gets
> > >interrupted before calling SetEvent.
> > > 
> > > Neat idea to speed up the code, but doesn't that mean we have a
> > > potential deadlock, especially given that pthread::testcancel calls WFSO
> > > with an INFINITE timeout?
> 
> I'm not sure I follow: another thread sets cancelled = true, just before we
> hit pthread::testcancel(), so we go into the WFSO, but then the other thread
> continues, signals cancel_event and everything's fine.
> 
> What meaning are you assigning to "interrupted" here?
> 
> Are we worried about the thread calling pthread_cancel being cancelled
> itself?

Yes.  My concern is if the thread gets terminated between setting
canceled and setting the event object.

Prior to commit 42faed412857, we didn't wait infinitely, just tested the
event object.  Only with adding the canceled variable, we (better: I)
added the the infinite timeout.

I don't see a real reason to do that.  I think this should be changed
to just checking the event object, see the below patch.

> > > And if so, how do we fix this?  Theoretically, the most simple
> > > solution might be to call SetEvent before setting the canceled
> > > variable, but in fact we would have to make setting canceld
> > > and cancel_event an atomic operation.
> 
> Well, yeah, that is required for them to be coherent. But we have a mutex on
> the thread object for that purpose, and I don't quite see why it's released
> so early here.

The mutex is not guarding canceled or the event object.  Thus it's not
used in testcancel either, otherwise introducing the canceled var to
speed up stuff wouldn't have made any sense.


Corinna


commit 518e5e46f064de41d3ef6d6ef743e2e760a46282
Author: Corinna Vinschen 
AuthorDate: Mon Jul 17 18:02:04 2023 +0200
Commit: Corinna Vinschen 
CommitDate: Tue Jul 18 10:11:30 2023 +0200

Cygwin: don't wait infinitely on a pthread cancel event

Starting with commit 42faed412857 ("* thread.h (class pthread): Add bool
member canceled."), pthread::testcancel waits infinitely on cancel_event
after it checked if the canceled variable is set.  However, this might
introduce a deadlock, if the thread calling pthread_cancel is terminated
after setting canceled to true, but before calling SetEvent on cancel_event.

In fact, it's not at all necessary to wait infinitely.  By definition,
the thread is only canceled if cancel_event is set.  The canceled
variable is just a helper to speed up code.  We can safely assume that
the thread hasn't been canceled yet, if canceled is set, but cancel_event
isn't.

Fixes: 42faed412857 ("* thread.h (class pthread): Add bool member 
canceled.")
Signed-off-by: Corinna Vinschen 

diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
index f614e01c42f6..21e89e146e0a 100644
--- a/winsup/cygwin/thread.cc
+++ b/winsup/cygwin/thread.cc
@@ -961,12 +961,9 @@ pthread::testcancel ()
  pthread_testcancel function a lot without adding the overhead of
  an OS call.  Only if the thread is marked as canceled, we wait for
  cancel_event being really set, on the off-chance that pthread_cancel
- gets interrupted before calling SetEvent. */
-  if (canceled)
-{
-  WaitForSingleObject (cancel_event, INFINITE);
-  cancel_self ();
-}
+ gets interrupted or terminated before calling SetEvent. */
+  if (canceled && IsEventSignalled (cancel_event))
+cancel_self ();
 }
 
 /* Return cancel event handle if it exists *and* cancel is not disabled.


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-18 Thread Jon Turney

On 17/07/2023 16:41, Corinna Vinschen wrote:

On Jul 17 16:21, Corinna Vinschen wrote:

On Jul 17 12:51, Jon Turney wrote:

On 17/07/2023 12:05, Corinna Vinschen wrote:

diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
index f614e01c42f6..fceb9bda1806 100644
--- a/winsup/cygwin/thread.cc
+++ b/winsup/cygwin/thread.cc
@@ -546,6 +546,13 @@ pthread::exit (void *value_ptr)
 class pthread *thread = this;
 _cygtls *tls = cygtls; /* Save cygtls before deleting this. */
+  /* Deferred cancellation still pending? */
+  if (canceled)
+{
+  WaitForSingleObject (cancel_event, INFINITE);
+  value_ptr = PTHREAD_CANCELED;
+}
+
 // run cleanup handlers
 pop_all_cleanup_handlers ();
What do you think?


I mean, by your own interpretation of the standard, this isn't required,
because we're allowed to take arbitrarily long to deliver the async
cancellation, and in this case, we took so long that the thread exited
before it happened, too bad...


True enough!


It doesn't seem a bad addition,



Actually, it seems we actually *have* to do this.  I just searched
for more info on that problem and, to my surprise, I found this in the
most obvious piece of documentation:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_exit.html

Quote:

   As the meaning of the status is determined by the application (except
   when the thread has been canceled, in which case it is
   PTHREAD_CANCELED), [...]


On second thought...

One thing bugging me is this:


This is still a bit fuzzy, though.  I'd appreciate any input.


Looking into pthread::cancel we have this order of things:

 // cancel deferred
 mutex.unlock ();
 canceled = true;
 SetEvent (cancel_event);
 return 0;

The canceled var is set before the SetEvent call.
What if the thread is terminated after canceled is set to true but
before SetEvent is called?

pthread::testcancel claims:

   We check for the canceled flag first. [...]
   Only if the thread is marked as canceled, we wait for cancel_event
   being really set, on the off-chance that pthread_cancel gets
   interrupted before calling SetEvent.

Neat idea to speed up the code, but doesn't that mean we have a
potential deadlock, especially given that pthread::testcancel calls WFSO
with an INFINITE timeout?


I'm not sure I follow: another thread sets cancelled = true, just before 
we hit pthread::testcancel(), so we go into the WFSO, but then the other 
thread continues, signals cancel_event and everything's fine.


What meaning are you assigning to "interrupted" here?

Are we worried about the thread calling pthread_cancel being cancelled 
itself?



And if so, how do we fix this?  Theoretically, the most simple
solution might be to call SetEvent before setting the canceled
variable, but in fact we would have to make setting canceld
and cancel_event an atomic operation.


Well, yeah, that is required for them to be coherent. But we have a 
mutex on the thread object for that purpose, and I don't quite see why 
it's released so early here.




Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Corinna Vinschen
On Jul 17 17:41, Corinna Vinschen wrote:
> On Jul 17 16:21, Corinna Vinschen wrote:
> > On Jul 17 12:51, Jon Turney wrote:
> > > On 17/07/2023 12:05, Corinna Vinschen wrote:
> > > > diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
> > > > index f614e01c42f6..fceb9bda1806 100644
> > > > --- a/winsup/cygwin/thread.cc
> > > > +++ b/winsup/cygwin/thread.cc
> > > > @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr)
> > > > class pthread *thread = this;
> > > > _cygtls *tls = cygtls;  /* Save cygtls before deleting this. */
> > > > +  /* Deferred cancellation still pending? */
> > > > +  if (canceled)
> > > > +{
> > > > +  WaitForSingleObject (cancel_event, INFINITE);
> > > > +  value_ptr = PTHREAD_CANCELED;
> > > > +}
> > > > +
> > > > // run cleanup handlers
> > > > pop_all_cleanup_handlers ();
> > > > What do you think?
> > > 
> > > I mean, by your own interpretation of the standard, this isn't required,
> > > because we're allowed to take arbitrarily long to deliver the async
> > > cancellation, and in this case, we took so long that the thread exited
> > > before it happened, too bad...
> > 
> > True enough!
> > 
> > > It doesn't seem a bad addition,
> > 
> Actually, it seems we actually *have* to do this.  I just searched
> for more info on that problem and, to my surprise, I found this in the
> most obvious piece of documentation:
> 
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_exit.html
> 
> Quote:
> 
>   As the meaning of the status is determined by the application (except
>   when the thread has been canceled, in which case it is
>   PTHREAD_CANCELED), [...]

FTR, apparently I have overinterpreted this sentence.

I performed the following crude test on Linux,, the idea being
to call pthread_cancel and then pthread_exit without hitting a
cancallation point in between.

cat > pt.c <
#include 
#include 
#include 
#include 
#include 

int marker = 0;

void *
thread (void *arg)
{
  int oldval;

  pthread_setcanceltype (PTHREAD_CANCEL_DEFERRED, );
  pthread_setcancelstate (PTHREAD_CANCEL_ENABLE, );
  marker = 1;
  while (marker < 2)
;
  pthread_exit ((void *) 42);
}

int
main ()
{
  int error;
  pthread_t pt;
  void *retval;

  if ((error = pthread_create (, NULL, thread, NULL)))
{
  printf ("pthread_create: %d %s\n", error, strerror (error));
  return 1;
}
  while (marker < 1)
;
  if ((error = pthread_cancel (pt)))
{
  marker = 2;
  printf ("pthread_cancel: %d %s\n", error, strerror (error));
  pthread_detach (pt);
  return 1;
}
  marker = 2;
  if ((error = pthread_join (pt, )))
{
  printf ("pthread_join: %d %s\n", error, strerror (error));
  pthread_detach (pt);
  return 1;
}
  printf ("retval = %ld (%d)\n", (uintptr_t) retval, retval == 
PTHREAD_CANCELED);
  return 0;
}
EOF
$ gcc -g -o pt pt.c -lpthread
$ ./pt
retval = 42 (0)

So retval is the one set by the application, not PTHREAD_CANCELED,
despite the pthread_cancel call.  Looks like handling cancellation
inside pthread_exit is really not the right thing to do...


Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Corinna Vinschen
On Jul 17 16:21, Corinna Vinschen wrote:
> On Jul 17 12:51, Jon Turney wrote:
> > On 17/07/2023 12:05, Corinna Vinschen wrote:
> > > diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
> > > index f614e01c42f6..fceb9bda1806 100644
> > > --- a/winsup/cygwin/thread.cc
> > > +++ b/winsup/cygwin/thread.cc
> > > @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr)
> > > class pthread *thread = this;
> > > _cygtls *tls = cygtls;/* Save cygtls before deleting this. */
> > > +  /* Deferred cancellation still pending? */
> > > +  if (canceled)
> > > +{
> > > +  WaitForSingleObject (cancel_event, INFINITE);
> > > +  value_ptr = PTHREAD_CANCELED;
> > > +}
> > > +
> > > // run cleanup handlers
> > > pop_all_cleanup_handlers ();
> > > What do you think?
> > 
> > I mean, by your own interpretation of the standard, this isn't required,
> > because we're allowed to take arbitrarily long to deliver the async
> > cancellation, and in this case, we took so long that the thread exited
> > before it happened, too bad...
> 
> True enough!
> 
> > It doesn't seem a bad addition,
> 
Actually, it seems we actually *have* to do this.  I just searched
for more info on that problem and, to my surprise, I found this in the
most obvious piece of documentation:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_exit.html

Quote:

  As the meaning of the status is determined by the application (except
  when the thread has been canceled, in which case it is
  PTHREAD_CANCELED), [...]

> On second thought...
> 
> One thing bugging me is this:

This is still a bit fuzzy, though.  I'd appreciate any input.

> Looking into pthread::cancel we have this order of things:
> 
> // cancel deferred
> mutex.unlock ();
> canceled = true;
> SetEvent (cancel_event);
> return 0; 
> 
> The canceled var is set before the SetEvent call.
> What if the thread is terminated after canceled is set to true but
> before SetEvent is called?
> 
> pthread::testcancel claims:
> 
>   We check for the canceled flag first. [...]
>   Only if the thread is marked as canceled, we wait for cancel_event
>   being really set, on the off-chance that pthread_cancel gets
>   interrupted before calling SetEvent.
> 
> Neat idea to speed up the code, but doesn't that mean we have a
> potential deadlock, especially given that pthread::testcancel calls WFSO
> with an INFINITE timeout?
> 
> And if so, how do we fix this?  Theoretically, the most simple
> solution might be to call SetEvent before setting the canceled
> variable, but in fact we would have to make setting canceld
> and cancel_event an atomic operation.
> 
> Another idea is never to wait for an INFINITE time.  Logically, if
> canceled is set and cancel_event isn't, the thread just hasn't been
> canceled yet.
> 
> Any better idea?
> 
> > but this turns pthread_exit() into a
> > deferred cancellation point as well, so it should be added to the list of
> > "an implementation may also mark other functions not specified in the
> > standard as cancellation points" in our documentation^W the huge comment in
> > threads.cc.
> 
> Yes, indeed.


Thanks,
Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Corinna Vinschen
On Jul 17 16:04, Corinna Vinschen wrote:
> On Jul 17 12:51, Jon Turney wrote:
> > Perhaps there is a better way to write a test that async cancellation works
> > in the absence of cancellation points, but it eludes me...
> 
> Same here, so just go ahead.

Actually, it's not just that.  I think this is the right thing to do.


Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Corinna Vinschen
On Jul 17 12:51, Jon Turney wrote:
> On 17/07/2023 12:05, Corinna Vinschen wrote:
> > On Jul 14 20:57, Corinna Vinschen wrote:
> > > What if Cygwin checks for a deferred cancellation in pthread::exit,
> > > too?  It needs to do this by its own, not calling pthread::testcancel,
> > > otherwise we're in an infinite loop.  Since cancel is basically like
> > > exit, just with a PTHREAD_CANCELED return value, the only additional
> > > action would be to set retval to PTHREAD_CANCELED explicitely.
> > 
> > Kind of like this:
> > 
> > diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
> > index f614e01c42f6..fceb9bda1806 100644
> > --- a/winsup/cygwin/thread.cc
> > +++ b/winsup/cygwin/thread.cc
> > @@ -546,6 +546,13 @@ pthread::exit (void *value_ptr)
> > class pthread *thread = this;
> > _cygtls *tls = cygtls;  /* Save cygtls before deleting this. */
> > +  /* Deferred cancellation still pending? */
> > +  if (canceled)
> > +{
> > +  WaitForSingleObject (cancel_event, INFINITE);
> > +  value_ptr = PTHREAD_CANCELED;
> > +}
> > +
> > // run cleanup handlers
> > pop_all_cleanup_handlers ();
> > What do you think?
> 
> I mean, by your own interpretation of the standard, this isn't required,
> because we're allowed to take arbitrarily long to deliver the async
> cancellation, and in this case, we took so long that the thread exited
> before it happened, too bad...

True enough!

> It doesn't seem a bad addition,

On second thought...

One thing bugging me is this:

Looking into pthread::cancel we have this order of things:

// cancel deferred
mutex.unlock ();
canceled = true;
SetEvent (cancel_event);
return 0; 

The canceled var is set before the SetEvent call.
What if the thread is terminated after canceled is set to true but
before SetEvent is called?

pthread::testcancel claims:

  We check for the canceled flag first. [...]
  Only if the thread is marked as canceled, we wait for cancel_event
  being really set, on the off-chance that pthread_cancel gets
  interrupted before calling SetEvent.

Neat idea to speed up the code, but doesn't that mean we have a
potential deadlock, especially given that pthread::testcancel calls WFSO
with an INFINITE timeout?

And if so, how do we fix this?  Theoretically, the most simple
solution might be to call SetEvent before setting the canceled
variable, but in fact we would have to make setting canceld
and cancel_event an atomic operation.

Another idea is never to wait for an INFINITE time.  Logically, if
canceled is set and cancel_event isn't, the thread just hasn't been
canceled yet.

Any better idea?

> but this turns pthread_exit() into a
> deferred cancellation point as well, so it should be added to the list of
> "an implementation may also mark other functions not specified in the
> standard as cancellation points" in our documentation^W the huge comment in
> threads.cc.

Yes, indeed.


Thanks,
Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Corinna Vinschen
On Jul 17 12:51, Jon Turney wrote:
> On 14/07/2023 14:04, Jon Turney wrote:
> > On 13/07/2023 19:53, Corinna Vinschen wrote:
> > > > > > normally after 10 seconds. (See the commentary in pthread::cancel() 
> > > > > > in
> > > > > > thread.cc, where it checks if the target thread is inside the 
> > > > > > kernel,
> > > > > > and silently converts the cancellation into a deferred one)
> > > > > 
> > > > > Nevertheless, I think this is ok to do.  The description of
> > > > > pthread_cancel
> > > > > contains this:
> > > > > 
> > > > >    Asynchronous cancelability means that the thread can be canceled at
> > > > >    any time (usually immediately, but the system does not
> > > > > guarantee this).
> > > > > 
> > > > > And
> > > > > 
> > > > >    The above steps happen asynchronously with respect to the
> > > > >    pthread_cancel() call; the return status of pthread_cancel() merely
> > > > >    informs the caller whether the cancellation request was 
> > > > > successfully
> > > > >    queued.
> > > > > 
> > > > > So any assumption *when* the cancallation takes place is may be wrong.
> > 
> > Yeah.
> > 
> > I think the flakiness is when we happen to try to async cancel while in
> > the Windows kernel, which implicitly converts to a deferred
> > cancellation, but there are no cancellation points in the the thread, so
> > it arrives at pthread_exit() and returns a exit code other than
> > PTHREAD_CANCELED.
> > 
> > I did consider making the test non-flaky by adding a final call to
> > pthread_testcancel(), to notice any failed async cancellation which has
> > been converted to a deferred one.
> > 
> > But then that is just the same as the deferred cancellation tests, and
> > confirms the cancellation happens, but not that it's async, which is
> > part of the point of the test.
> > 
> > I guess this could also check that not all of the threads ran for all 10
> > seconds, which would indicate that at least some of them were cancelled
> > asynchronously.
> 
> I wrote this, attached, which does indeed make this test more reliable.
> 
> You point is well made that this is making assumptions about how quickly
> async cancellation works that are not required by the standard
> 
> (It would be a valid, if strange implementation, if async cancellation
> always took 10 seconds to take effect, which this test assumes isn't the
> case)
> 
> Perhaps there is a better way to write a test that async cancellation works
> in the absence of cancellation points, but it eludes me...

Same here, so just go ahead.

Thanks,
Corinna



Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Jon Turney

On 17/07/2023 12:05, Corinna Vinschen wrote:

On Jul 14 20:57, Corinna Vinschen wrote:

On Jul 14 14:04, Jon Turney wrote:

On 13/07/2023 19:53, Corinna Vinschen wrote:

Nevertheless, I think this is ok to do.  The description of pthread_cancel
contains this:

Asynchronous cancelability means that the thread can be canceled at
any time (usually immediately, but the system does not guarantee this).

And

The above steps happen asynchronously with respect to the
pthread_cancel() call; the return status of pthread_cancel() merely
informs the caller whether the cancellation request was successfully
queued.

So any assumption *when* the cancallation takes place is may be wrong.


Yeah.

I think the flakiness is when we happen to try to async cancel while in the
Windows kernel, which implicitly converts to a deferred cancellation, but
there are no cancellation points in the the thread, so it arrives at
pthread_exit() and returns a exit code other than PTHREAD_CANCELED.


In pthread_join(), right?


I did consider making the test non-flaky by adding a final call to
pthread_testcancel(), to notice any failed async cancellation which has been
converted to a deferred one.

But then that is just the same as the deferred cancellation tests, and
confirms the cancellation happens, but not that it's async, which is part of
the point of the test.


What if Cygwin checks for a deferred cancellation in pthread::exit,
too?  It needs to do this by its own, not calling pthread::testcancel,
otherwise we're in an infinite loop.  Since cancel is basically like
exit, just with a PTHREAD_CANCELED return value, the only additional
action would be to set retval to PTHREAD_CANCELED explicitely.


Kind of like this:

diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
index f614e01c42f6..fceb9bda1806 100644
--- a/winsup/cygwin/thread.cc
+++ b/winsup/cygwin/thread.cc
@@ -546,6 +546,13 @@ pthread::exit (void *value_ptr)
class pthread *thread = this;
_cygtls *tls = cygtls;  /* Save cygtls before deleting this. */
  
+  /* Deferred cancellation still pending? */

+  if (canceled)
+{
+  WaitForSingleObject (cancel_event, INFINITE);
+  value_ptr = PTHREAD_CANCELED;
+}
+
// run cleanup handlers
pop_all_cleanup_handlers ();
  
What do you think?


I mean, by your own interpretation of the standard, this isn't required, 
because we're allowed to take arbitrarily long to deliver the async 
cancellation, and in this case, we took so long that the thread exited 
before it happened, too bad...


It doesn't seem a bad addition, but this turns pthread_exit() into a 
deferred cancellation point as well, so it should be added to the list 
of "an implementation may also mark other functions not specified in the 
standard as cancellation points" in our documentation^W the huge comment 
in threads.cc.




Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Jon Turney

On 14/07/2023 14:04, Jon Turney wrote:

On 13/07/2023 19:53, Corinna Vinschen wrote:

normally after 10 seconds. (See the commentary in pthread::cancel() in
thread.cc, where it checks if the target thread is inside the kernel,
and silently converts the cancellation into a deferred one)


Nevertheless, I think this is ok to do.  The description of 
pthread_cancel

contains this:

   Asynchronous cancelability means that the thread can be canceled at
   any time (usually immediately, but the system does not guarantee 
this).


And

   The above steps happen asynchronously with respect to the
   pthread_cancel() call; the return status of pthread_cancel() merely
   informs the caller whether the cancellation request was successfully
   queued.

So any assumption *when* the cancallation takes place is may be wrong.


Yeah.

I think the flakiness is when we happen to try to async cancel while in 
the Windows kernel, which implicitly converts to a deferred 
cancellation, but there are no cancellation points in the the thread, so 
it arrives at pthread_exit() and returns a exit code other than 
PTHREAD_CANCELED.


I did consider making the test non-flaky by adding a final call to 
pthread_testcancel(), to notice any failed async cancellation which has 
been converted to a deferred one.


But then that is just the same as the deferred cancellation tests, and 
confirms the cancellation happens, but not that it's async, which is 
part of the point of the test.


I guess this could also check that not all of the threads ran for all 10 
seconds, which would indicate that at least some of them were cancelled 
asynchronously.


I wrote this, attached, which does indeed make this test more reliable.

You point is well made that this is making assumptions about how quickly 
async cancellation works that are not required by the standard


(It would be a valid, if strange implementation, if async cancellation 
always took 10 seconds to take effect, which this test assumes isn't the 
case)


Perhaps there is a better way to write a test that async cancellation 
works in the absence of cancellation points, but it eludes me...
From d0023fa3ea1e8f29e80d473ab13d8200bdd2dc3a Mon Sep 17 00:00:00 2001
From: Jon Turney 
Date: Sat, 15 Jul 2023 17:57:43 +0100
Subject: [PATCH] Cygwin: testsuite: Make cancel3 and cancel5 more robust

Despite our efforts, sometimes the async cancellation gets deferred.

Notice this by calling pthread_testcancel(), and then try to work out if
async cancellation was ever successful by checking if all threads ran
for the full 10 seconds, or if some were stopped early.
---
 winsup/testsuite/winsup.api/pthread/cancel3.c | 16 +++-
 winsup/testsuite/winsup.api/pthread/cancel5.c | 14 ++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/winsup/testsuite/winsup.api/pthread/cancel3.c 
b/winsup/testsuite/winsup.api/pthread/cancel3.c
index 07feb7c9b..075f052cc 100644
--- a/winsup/testsuite/winsup.api/pthread/cancel3.c
+++ b/winsup/testsuite/winsup.api/pthread/cancel3.c
@@ -92,6 +92,9 @@ mythread(void * arg)
}
 }
 
+  /* Notice if asynchronous cancel got deferred */
+  pthread_testcancel();
+
   return result;
 }
 
@@ -101,6 +104,7 @@ main()
   int failed = 0;
   int i;
   pthread_t t[NUMTHREADS + 1];
+  int ran_to_completion = 0;
 
   assert((t[0] = pthread_self()) != NULL);
 
@@ -130,7 +134,7 @@ main()
* Standard check that all threads started.
*/
   for (i = 1; i <= NUMTHREADS; i++)
-{ 
+{
   if (!threadbag[i].started)
{
  failed |= !threadbag[i].started;
@@ -166,9 +170,19 @@ main()
  threadbag[i].count,
  result);
}
+
+  if (threadbag[i].count >= 10)
+   ran_to_completion++;
+
   failed = (failed || fail);
 }
 
+  if (ran_to_completion >= 10)
+{
+  fprintf(stderr, "All threads ran to completion, async cancellation never 
happened\n");
+  failed = TRUE;
+}
+
   assert(!failed);
 
   /*
diff --git a/winsup/testsuite/winsup.api/pthread/cancel5.c 
b/winsup/testsuite/winsup.api/pthread/cancel5.c
index 999b3c95c..23c02afe4 100644
--- a/winsup/testsuite/winsup.api/pthread/cancel5.c
+++ b/winsup/testsuite/winsup.api/pthread/cancel5.c
@@ -93,6 +93,9 @@ mythread(void * arg)
}
 }
 
+  /* Notice if asynchronous cancel got deferred */
+  pthread_testcancel();
+
   return result;
 }
 
@@ -102,6 +105,7 @@ main()
   int failed = 0;
   int i;
   pthread_t t[NUMTHREADS + 1];
+  int ran_to_completion = 0;
 
   for (i = 1; i <= NUMTHREADS; i++)
 {
@@ -165,9 +169,19 @@ main()
  threadbag[i].count,
  result);
}
+
+  if (threadbag[i].count >= 10)
+   ran_to_completion++;
+
   failed = (failed || fail);
 }
 
+  if (ran_to_completion >= 10)
+{
+  fprintf(stderr, "All threads ran to completion, async cancellation never 
happened\n");
+  failed = TRUE;
+}
+
   assert(!failed);
 
   /*
-- 
2.39.0



Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-17 Thread Corinna Vinschen
On Jul 14 20:57, Corinna Vinschen wrote:
> On Jul 14 14:04, Jon Turney wrote:
> > On 13/07/2023 19:53, Corinna Vinschen wrote:
> > > > > Nevertheless, I think this is ok to do.  The description of 
> > > > > pthread_cancel
> > > > > contains this:
> > > > > 
> > > > >Asynchronous cancelability means that the thread can be canceled at
> > > > >any time (usually immediately, but the system does not guarantee 
> > > > > this).
> > > > > 
> > > > > And
> > > > > 
> > > > >The above steps happen asynchronously with respect to the
> > > > >pthread_cancel() call; the return status of pthread_cancel() merely
> > > > >informs the caller whether the cancellation request was 
> > > > > successfully
> > > > >queued.
> > > > > 
> > > > > So any assumption *when* the cancallation takes place is may be wrong.
> > 
> > Yeah.
> > 
> > I think the flakiness is when we happen to try to async cancel while in the
> > Windows kernel, which implicitly converts to a deferred cancellation, but
> > there are no cancellation points in the the thread, so it arrives at
> > pthread_exit() and returns a exit code other than PTHREAD_CANCELED.
> 
> In pthread_join(), right?
> 
> > I did consider making the test non-flaky by adding a final call to
> > pthread_testcancel(), to notice any failed async cancellation which has been
> > converted to a deferred one.
> > 
> > But then that is just the same as the deferred cancellation tests, and
> > confirms the cancellation happens, but not that it's async, which is part of
> > the point of the test.
> 
> What if Cygwin checks for a deferred cancellation in pthread::exit,
> too?  It needs to do this by its own, not calling pthread::testcancel,
> otherwise we're in an infinite loop.  Since cancel is basically like
> exit, just with a PTHREAD_CANCELED return value, the only additional
> action would be to set retval to PTHREAD_CANCELED explicitely.

Kind of like this:

diff --git a/winsup/cygwin/thread.cc b/winsup/cygwin/thread.cc
index f614e01c42f6..fceb9bda1806 100644
--- a/winsup/cygwin/thread.cc
+++ b/winsup/cygwin/thread.cc
@@ -546,6 +546,13 @@ pthread::exit (void *value_ptr)
   class pthread *thread = this;
   _cygtls *tls = cygtls;   /* Save cygtls before deleting this. */
 
+  /* Deferred cancellation still pending? */
+  if (canceled)
+{
+  WaitForSingleObject (cancel_event, INFINITE);
+  value_ptr = PTHREAD_CANCELED;
+}
+
   // run cleanup handlers
   pop_all_cleanup_handlers ();
 
What do you think?


Corinna



Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-14 Thread Corinna Vinschen
On Jul 14 14:04, Jon Turney wrote:
> On 13/07/2023 19:53, Corinna Vinschen wrote:
> > > > Nevertheless, I think this is ok to do.  The description of 
> > > > pthread_cancel
> > > > contains this:
> > > > 
> > > >Asynchronous cancelability means that the thread can be canceled at
> > > >any time (usually immediately, but the system does not guarantee 
> > > > this).
> > > > 
> > > > And
> > > > 
> > > >The above steps happen asynchronously with respect to the
> > > >pthread_cancel() call; the return status of pthread_cancel() merely
> > > >informs the caller whether the cancellation request was successfully
> > > >queued.
> > > > 
> > > > So any assumption *when* the cancallation takes place is may be wrong.
> 
> Yeah.
> 
> I think the flakiness is when we happen to try to async cancel while in the
> Windows kernel, which implicitly converts to a deferred cancellation, but
> there are no cancellation points in the the thread, so it arrives at
> pthread_exit() and returns a exit code other than PTHREAD_CANCELED.

In pthread_join(), right?

> I did consider making the test non-flaky by adding a final call to
> pthread_testcancel(), to notice any failed async cancellation which has been
> converted to a deferred one.
> 
> But then that is just the same as the deferred cancellation tests, and
> confirms the cancellation happens, but not that it's async, which is part of
> the point of the test.

What if Cygwin checks for a deferred cancellation in pthread::exit,
too?  It needs to do this by its own, not calling pthread::testcancel,
otherwise we're in an infinite loop.  Since cancel is basically like
exit, just with a PTHREAD_CANCELED return value, the only additional
action would be to set retval to PTHREAD_CANCELED explicitely.


Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-14 Thread Jon Turney

On 13/07/2023 19:53, Corinna Vinschen wrote:

On Jul 13 20:37, Corinna Vinschen wrote:

On Jul 13 20:16, Corinna Vinschen wrote:

On Jul 13 12:39, Jon Turney wrote:

These tests async thread cancellation of a thread that doesn't have any
cancellation points.

Unfortunately, since 450f557f the async cancellation silently fails when
the thread is inside the kernel function Sleep(), so it just exits


I'm not sure how this patch should be the actual culprit.  It only
handles thread priorities, not thread cancellation.  Isn't this rather
2b165a453ea7b or some such?


Yeah, I messed up there somehow. I will fix the commit id.


normally after 10 seconds. (See the commentary in pthread::cancel() in
thread.cc, where it checks if the target thread is inside the kernel,
and silently converts the cancellation into a deferred one)


Nevertheless, I think this is ok to do.  The description of pthread_cancel
contains this:

   Asynchronous cancelability means that the thread can be canceled at
   any time (usually immediately, but the system does not guarantee this).

And

   The above steps happen asynchronously with respect to the
   pthread_cancel() call; the return status of pthread_cancel() merely
   informs the caller whether the cancellation request was successfully
   queued.

So any assumption *when* the cancallation takes place is may be wrong.


Yeah.

I think the flakiness is when we happen to try to async cancel while in 
the Windows kernel, which implicitly converts to a deferred 
cancellation, but there are no cancellation points in the the thread, so 
it arrives at pthread_exit() and returns a exit code other than 
PTHREAD_CANCELED.


I did consider making the test non-flaky by adding a final call to 
pthread_testcancel(), to notice any failed async cancellation which has 
been converted to a deferred one.


But then that is just the same as the deferred cancellation tests, and 
confirms the cancellation happens, but not that it's async, which is 
part of the point of the test.


I guess this could also check that not all of the threads ran for all 10 
seconds, which would indicate that at least some of them were cancelled 
asynchronously.



I wonder, though, if we can't come up with a better solution than just
waiting for the next cancellation point.

No solution comes to mind if the user code calls a Win32 function, but
maybe _sigbe could check if the thread's cancel_event is set?  It's all
in assembler, that complicates it a bit, but that would make it at least
working for POSIX functions which are no cancellation points.


I think you'd need to record the "pending async cancellation" as 
different to "deferred cancellation" so this doesn't turn everything 
into a cancellation point.



Alternatively, maybe we can utilize the existing signal handler and
just send a Cygwin-only signal outside the maskable signal range.
wait_sig calls sigpacket::process like for any other standard signal,
The signal handler is basically pthread::static_cancel_self().
Something like that.
I'm not sure this is worth lots of effort, as thread cancellation seems 
to be regarded as mis-specified in such as way as to make it unsafe for 
serious use.




Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-13 Thread Corinna Vinschen
On Jul 13 20:37, Corinna Vinschen wrote:
> On Jul 13 20:16, Corinna Vinschen wrote:
> > On Jul 13 12:39, Jon Turney wrote:
> > > These tests async thread cancellation of a thread that doesn't have any
> > > cancellation points.
> > > 
> > > Unfortunately, since 450f557f the async cancellation silently fails when
> > > the thread is inside the kernel function Sleep(), so it just exits
> > 
> > I'm not sure how this patch should be the actual culprit.  It only
> > handles thread priorities, not thread cancellation.  Isn't this rather
> > 2b165a453ea7b or some such?
> > 
> > > normally after 10 seconds. (See the commentary in pthread::cancel() in
> > > thread.cc, where it checks if the target thread is inside the kernel,
> > > and silently converts the cancellation into a deferred one)
> > 
> > Nevertheless, I think this is ok to do.  The description of pthread_cancel
> > contains this:
> > 
> >   Asynchronous cancelability means that the thread can be canceled at
> >   any time (usually immediately, but the system does not guarantee this).
> > 
> > And
> > 
> >   The above steps happen asynchronously with respect to the
> >   pthread_cancel() call; the return status of pthread_cancel() merely
> >   informs the caller whether the cancellation request was successfully
> >   queued.
> > 
> > So any assumption *when* the cancallation takes place is may be wrong.
> 
> I wonder, though, if we can't come up with a better solution than just
> waiting for the next cancellation point.
> 
> No solution comes to mind if the user code calls a Win32 function, but
> maybe _sigbe could check if the thread's cancel_event is set?  It's all
> in assembler, that complicates it a bit, but that would make it at least
> working for POSIX functions which are no cancellation points.

Alternatively, maybe we can utilize the existing signal handler and
just send a Cygwin-only signal outside the maskable signal range.
wait_sig calls sigpacket::process like for any other standard signal,
The signal handler is basically pthread::static_cancel_self().
Something like that.


Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-13 Thread Corinna Vinschen
On Jul 13 20:16, Corinna Vinschen wrote:
> On Jul 13 12:39, Jon Turney wrote:
> > These tests async thread cancellation of a thread that doesn't have any
> > cancellation points.
> > 
> > Unfortunately, since 450f557f the async cancellation silently fails when
> > the thread is inside the kernel function Sleep(), so it just exits
> 
> I'm not sure how this patch should be the actual culprit.  It only
> handles thread priorities, not thread cancellation.  Isn't this rather
> 2b165a453ea7b or some such?
> 
> > normally after 10 seconds. (See the commentary in pthread::cancel() in
> > thread.cc, where it checks if the target thread is inside the kernel,
> > and silently converts the cancellation into a deferred one)
> 
> Nevertheless, I think this is ok to do.  The description of pthread_cancel
> contains this:
> 
>   Asynchronous cancelability means that the thread can be canceled at
>   any time (usually immediately, but the system does not guarantee this).
> 
> And
> 
>   The above steps happen asynchronously with respect to the
>   pthread_cancel() call; the return status of pthread_cancel() merely
>   informs the caller whether the cancellation request was successfully
>   queued.
> 
> So any assumption *when* the cancallation takes place is may be wrong.

I wonder, though, if we can't come up with a better solution than just
waiting for the next cancellation point.

No solution comes to mind if the user code calls a Win32 function, but
maybe _sigbe could check if the thread's cancel_event is set?  It's all
in assembler, that complicates it a bit, but that would make it at least
working for POSIX functions which are no cancellation points.


Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-13 Thread Corinna Vinschen
On Jul 13 12:39, Jon Turney wrote:
> These tests async thread cancellation of a thread that doesn't have any
> cancellation points.
> 
> Unfortunately, since 450f557f the async cancellation silently fails when
> the thread is inside the kernel function Sleep(), so it just exits

I'm not sure how this patch should be the actual culprit.  It only
handles thread priorities, not thread cancellation.  Isn't this rather
2b165a453ea7b or some such?

> normally after 10 seconds. (See the commentary in pthread::cancel() in
> thread.cc, where it checks if the target thread is inside the kernel,
> and silently converts the cancellation into a deferred one)

Nevertheless, I think this is ok to do.  The description of pthread_cancel
contains this:

  Asynchronous cancelability means that the thread can be canceled at
  any time (usually immediately, but the system does not guarantee this).

And

  The above steps happen asynchronously with respect to the
  pthread_cancel() call; the return status of pthread_cancel() merely
  informs the caller whether the cancellation request was successfully
  queued.

So any assumption *when* the cancallation takes place is may be wrong.


Corinna


Re: [PATCH 08/11] Cygwin: testsuite: Busy-wait in cancel3 and cancel5

2023-07-13 Thread Jon Turney

On 13/07/2023 12:39, Jon Turney wrote:

These tests async thread cancellation of a thread that doesn't have any
cancellation points.

Unfortunately, since 450f557f the async cancellation silently fails when
the thread is inside the kernel function Sleep(), so it just exits
normally after 10 seconds. (See the commentary in pthread::cancel() in
thread.cc, where it checks if the target thread is inside the kernel,
and silently converts the cancellation into a deferred one)

Work around this by busy-waiting rather than Sleep()ing for 10 seconds.

This is still somewhat fragile: the async cancel could still fail, if it
happens to occur while we're inside the kernel function that time()
calls.

v2:
Do nothing more efficiently
---
  winsup/testsuite/winsup.api/pthread/cancel3.c | 24 ++-
  winsup/testsuite/winsup.api/pthread/cancel5.c | 24 ++-
  2 files changed, 36 insertions(+), 12 deletions(-)



This still seems a bit flaky, for the reasons identified. Perhaps 
there's a better way of doing a pause without a cancellation point or 
entering the kernel? Or a better way to test that async cancellation 
actually works?