Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-10 Thread Martin Wilck
On Sat, 2018-02-10 at 17:11 +0100, Martin Wilck wrote:
> On Fri, 2018-02-09 at 18:36 -0600, Benjamin Marzinski wrote:
> > On Sat, Feb 10, 2018 at 12:36:05AM +0100, Martin Wilck wrote:
> > > On Sat, 2018-02-10 at 00:28 +0100, Martin Wilck wrote:
> > > > Maybe it's easier than we thought. Attached is a patch on top of
> > > > yours that I think might work, please have a look.
> > > > 
> > > 
> > > That one didn't even compile. This one is better.
> > > 
> > > Martin
> > 
> > How about this one instead. The idea is that once we are in the
> > cleanup handler, we just cleanup and exit. But before we enter it,
> > we atomically exchange running, and if running was 0, we pause(),
> > since the checker is either about to cancel us, or already has.
> > 
> 
> Yes, that should work. Nice.

... but I just realized that we don't rcu_register_thread() the TUR
thread. Maybe we should if we use RCU primitives?

Martin

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-10 Thread Martin Wilck
On Fri, 2018-02-09 at 18:36 -0600, Benjamin Marzinski wrote:
> On Sat, Feb 10, 2018 at 12:36:05AM +0100, Martin Wilck wrote:
> > On Sat, 2018-02-10 at 00:28 +0100, Martin Wilck wrote:
> > > Maybe it's easier than we thought. Attached is a patch on top of
> > > yours that I think might work, please have a look. 
> > > 
> > 
> > That one didn't even compile. This one is better.
> > 
> > Martin
> 
> How about this one instead. The idea is that once we are in the
> cleanup handler, we just cleanup and exit. But before we enter it, we
> atomically exchange running, and if running was 0, we pause(), since
> the checker is either about to cancel us, or already has.
> 

Yes, that should work. Nice.

Regards,
Martin


> diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> index 894ad41..3774a17 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -214,15 +214,12 @@ retry:
>  
>  static void cleanup_func(void *data)
>  {
> - int running, holders;
> + int holders;
>   struct tur_checker_context *ct = data;
>  
> - running = uatomic_xchg(&ct->running, 0);
>   holders = uatomic_sub_return(&ct->holders, 1);
>   if (!holders)
>   cleanup_context(ct);
> - if (!running)
> - pause();
>  }
>  
>  static int tur_running(struct tur_checker_context *ct)
> @@ -242,7 +239,7 @@ static void copy_msg_to_tcc(void *ct_p, const char *msg)
>  static void *tur_thread(void *ctx)
>  {
>   struct tur_checker_context *ct = ctx;
> - int state;
> + int state, running;
>   char devt[32];
>  
>   condlog(3, "%s: tur checker starting up",
> @@ -268,6 +265,11 @@ static void *tur_thread(void *ctx)
>  
>   condlog(3, "%s: tur checker finished, state %s",
>   tur_devt(devt, sizeof(devt), ct), checker_state_name(state));
> +
> + running = uatomic_xchg(&ct->running, 0);
> + if (!running)
> + pause();
> +
>   tur_thread_cleanup_pop(ct);
>  
>   return ((void *)0);
> 
> 
> > 
> > -- 
> > Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
> > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham
> > Norton
> > HRB 21284 (AG Nürnberg)
> > From afb9c7de3658d49c4f28f6b9ee618a87b806ecdd Mon Sep 17 00:00:00
> > 2001
> > From: Martin Wilck 
> > Date: Sat, 10 Feb 2018 00:22:17 +0100
> > Subject: [PATCH] tur checker: make sure pthread_cancel isn't called
> > for exited
> >  thread
> > 
> > If we enter the cleanup function as the result of a pthread_cancel
> > by another
> > thread, we don't need to wait for a cancellation any more. If we
> > exit
> > regularly, just tell the other thread not to try to cancel us.
> > ---
> >  libmultipath/checkers/tur.c | 9 +
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> > 
> > diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> > index 894ad41c89c3..5d2b36bfa883 100644
> > --- a/libmultipath/checkers/tur.c
> > +++ b/libmultipath/checkers/tur.c
> > @@ -214,15 +214,13 @@ retry:
> >  
> >  static void cleanup_func(void *data)
> >  {
> > -   int running, holders;
> > +   int holders;
> > struct tur_checker_context *ct = data;
> >  
> > -   running = uatomic_xchg(&ct->running, 0);
> > +   uatomic_set(&ct->running, 0);
> > holders = uatomic_sub_return(&ct->holders, 1);
> > if (!holders)
> > cleanup_context(ct);
> > -   if (!running)
> > -   pause();
> >  }
> >  
> >  static int tur_running(struct tur_checker_context *ct)
> > @@ -266,6 +264,9 @@ static void *tur_thread(void *ctx)
> > pthread_cond_signal(&ct->active);
> > pthread_mutex_unlock(&ct->lock);
> >  
> > +   /* Tell main checker thread not to cancel us, as we exit anyway */
> > +   uatomic_set(&ct->running, 0);
> > +
> > condlog(3, "%s: tur checker finished, state %s",
> > tur_devt(devt, sizeof(devt), ct),
> > checker_state_name(state));
> > tur_thread_cleanup_pop(ct);
> > -- 
> > 2.16.1
> > 
> 
> 

Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-10 Thread Martin Wilck
On Fri, 2018-02-09 at 18:17 -0600, Benjamin Marzinski wrote:
> On Sat, Feb 10, 2018 at 12:36:05AM +0100, Martin Wilck wrote:
> > On Sat, 2018-02-10 at 00:28 +0100, Martin Wilck wrote:
> > > Maybe it's easier than we thought. Attached is a patch on top of
> > > yours that I think might work, please have a look. 
> > > 
> > 
> > That one didn't even compile. This one is better.
> > 
> > Martin
> 
> So if we have this ordering
> 
> - checker calls uatomic_xchg() which returns 1 and then gets scheduled
> - thread calls uatomic_set() and then runs till it terminates
> - checker calls pthread_cancel()
> 
> You will get Bart's original bug. 

Yes, I realized that overnight :-( I shouldn't post stuff like this
around midnight. But I have another idea in my mind.

Martin

>  I realize that having the condlog()
> after the uatomic_set() in the thread makes this unlikely, but I
> don't like races like this. I would be happier with simply taking the
> original code and moving the condlog(), if neither of my other two
> options are acceptable.
> 
> -Ben
> 
> > 
> > -- 
> > Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
> > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham
> > Norton
> > HRB 21284 (AG Nürnberg)
> > From afb9c7de3658d49c4f28f6b9ee618a87b806ecdd Mon Sep 17 00:00:00
> > 2001
> > From: Martin Wilck 
> > Date: Sat, 10 Feb 2018 00:22:17 +0100
> > Subject: [PATCH] tur checker: make sure pthread_cancel isn't called
> > for exited
> >  thread
> > 
> > If we enter the cleanup function as the result of a pthread_cancel
> > by another
> > thread, we don't need to wait for a cancellation any more. If we
> > exit
> > regularly, just tell the other thread not to try to cancel us.
> > ---
> >  libmultipath/checkers/tur.c | 9 +
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> > 
> > diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> > index 894ad41c89c3..5d2b36bfa883 100644
> > --- a/libmultipath/checkers/tur.c
> > +++ b/libmultipath/checkers/tur.c
> > @@ -214,15 +214,13 @@ retry:
> >  
> >  static void cleanup_func(void *data)
> >  {
> > -   int running, holders;
> > +   int holders;
> > struct tur_checker_context *ct = data;
> >  
> > -   running = uatomic_xchg(&ct->running, 0);
> > +   uatomic_set(&ct->running, 0);
> > holders = uatomic_sub_return(&ct->holders, 1);
> > if (!holders)
> > cleanup_context(ct);
> > -   if (!running)
> > -   pause();
> >  }
> >  
> >  static int tur_running(struct tur_checker_context *ct)
> > @@ -266,6 +264,9 @@ static void *tur_thread(void *ctx)
> > pthread_cond_signal(&ct->active);
> > pthread_mutex_unlock(&ct->lock);
> >  
> > +   /* Tell main checker thread not to cancel us, as we exit anyway */
> > +   uatomic_set(&ct->running, 0);
> > +
> > condlog(3, "%s: tur checker finished, state %s",
> > tur_devt(devt, sizeof(devt), ct),
> > checker_state_name(state));
> > tur_thread_cleanup_pop(ct);
> > -- 
> > 2.16.1
> > 
> 
> 

Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Benjamin Marzinski
On Sat, Feb 10, 2018 at 12:36:05AM +0100, Martin Wilck wrote:
> On Sat, 2018-02-10 at 00:28 +0100, Martin Wilck wrote:
> > Maybe it's easier than we thought. Attached is a patch on top of
> > yours that I think might work, please have a look. 
> > 
> 
> That one didn't even compile. This one is better.
> 
> Martin

How about this one instead. The idea is that once we are in the cleanup
handler, we just cleanup and exit. But before we enter it, we atomically
exchange running, and if running was 0, we pause(), since the checker is
either about to cancel us, or already has.

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index 894ad41..3774a17 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -214,15 +214,12 @@ retry:
 
 static void cleanup_func(void *data)
 {
-	int running, holders;
+	int holders;
 	struct tur_checker_context *ct = data;
 
-	running = uatomic_xchg(&ct->running, 0);
 	holders = uatomic_sub_return(&ct->holders, 1);
 	if (!holders)
 		cleanup_context(ct);
-	if (!running)
-		pause();
 }
 
 static int tur_running(struct tur_checker_context *ct)
@@ -242,7 +239,7 @@ static void copy_msg_to_tcc(void *ct_p, const char *msg)
 static void *tur_thread(void *ctx)
 {
 	struct tur_checker_context *ct = ctx;
-	int state;
+	int state, running;
 	char devt[32];
 
 	condlog(3, "%s: tur checker starting up",
@@ -268,6 +265,11 @@ static void *tur_thread(void *ctx)
 
 	condlog(3, "%s: tur checker finished, state %s",
 		tur_devt(devt, sizeof(devt), ct), checker_state_name(state));
+
+	running = uatomic_xchg(&ct->running, 0);
+	if (!running)
+		pause();
+
 	tur_thread_cleanup_pop(ct);
 
 	return ((void *)0);
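
For readers following along, the handshake in this patch can be sketched
outside of multipath-tools roughly as follows. This is only an illustration
under assumptions: C11 atomic_exchange() stands in for liburcu's
uatomic_xchg(), struct demo_ctx and all demo_* names are made up, the "go"
flag exists only to force an ordering, and the thread is left joinable
(the real tur thread is detached) so that the demo can wait for it.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-in for struct tur_checker_context. */
struct demo_ctx {
	atomic_int running;	/* 1 until either side claims the shutdown */
	atomic_int go;		/* demo-only hook to force an ordering */
	atomic_int cleanups;	/* how often the cleanup handler ran */
};

static void demo_cleanup(void *data)
{
	struct demo_ctx *c = data;

	/* No pause() here: this handler also runs after a cancellation. */
	atomic_fetch_add(&c->cleanups, 1);
}

static void *demo_thread(void *data)
{
	struct demo_ctx *c = data;

	pthread_cleanup_push(demo_cleanup, c);
	while (!atomic_load(&c->go))
		sched_yield();	/* the real path check would run here */
	if (atomic_exchange(&c->running, 0) == 0)
		for (;;)
			pause();	/* checker claimed us: wait for its cancel */
	pthread_cleanup_pop(1);
	return NULL;
}

/* Checker side: cancel only if we claimed the flag first. */
static int demo_stop(struct demo_ctx *c, pthread_t t)
{
	if (atomic_exchange(&c->running, 0) == 1) {
		pthread_cancel(t);
		return 1;
	}
	return 0;
}

/* Returns 1 if pthread_cancel() was issued; *cleanups gets the handler count. */
static int demo_run(int checker_first, int *cleanups)
{
	struct demo_ctx c;
	pthread_t t;
	int cancelled;

	assert(cleanups != NULL);
	atomic_init(&c.running, 1);
	atomic_init(&c.go, 0);
	atomic_init(&c.cleanups, 0);
	if (pthread_create(&t, NULL, demo_thread, &c) != 0)
		return -1;
	if (checker_first) {
		cancelled = demo_stop(&c, t);	/* claims the flag, cancels */
		atomic_store(&c.go, 1);
	} else {
		atomic_store(&c.go, 1);
		while (atomic_load(&c.running))
			sched_yield();	/* wait until the thread claimed the flag */
		cancelled = demo_stop(&c, t);	/* reads 0, does nothing */
	}
	pthread_join(t, NULL);	/* joinable only so the demo can wait */
	*cleanups = atomic_load(&c.cleanups);
	return cancelled;
}
```

demo_run(1, &n) replays "checker first" (the thread parks in pause() and is
cancelled there), demo_run(0, &n) replays "thread first" (no cancel is
issued). In both cases the cleanup handler should run exactly once.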


> 
> -- 
> Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)

> From afb9c7de3658d49c4f28f6b9ee618a87b806ecdd Mon Sep 17 00:00:00 2001
> From: Martin Wilck 
> Date: Sat, 10 Feb 2018 00:22:17 +0100
> Subject: [PATCH] tur checker: make sure pthread_cancel isn't called for exited
>  thread
> 
> If we enter the cleanup function as the result of a pthread_cancel by another
> thread, we don't need to wait for a cancellation any more. If we exit
> regularly, just tell the other thread not to try to cancel us.
> ---
>  libmultipath/checkers/tur.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> index 894ad41c89c3..5d2b36bfa883 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -214,15 +214,13 @@ retry:
>  
>  static void cleanup_func(void *data)
>  {
> - int running, holders;
> + int holders;
>   struct tur_checker_context *ct = data;
>  
> - running = uatomic_xchg(&ct->running, 0);
> + uatomic_set(&ct->running, 0);
>   holders = uatomic_sub_return(&ct->holders, 1);
>   if (!holders)
>   cleanup_context(ct);
> - if (!running)
> - pause();
>  }
>  
>  static int tur_running(struct tur_checker_context *ct)
> @@ -266,6 +264,9 @@ static void *tur_thread(void *ctx)
>   pthread_cond_signal(&ct->active);
>   pthread_mutex_unlock(&ct->lock);
>  
> + /* Tell main checker thread not to cancel us, as we exit anyway */
> + uatomic_set(&ct->running, 0);
> +
>   condlog(3, "%s: tur checker finished, state %s",
>   tur_devt(devt, sizeof(devt), ct), checker_state_name(state));
>   tur_thread_cleanup_pop(ct);
> -- 
> 2.16.1
> 


Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Benjamin Marzinski
On Sat, Feb 10, 2018 at 12:36:05AM +0100, Martin Wilck wrote:
> On Sat, 2018-02-10 at 00:28 +0100, Martin Wilck wrote:
> > Maybe it's easier than we thought. Attached is a patch on top of
> > yours that I think might work, please have a look. 
> > 
> 
> That one didn't even compile. This one is better.
> 
> Martin

So if we have this ordering

- checker calls uatomic_xchg() which returns 1 and then gets scheduled out
- thread calls uatomic_set() and then runs till it terminates
- checker calls pthread_cancel()

You will get Bart's original bug.  I realize that having the condlog()
after the uatomic_set() in the thread makes this unlikely, but I don't
like races like this. I would be happier with simply taking the original
code and moving the condlog(), if neither of my other two options are
acceptable.
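
The ordering above can be replayed step by step in a single thread. The
following is a toy model, not the tur.c code: plain ints stand in for the
atomic ct->running, and replay_ordering() and its enum are hypothetical
names. It only tries to show why a plain store in the thread cannot close
the window, while an exchange lets the thread notice that the checker has
already claimed it.

```c
#include <assert.h>

enum replay_result { REPLAY_SAFE, REPLAY_CANCEL_AFTER_EXIT };

/*
 * Single-threaded replay of the three steps listed above.
 * thread_uses_xchg == 0: the thread does a plain store (uatomic_set-like).
 * thread_uses_xchg == 1: the thread exchanges and pauses if it reads 0.
 */
static enum replay_result replay_ordering(int thread_uses_xchg)
{
	int running = 1;
	int thread_exited;
	int checker_prev, thread_prev;

	/* 1. checker exchanges, reads 1, is preempted before pthread_cancel() */
	checker_prev = running;
	running = 0;

	/* 2. thread reaches its exit path */
	if (thread_uses_xchg) {
		thread_prev = running;
		running = 0;
		/* reading 0 means "park in pause()", so the thread stays alive */
		thread_exited = (thread_prev == 1);
	} else {
		running = 0;
		thread_exited = 1;	/* never learns about the checker */
	}

	/* 3. checker resumes and cancels because it read 1 in step 1 */
	assert(checker_prev == 1);
	return thread_exited ? REPLAY_CANCEL_AFTER_EXIT : REPLAY_SAFE;
}
```

replay_ordering(0) models the uatomic_set() variant and ends with a cancel
aimed at an already exited thread; replay_ordering(1) models the exchange
variant, where the thread is still around when the cancel arrives.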

-Ben

> 
> -- 
> Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)

> From afb9c7de3658d49c4f28f6b9ee618a87b806ecdd Mon Sep 17 00:00:00 2001
> From: Martin Wilck 
> Date: Sat, 10 Feb 2018 00:22:17 +0100
> Subject: [PATCH] tur checker: make sure pthread_cancel isn't called for exited
>  thread
> 
> If we enter the cleanup function as the result of a pthread_cancel by another
> thread, we don't need to wait for a cancellation any more. If we exit
> regularly, just tell the other thread not to try to cancel us.
> ---
>  libmultipath/checkers/tur.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
> index 894ad41c89c3..5d2b36bfa883 100644
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -214,15 +214,13 @@ retry:
>  
>  static void cleanup_func(void *data)
>  {
> - int running, holders;
> + int holders;
>   struct tur_checker_context *ct = data;
>  
> - running = uatomic_xchg(&ct->running, 0);
> + uatomic_set(&ct->running, 0);
>   holders = uatomic_sub_return(&ct->holders, 1);
>   if (!holders)
>   cleanup_context(ct);
> - if (!running)
> - pause();
>  }
>  
>  static int tur_running(struct tur_checker_context *ct)
> @@ -266,6 +264,9 @@ static void *tur_thread(void *ctx)
>   pthread_cond_signal(&ct->active);
>   pthread_mutex_unlock(&ct->lock);
>  
> + /* Tell main checker thread not to cancel us, as we exit anyway */
> + uatomic_set(&ct->running, 0);
> +
>   condlog(3, "%s: tur checker finished, state %s",
>   tur_devt(devt, sizeof(devt), ct), checker_state_name(state));
>   tur_thread_cleanup_pop(ct);
> -- 
> 2.16.1
> 


Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Martin Wilck
On Fri, 2018-02-09 at 17:04 -0600, Benjamin Marzinski wrote:
> On Fri, Feb 09, 2018 at 09:30:56PM +0100, Martin Wilck wrote:
> > On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote:
> > > > ct->running is now an atomic variable.  When the thread is
> > > > started it is set to 1. When the checker wants to kill a thread,
> > > > it atomically sets the value to 0 and reads the previous value.
> > > > If it was 1, the checker cancels the thread. If it was 0, then
> > > > nothing needs to be done.  After the checker has dealt with the
> > > > thread, it sets ct->thread to NULL.
> > > > 
> > > > When the thread is done, it atomically sets the value of
> > > > ct->running to 0 and reads the previous value. If it was 1, the
> > > > thread just exits. If it was 0, then the checker is trying to
> > > > cancel the thread, and so the thread calls pause(), which is a
> > > > cancellation point.
> > > 
> > 
> > I'm missing one thing here. My poor brain is aching.
> > 
> > > cleanup_func() can be entered in two ways: a) if the thread has
> > > been cancelled and passed a cancellation point already, or b) if
> > > it exits normally and calls pthread_cleanup_pop().
> > > In case b), waiting for the cancellation request by calling pause()
> > > makes sense to me. But in case a), the thread has already seen the
> > > cancellation request - wouldn't calling pause() cause it to sleep
> > > forever?
> 
> Urgh. You're right. If it is in the cleanup helper because it already
> has been cancelled, then the pause isn't going to get cancelled. So much
> for my quick rewrite.

Maybe it's easier than we thought. Attached is a patch on top of yours
that I think might work, please have a look. 

It's quite late here, so I'll need to ponder your alternatives below
another day.

Cheers
Martin

> 
> That leaves three options.
> 
> 1. have either the thread or the checker detach the thread (depending
>    on which one exits first)
> 2. make the checker always cancel and detach the thread. This
>    simplifies the code, but there will be zombie threads hanging
>    around between calls to the checker.
> 3. just move the condlog
> 
> I really don't care which one we pick anymore.
> 
> -Ben
> 
> > 
> > Martin
> > 
> > -- 
> > Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
> > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham
> > Norton
> > HRB 21284 (AG Nürnberg)
> 
> 

From 831ef27b41858fa248201b74f2dd8ea5b7c4aece Mon Sep 17 00:00:00 2001
From: Martin Wilck 
Date: Sat, 10 Feb 2018 00:22:17 +0100
Subject: [PATCH] tur checker: make sure pthread_cancel isn't called for exited
 thread

If we enter the cleanup function as the result of a pthread_cancel by another
thread, we don't need to wait for a cancellation any more. If we exit
regularly, just tell the other thread not to try to cancel us.
---
 libmultipath/checkers/tur.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index 894ad41c89c3..31a87d2b5cf2 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -221,8 +221,6 @@ static void cleanup_func(void *data)
 	holders = uatomic_sub_return(&ct->holders, 1);
 	if (!holders)
 		cleanup_context(ct);
-	if (!running)
-		pause();
 }
 
 static int tur_running(struct tur_checker_context *ct)
@@ -266,6 +264,9 @@ static void *tur_thread(void *ctx)
 	pthread_cond_signal(&ct->active);
 	pthread_mutex_unlock(&ct->lock);
 
+	/* Tell main checker thread not to cancel us, as we exit anyway */
+	running = uatomic_xchg(&ct->running, 0);
+
 	condlog(3, "%s: tur checker finished, state %s",
 		tur_devt(devt, sizeof(devt), ct), checker_state_name(state));
 	tur_thread_cleanup_pop(ct);
-- 
2.16.1

Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Martin Wilck
On Sat, 2018-02-10 at 00:28 +0100, Martin Wilck wrote:
> Maybe it's easier than we thought. Attached is a patch on top of
> yours that I think might work, please have a look. 
> 

That one didn't even compile. This one is better.

Martin

From afb9c7de3658d49c4f28f6b9ee618a87b806ecdd Mon Sep 17 00:00:00 2001
From: Martin Wilck 
Date: Sat, 10 Feb 2018 00:22:17 +0100
Subject: [PATCH] tur checker: make sure pthread_cancel isn't called for exited
 thread

If we enter the cleanup function as the result of a pthread_cancel by another
thread, we don't need to wait for a cancellation any more. If we exit
regularly, just tell the other thread not to try to cancel us.
---
 libmultipath/checkers/tur.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index 894ad41c89c3..5d2b36bfa883 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -214,15 +214,13 @@ retry:
 
 static void cleanup_func(void *data)
 {
-	int running, holders;
+	int holders;
 	struct tur_checker_context *ct = data;
 
-	running = uatomic_xchg(&ct->running, 0);
+	uatomic_set(&ct->running, 0);
 	holders = uatomic_sub_return(&ct->holders, 1);
 	if (!holders)
 		cleanup_context(ct);
-	if (!running)
-		pause();
 }
 
 static int tur_running(struct tur_checker_context *ct)
@@ -266,6 +264,9 @@ static void *tur_thread(void *ctx)
 	pthread_cond_signal(&ct->active);
 	pthread_mutex_unlock(&ct->lock);
 
+	/* Tell main checker thread not to cancel us, as we exit anyway */
+	uatomic_set(&ct->running, 0);
+
 	condlog(3, "%s: tur checker finished, state %s",
 		tur_devt(devt, sizeof(devt), ct), checker_state_name(state));
 	tur_thread_cleanup_pop(ct);
-- 
2.16.1

Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Benjamin Marzinski
On Fri, Feb 09, 2018 at 09:30:56PM +0100, Martin Wilck wrote:
> On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote:
> > ct->running is now an atomic variable.  When the thread is started
> > it is set to 1. When the checker wants to kill a thread, it
> > atomically sets the value to 0 and reads the previous value.  If it
> > was 1, the checker cancels the thread. If it was 0, then nothing
> > needs to be done.  After the checker has dealt with the thread, it
> > sets ct->thread to NULL.
> > 
> > When the thread is done, it atomically sets the value of ct->running
> > to 0 and reads the previous value. If it was 1, the thread just
> > exits. If it was 0, then the checker is trying to cancel the thread,
> > and so the thread calls pause(), which is a cancellation point.
> > 
> 
> I'm missing one thing here. My poor brain is aching.
> 
> cleanup_func() can be entered in two ways: a) if the thread has been
> cancelled and passed a cancellation point already, or b) if it exits
> normally and calls pthread_cleanup_pop(). 
> In case b), waiting for the cancellation request by calling pause()
> makes sense to me. But in case a), the thread has already seen the
> cancellation request - wouldn't calling pause() cause it to sleep
> forever?

Urgh. You're right. If it is in the cleanup helper because it already
has been cancelled, then the pause isn't going to get cancelled. So much
for my quick rewrite.

That leaves three options.

1. have either the thread or the checker detach the thread (depending
   on which one exits first)
2. make the checker always cancel and detach the thread. This simplifies
   the code, but there will be zombie threads hanging around between
   calls to the checker.
3. just move the condlog

I really don't care which one we pick anymore.

-Ben

> 
> Martin
> 
> -- 
> Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)


Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Martin Wilck
On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote:
> ct->running is now an atomic variable.  When the thread is started
> it is set to 1. When the checker wants to kill a thread, it atomically
> sets the value to 0 and reads the previous value.  If it was 1,
> the checker cancels the thread. If it was 0, then nothing needs to be
> done.  After the checker has dealt with the thread, it sets ct->thread
> to NULL.
> 
> When the thread is done, it atomically sets the value of ct->running
> to 0 and reads the previous value. If it was 1, the thread just exits.
> If it was 0, then the checker is trying to cancel the thread, and so
> the thread calls pause(), which is a cancellation point.
> 

I'm missing one thing here. My poor brain is aching.

cleanup_func() can be entered in two ways: a) if the thread has been
cancelled and passed a cancellation point already, or b) if it exits
normally and calls pthread_cleanup_pop(). 
In case b), waiting for the cancellation request by calling pause()
makes sense to me. But in case a), the thread has already seen the
cancellation request - wouldn't calling pause() cause it to sleep
forever?
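
A small sketch of the two entry paths, assuming plain POSIX threads;
count_cleanup(), cleanup_demo_thread() and cleanup_demo() are made-up
names, not tur.c functions. The handler is reached once when the thread is
cancelled inside pause() (case a) and once when it falls through to
pthread_cleanup_pop(1) (case b). The handler cannot tell the two apart by
itself, which is why a pause() placed inside it looks problematic for
case a).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_int handler_runs;

static void count_cleanup(void *unused)
{
	(void)unused;
	atomic_fetch_add(&handler_runs, 1);
}

static void *cleanup_demo_thread(void *arg)
{
	int wait_for_cancel = *(int *)arg;

	pthread_cleanup_push(count_cleanup, NULL);
	if (wait_for_cancel)
		for (;;)
			pause();	/* case a): handler runs from here on cancel */
	pthread_cleanup_pop(1);		/* case b): handler runs on normal exit */
	return NULL;
}

/* Runs one thread and returns how often the handler ran for it. */
static int cleanup_demo(int cancel_it, int *was_cancelled)
{
	pthread_t t;
	void *res = NULL;
	int before = atomic_load(&handler_runs);

	assert(was_cancelled != NULL);
	if (pthread_create(&t, NULL, cleanup_demo_thread, &cancel_it) != 0)
		return -1;
	if (cancel_it)
		pthread_cancel(t);
	pthread_join(t, &res);
	*was_cancelled = (res == PTHREAD_CANCELED);
	return atomic_load(&handler_runs) - before;
}
```

cleanup_demo(1, &c) exercises case a) and cleanup_demo(0, &c) case b); the
handler count should be 1 either way, with only the join result telling
the two apart.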

Martin

Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Bart Van Assche
On Fri, 2018-02-09 at 11:26 -0600, Benjamin Marzinski wrote:
> On Fri, Feb 09, 2018 at 04:15:34PM +, Bart Van Assche wrote:
> > On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote:
> > >  static void cleanup_func(void *data)
> > >  {
> > > - int holders;
> > > + int running, holders;
> > >   struct tur_checker_context *ct = data;
> > > - pthread_spin_lock(>hldr_lock);
> > > - ct->holders--;
> > > - holders = ct->holders;
> > > - ct->thread = 0;
> > > - pthread_spin_unlock(>hldr_lock);
> > > +
> > > + running = uatomic_xchg(>running, 0);
> > > + holders = uatomic_sub_return(>holders, 1);
> > >   if (!holders)
> > >   cleanup_context(ct);
> > > + if (!running)
> > > + pause();
> > >  }
> > 
> > Hello Ben,
> > 
> > > Why has the pause() call been added? I think it is safe to call
> > > pthread_cancel() for a non-detached thread that has finished so I
> > > don't think that pause() call is necessary.
> 
> Martin objected to having the threads getting detached as part of
> cancelling them (I think. I'm a little fuzzy on what he didn't like).
> But he definitely said he preferred the thread to start detached, so in
> this version, it does.  That's why we need the pause().  If he's fine with
> the threads getting detached later, I will happily replace the pause()
> with
> 
> if (running)
>   pthread_detach(pthread_self());
> 
> and add pthread_detach(ct->thread) after the calls to
> pthread_cancel(ct->thread). Otherwise we need the pause() to solve your
> original bug.

Ah, thanks, I had overlooked that the tur checker detaches the checker
thread. Have you considered adding a comment above the pause() call that
explains the purpose of that call?

Thanks,

Bart.





Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Benjamin Marzinski
On Fri, Feb 09, 2018 at 04:15:34PM +, Bart Van Assche wrote:
> On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote:
> >  static void cleanup_func(void *data)
> >  {
> > -   int holders;
> > +   int running, holders;
> > struct tur_checker_context *ct = data;
> > -   pthread_spin_lock(&ct->hldr_lock);
> > -   ct->holders--;
> > -   holders = ct->holders;
> > -   ct->thread = 0;
> > -   pthread_spin_unlock(&ct->hldr_lock);
> > +
> > +   running = uatomic_xchg(&ct->running, 0);
> > +   holders = uatomic_sub_return(&ct->holders, 1);
> > if (!holders)
> > cleanup_context(ct);
> > +   if (!running)
> > +   pause();
> >  }
> 
> Hello Ben,
> 
> Why has the pause() call been added? I think it is safe to call
> pthread_cancel() for a non-detached thread that has finished so I don't
> think that pause() call is necessary.

Martin objected to having the threads getting detached as part of
cancelling them (I think. I'm a little fuzzy on what he didn't like).
But he definitely said he preferred the thread to start detached, so in
this version, it does.  That's why we need the pause().  If he's fine with
the threads getting detached later, I will happily replace the pause()
with

if (running)
	pthread_detach(pthread_self());

and add pthread_detach(ct->thread) after the calls to
pthread_cancel(ct->thread). Otherwise we need the pause() to solve your
original bug.
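
A rough sketch of that detach-later variant, under assumptions: C11
atomic_exchange() stands in for uatomic_xchg(), struct detach_ctx and all
detach_* names are hypothetical, and the "go" and "finished" flags exist
only so that the demo can force an ordering and wait for a detached
thread. The thread starts joinable; whichever side wins the exchange does
the pthread_detach(), so the thread ID should still be valid whenever the
checker calls pthread_cancel().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct tur_checker_context. */
struct detach_ctx {
	atomic_int running;		/* 1 until either side claims the shutdown */
	atomic_int go;			/* demo-only hook to force an ordering */
	atomic_int finished;		/* set by the cleanup handler */
	atomic_int self_detached;	/* 1 if the thread detached itself */
};

static void detach_cleanup(void *data)
{
	struct detach_ctx *c = data;

	atomic_store(&c->finished, 1);	/* last access to *c by the thread */
}

static void *detach_thread(void *data)
{
	struct detach_ctx *c = data;

	pthread_cleanup_push(detach_cleanup, c);
	while (!atomic_load(&c->go))
		sched_yield();	/* the real path check would run here */
	if (atomic_exchange(&c->running, 0) == 1) {
		/* Nobody will cancel us any more: release our own resources. */
		pthread_detach(pthread_self());
		atomic_store(&c->self_detached, 1);
	}
	pthread_cleanup_pop(1);
	return NULL;
}

/* Checker side: returns 1 if it cancelled and detached the thread. */
static int detach_stop(struct detach_ctx *c, pthread_t t)
{
	if (atomic_exchange(&c->running, 0) == 1) {
		pthread_cancel(t);	/* thread is neither detached nor joined yet */
		pthread_detach(t);
		return 1;
	}
	return 0;
}

/* Returns how many pthread_detach() calls were made for one thread. */
static int detach_demo(int checker_first)
{
	struct detach_ctx c;
	pthread_t t;
	int by_checker;

	atomic_init(&c.running, 1);
	atomic_init(&c.go, 0);
	atomic_init(&c.finished, 0);
	atomic_init(&c.self_detached, 0);
	if (pthread_create(&t, NULL, detach_thread, &c) != 0)
		return -1;
	if (checker_first) {
		by_checker = detach_stop(&c, t);
		atomic_store(&c.go, 1);
		while (!atomic_load(&c.finished))
			sched_yield();
	} else {
		atomic_store(&c.go, 1);
		while (!atomic_load(&c.finished))
			sched_yield();
		by_checker = detach_stop(&c, t);
	}
	assert(by_checker == 0 || by_checker == 1);
	return by_checker + atomic_load(&c.self_detached);
}
```

In both orderings exactly one side should end up detaching the thread, so
detach_demo() is expected to return 1 for either argument.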

As an aside, Martin, if your problem is the thread detaching itself, we
can skip that if we are fine with a zombie thread hanging around until
the next time we call libcheck_check() or libcheck_free(). Then the
checker can always be in charge of detaching the thread.
 
> >  static int tur_running(struct tur_checker_context *ct)
> >  {
> > -   pthread_t thread;
> > -
> > -   pthread_spin_lock(&ct->hldr_lock);
> > -   thread = ct->thread;
> > -   pthread_spin_unlock(&ct->hldr_lock);
> > -
> > -   return thread != 0;
> > +   return (uatomic_read(&ct->running) != 0);
> >  }
> 
> Is such a one line function really useful?

Nope. I just left it there to keep the number of changes that the patch
makes lower, to make it more straightforward to review. I'm fine with
inlining it.

> I think the code will be easier to read if this function is inlined
> into its callers.

> > @@ -418,8 +396,12 @@ int libcheck_check(struct checker * c)
> > (tur_status == PATH_PENDING || tur_status == PATH_UNCHECKED)) {
> > condlog(3, "%s: tur checker still running",
> > tur_devt(devt, sizeof(devt), ct));
> > -   ct->running = 1;
> > tur_status = PATH_PENDING;
> > +   } else {
> > +   int running = uatomic_xchg(&ct->running, 0);
> > +   if (running)
> > +   pthread_cancel(ct->thread);
> > +   ct->thread = 0;
> > }
> > }
> 
> Why has this pthread_cancel() call been added? I think that else clause
> can only be reached if ct->running == 0 so I don't think that the
> pthread_cancel() call will ever be reached.

It can be reached if ct->running is 1, as long as tur_status has been
updated.  In practice this means that the thread should have done
everything it needs to do, and all that's left is for it to shut down.
However, if the thread doesn't shut itself down before the next time you
call libcheck_check(), the checker will give up and return PATH_TIMEOUT.

It seems pretty unlikely that this will happen, since there should be a
significant delay before calling libcheck_check() again. This
theoretical race has been in the code for a while, and AFAIK, it's never
occurred. But there definitely is the possibility that the thread will
still be running at the end of libcheck_check(), and it doesn't hurt
things to forcibly shut the thread down, if it is.

-Ben

> Thanks,
> 
> Bart.
> 
> 
> 


Re: [dm-devel] [PATCH v2 1/7] libmultipath: fix tur checker locking

2018-02-09 Thread Bart Van Assche
On Thu, 2018-02-08 at 17:56 -0600, Benjamin Marzinski wrote:
>  static void cleanup_func(void *data)
>  {
> - int holders;
> + int running, holders;
>   struct tur_checker_context *ct = data;
> - pthread_spin_lock(&ct->hldr_lock);
> - ct->holders--;
> - holders = ct->holders;
> - ct->thread = 0;
> - pthread_spin_unlock(&ct->hldr_lock);
> +
> + running = uatomic_xchg(&ct->running, 0);
> + holders = uatomic_sub_return(&ct->holders, 1);
>   if (!holders)
>   cleanup_context(ct);
> + if (!running)
> + pause();
>  }

Hello Ben,

Why has the pause() call been added? I think it is safe to call pthread_cancel()
for a non-detached thread that has finished so I don't think that pause() call
is necessary.
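
A minimal sketch of that claim, assuming plain POSIX threads and a
joinable thread; short_thread() and cancel_after_finish() are hypothetical
names. As far as I understand POSIX, the ID of a joinable thread stays
valid until it is joined or detached, so a late pthread_cancel() should be
harmless and the following pthread_join() should still succeed. The
sketch deliberately ignores the return value of pthread_cancel(), and it
says nothing about detached threads.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int short_thread_done;

static void *short_thread(void *unused)
{
	(void)unused;
	atomic_store(&short_thread_done, 1);
	return NULL;
}

/* Returns pthread_join()'s result after a late pthread_cancel(). */
static int cancel_after_finish(void)
{
	pthread_t t;

	atomic_store(&short_thread_done, 0);
	if (pthread_create(&t, NULL, short_thread, NULL) != 0)
		return -1;
	while (!atomic_load(&short_thread_done))
		sched_yield();
	/* The thread has finished its work; it has not been joined yet. */
	(void)pthread_cancel(t);
	return pthread_join(t, NULL);
}
```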
 
>  static int tur_running(struct tur_checker_context *ct)
>  {
> - pthread_t thread;
> -
> - pthread_spin_lock(&ct->hldr_lock);
> - thread = ct->thread;
> - pthread_spin_unlock(&ct->hldr_lock);
> -
> - return thread != 0;
> + return (uatomic_read(&ct->running) != 0);
>  }

Is such a one line function really useful? I think the code will be easier
to read if this function is inlined into its callers.

> @@ -418,8 +396,12 @@ int libcheck_check(struct checker * c)
>   (tur_status == PATH_PENDING || tur_status == PATH_UNCHECKED)) {
>   condlog(3, "%s: tur checker still running",
>   tur_devt(devt, sizeof(devt), ct));
> - ct->running = 1;
>   tur_status = PATH_PENDING;
> + } else {
> + int running = uatomic_xchg(&ct->running, 0);
> + if (running)
> + pthread_cancel(ct->thread);
> + ct->thread = 0;
>   }
>   }

Why has this pthread_cancel() call been added? I think that else clause can
only be reached if ct->running == 0 so I don't think that the
pthread_cancel() call will ever be reached.

Thanks,

Bart.



