Re: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t

2016-10-31 Thread Sebastian Andrzej Siewior
On 2016-10-28 22:24:52 [+], Trond Myklebust wrote:
> > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> > index 5f4281ec5f72..a442d9867942 100644
> > --- a/fs/nfs/nfs4state.c
> > +++ b/fs/nfs/nfs4state.c
> > @@ -1497,8 +1498,18 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> >  * recovering after a network partition or a reboot from a
> >  * server that doesn't support a grace period.
> >  */
> > +   /*
> > +* XXX
> > +* This mutex is wrong. It protects against multiple writers. However
> 
> There is only 1 recovery thread per client/server pair, which is why we know 
> there is only a single writer. No need for a mutex here.

This isn't documented, but it should be.

> > +* write_seqlock() should have been used for this task. This would avoid
> > +* preemption while the seqlock is held which is good because the writer
> > +* shouldn't be preempted as it would let the reader spin for no good
> 
> Recovery involves sending RPC calls to a server. There is no avoiding 
> preemption if we have to recover state. The point of the sequence counter is 
> to signal to processes that may have raced with this recovery thread that 
> they may need to replay their locks.

Then the sequence counter is the wrong mechanism here. As I
explained: the call can be preempted, and once it is, the reader will
spin and waste cycles.
What is wrong with an rwsem? Aren't the readers preemptible?
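For illustration, here is a minimal userspace model of the seqcount read/write protocol (hypothetical code, not the kernel's seqlock.h and not NFS code) showing why a preempted writer leaves readers spinning:

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy seqcount: an odd value means "writer active"; readers retry on
 * change. This sketches the protocol only, not the kernel implementation. */
typedef struct { atomic_uint seq; } toy_seqcount;

static void toy_write_begin(toy_seqcount *s) { atomic_fetch_add(&s->seq, 1); } /* -> odd  */
static void toy_write_end(toy_seqcount *s)   { atomic_fetch_add(&s->seq, 1); } /* -> even */

static unsigned toy_read_begin(toy_seqcount *s)
{
	unsigned v;
	/* A reader busy-waits here for as long as the count is odd, i.e.
	 * for as long as the writer runs -- or is preempted. */
	while ((v = atomic_load(&s->seq)) & 1)
		;
	return v;
}

static int toy_read_retry(toy_seqcount *s, unsigned start)
{
	/* Nonzero: a writer ran in between; the reader must redo its work. */
	return atomic_load(&s->seq) != start;
}
```

If the writer is preempted between toy_write_begin() and toy_write_end(), every reader sits in that while loop burning cycles, which is the -RT objection above.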

> > +* reason. There are a few memory allocations and kthread_run() so we
> > +* have this mutex now.
> > +*/
> > +   mutex_lock(&sp->so_reclaim_seqlock_mutex);
> > +   write_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
> > spin_lock(&sp->so_lock);
> > -   raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
> > restart:
> > list_for_each_entry(state, &sp->so_states, open_states) {
> > if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
> > @@ -1566,14 +1577,14 @@ static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs
> > spin_lock(&sp->so_lock);
> > goto restart;
> > }
> > -   raw_write_seqcount_end(&sp->so_reclaim_seqcount);
> > spin_unlock(&sp->so_lock);
> > +   write_seqcount_end(&sp->so_reclaim_seqlock.seqcount);
> 
> This will reintroduce lockdep checking. We don’t need or want that...

Why isn't lockdep welcome? And what kind of warning will it introduce? How
do I trigger it? I tried various things but never got close.

Sebastian


Re: [PATCH v2] NFSv4: replace seqcount_t with a seqlock_t

2016-10-28 Thread Trond Myklebust

> On Oct 28, 2016, at 17:05, Sebastian Andrzej Siewior  
> wrote:
> 
> The raw_write_seqcount_begin() in nfs4_reclaim_open_state() bugs me
> because it maps to preempt_disable() in -RT which I can't have at this
> point. So I took a look at the code.
> The lockdep part was removed in commit abbec2da13f0 ("NFS: Use
> raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because
> lockdep complained. The whole seqcount thing was introduced in commit
> c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as
> being recovered").
> I don't understand how it is possible that we don't end up with two
> writers for the same resource, because the `sp->so_lock' lock is dropped
> so soon in the list_for_each_entry() loop. It might be the
> test_and_clear_bit() check in nfs4_do_reclaim(), but that only clears one
> bit on each iteration, so I *think* we could have two invocations of
> nfs4_reclaim_open_state() on the same struct nfs4_state_owner.
> So there is that.
> 
> But back to the list_for_each_entry() macro.
> It seems that this lock protects the ->so_states list among other
> atomic_t & flags members. At the beginning of the loop we increment
> ->count, ensuring that the entry is not removed while we use it, and
> then we drop the ->so_lock lock for the rest of the iteration. After the
> nfs4_reclaim_locks() invocation we nfs4_put_open_state() and grab the
> ->so_lock again. So if we were the last user of this struct and it gets
> removed, the following list_next_entry() invocation is a use-after-free.
> Even if we used list_for_each_entry_safe() there would be no guarantee
> that the next entry is still valid, because it might have been removed
> by another writer, right?
> So there is this.
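The restart pattern the loop relies on can be sketched in plain userspace C (hypothetical names, not the fs/nfs code): because the iterator may be stale once the lock has been dropped, the only safe resumption point is the list head, and the per-entry flag keeps the restart from reprocessing entries.

```c
#include <assert.h>
#include <stddef.h>

struct toy_state {
	struct toy_state *next;
	int pending; /* stands in for test_and_clear_bit(ops->state_flag_bit, ...) */
	int visits;
};

/* Sketch of the nfs4_reclaim_open_state() walk: each time an entry is
 * processed the lock would be dropped, so 'st->next' may no longer be
 * valid afterwards, and the walk restarts from the (still valid) head. */
static void toy_reclaim_all(struct toy_state *head)
{
restart:
	for (struct toy_state *st = head; st != NULL; st = st->next) {
		if (!st->pending)
			continue;
		st->pending = 0;
		st->visits++;
		/* lock dropped here in the real code; do not touch st->next */
		goto restart;
	}
}
```

Each entry is still processed exactly once, since the cleared flag makes the restart skip everything already handled.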
> 
> However, to address my initial problem I have a patch here :) It uses
> a seqlock_t, which ensures that there is only one writer at a time. So it
> should be basically what is happening now, plus a tiny lock, plus lockdep
> coverage. I tried to trigger this myself but didn't manage to get into
> this code path at all, so I might be doing something wrong.
> 
> Could you please check if this patch is working for you and whether my
> list_for_each_entry() observation is correct or not?
> 
> v1…v2: write_seqlock() disables preemption, and some functions called
> here need it enabled (kthread_run(), non-GFP_ATOMIC memory allocations).
> We can't leave preemption disabled, yet a preempted writer would leave
> the readers spinning. Hence this duct-tape mutex. Maybe the seqlock
> should go.
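The resulting combination can be modeled in userspace roughly as follows (a sketch only: pthread_mutex_t stands in for the kernel mutex, a plain atomic counter for the seqcount, and all names are made up, not the patch itself). The mutex serializes writers and may sleep, while readers merely sample the counter and retry if it changed, instead of spinning behind a preempted writer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Sketch of the patch's scheme: a sleeping lock for writer exclusion plus
 * a sequence counter that readers use only to detect a concurrent
 * recovery pass. Not kernel code; names are invented for illustration. */
typedef struct {
	pthread_mutex_t writer_mutex; /* like so_reclaim_seqlock_mutex */
	atomic_uint seq;              /* like so_reclaim_seqlock.seqcount */
} toy_reclaim_sync;

static void toy_reclaim_begin(toy_reclaim_sync *r)
{
	pthread_mutex_lock(&r->writer_mutex); /* may sleep: preemption stays on */
	atomic_fetch_add(&r->seq, 1);         /* odd: recovery in progress */
}

static void toy_reclaim_end(toy_reclaim_sync *r)
{
	atomic_fetch_add(&r->seq, 1);         /* even again */
	pthread_mutex_unlock(&r->writer_mutex);
}

/* Reader side, modeled on raw_seqcount_begin(): sample with the low bit
 * masked so the reader never spins, then compare after doing its work. */
static unsigned toy_reclaim_read_begin(toy_reclaim_sync *r)
{
	return atomic_load(&r->seq) & ~1u;
}

static int toy_reclaim_read_retry(toy_reclaim_sync *r, unsigned start)
{
	return atomic_load(&r->seq) != start; /* nonzero: replay, e.g. -EAGAIN */
}
```

A reader that raced with a recovery pass sees a changed counter and replays its operation, which is the behavior Trond describes above.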
> 
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
> fs/nfs/delegation.c |  4 ++--
> fs/nfs/nfs4_fs.h    |  3 ++-
> fs/nfs/nfs4proc.c   |  4 ++--
> fs/nfs/nfs4state.c  | 23 +--
> 4 files changed, 23 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index dff600ae0d74..d726d2e09353 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -150,11 +150,11 @@ static int nfs_delegation_claim_opens(struct inode *inode,
>   sp = state->owner;
>   /* Block nfs4_proc_unlck */
>   mutex_lock(&sp->so_delegreturn_mutex);
> - seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
> + seq = read_seqbegin(&sp->so_reclaim_seqlock);
>   err = nfs4_open_delegation_recall(ctx, state, stateid, type);
>   if (!err)
>   err = nfs_delegation_claim_locks(ctx, state, stateid);
> - if (!err && read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
> + if (!err && read_seqretry(&sp->so_reclaim_seqlock, seq))
>   err = -EAGAIN;
>   mutex_unlock(&sp->so_delegreturn_mutex);
>   put_nfs_open_context(ctx);
> diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
> index 9b3a82abab07..2fee1a2e8b57 100644
> --- a/fs/nfs/nfs4_fs.h
> +++ b/fs/nfs/nfs4_fs.h
> @@ -111,7 +111,8 @@ struct nfs4_state_owner {
>   unsigned long    so_flags;
>   struct list_head so_states;
>   struct nfs_seqid_counter so_seqid;
> - seqcount_t   so_reclaim_seqcount;
> + seqlock_t    so_reclaim_seqlock;
> + struct mutex so_reclaim_seqlock_mutex;
>   struct mutex so_delegreturn_mutex;
> };
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 7897826d7c51..9b9d53cd85f9 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -2685,7 +2685,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
>   unsigned int seq;
>   int ret;
> 
> - seq = raw_seqcount_begin(&sp->so_reclaim_seqcount);
> + seq = raw_seqcount_begin(&sp->so_reclaim_seqlock.seqcount);
> 
>   ret = _nfs4_proc_open(opendata);
>   if (ret != 0)
> @@ -2723,7 +2723,7 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
>   ctx->state = state;
>   if (d_inode(dentry) == state->inode) {
>   nfs_inode_attach_open_context(ctx);
> - if (read_seqcount_retry(&sp->so_reclaim_seqcount, seq))
> +