Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-26 Thread NeilBrown
On Mon, 27 May 2024, Richard Kojedzinszky wrote:
> Dear Neil,
> 
> I was running it on arm64, could that be the reason?

That could certainly explain it.  I know x86_64 almost never needs
barriers, though I have seen cases where they matter.  ppc64 is very
sensitive.  A quick search suggests that arm64 does sometimes need
barriers.
I don't have arm64 hardware to test on, but I'm happy with your
test results.
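
As a minimal illustration of the pairing involved, here is a userspace
sketch using C11 atomics.  It is illustrative only, not the kernel code;
the patch's smp_store_release()/smp_load_acquire() pair the same way as
the explicit release store and acquire load below:

/* Userspace analogue of the barrier pairing in the patch (C11 atomics). */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static _Atomic int blocked = 1;   /* stands in for d_fsdata == NFS_FSDATA_BLOCKED */
static int result;                /* data the waiter must observe, e.g. unlink outcome */

static void *unlinker(void *arg)
{
        result = 42;    /* plain store, published by the release below */
        atomic_store_explicit(&blocked, 0, memory_order_release);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, unlinker, NULL);
        /* acquire: once 0 is observed, the store to result is visible too;
         * with relaxed loads, arm64 could legally see result == 0 here */
        while (atomic_load_explicit(&blocked, memory_order_acquire))
                ;       /* the kernel sleeps in wait_var_event() instead */
        printf("result = %d\n", result);
        pthread_join(t, NULL);
        return 0;
}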

Thanks,
NeilBrown



Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-26 Thread Richard Kojedzinszky
Dear Neil,

I was running it on arm64, could that be the reason?

Regards,
Richard



Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-26 Thread NeilBrown
On Sun, 26 May 2024, Richard Kojedzinszky wrote:
> Dear Neil,
> 
> According to my quick tests, your patch seems to fix this bug. Could you 
> also try my attached code? Can you reproduce the bug with it?

Thanks for testing.

I can run your test code but it isn't triggering the bug (90 minutes so
far).  Possibly a different compiler used for the kernel, possibly
hardware differences (I'm running under qemu).  Bugs related to barriers
(which this one seems to be) need just the right circumstances to
trigger so they can be hard to reproduce on a different system.

I've made some cosmetic improvements to the patch and will post it to
the NFS maintainers.

Thanks again,
NeilBrown





Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-25 Thread Richard Kojedzinszky

Dear Neil,

According to my quick tests, your patch seems to fix this bug. Could you 
also try my attached code? Can you reproduce the bug with it?


Thanks,
Richard



Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-23 Thread Richard Kojedzinszky

Dear Neil,

I've applied your patch, and since then there have been no lockups. 
Before that, my application reported a lockup within a minute or two; 
now it has been running for half an hour and is still going.


Thanks,
Richard



Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-23 Thread Richard Kojedzinszky

Dear Neil,

I've stripped the code down further, and it still triggers the bug for 
me. On Bookworm, to build the binary, simply:


$ sudo apt-get install golang
$ go build .

Then give it an NFS mountpoint, e.g.:

$ ./ds /mnt/nfs
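
(For context, the workload is essentially a few processes racing
create/rename/stat/unlink on shared names in one directory. A rough,
hypothetical C sketch of that kind of loop follows; the real reproducer
is the attached ds.tar.)

/* Hypothetical stand-in for the attached ds program, not the real code:
 * several workers race create/rename/stat/unlink on shared names in one
 * directory on the NFS mount.  Stop with Ctrl-C. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

static void churn(const char *dir, int id)
{
        char tmp[4096], dst[4096];
        struct stat st;

        snprintf(tmp, sizeof(tmp), "%s/f%d.tmp", dir, id);
        snprintf(dst, sizeof(dst), "%s/f%d", dir, id);
        for (;;) {
                int fd = open(tmp, O_CREAT | O_WRONLY, 0644);
                if (fd >= 0)
                        close(fd);
                rename(tmp, dst);       /* rename onto a possibly-live name */
                stat(dst, &st);         /* forces a lookup/revalidate */
                unlink(dst);            /* races with the revalidation */
        }
}

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <nfs-mount>\n", argv[0]);
                return 1;
        }
        for (int i = 0; i < 8; i++)     /* eight workers... */
                if (fork() == 0)
                        churn(argv[1], i % 4);  /* ...sharing four names */
        while (wait(NULL) > 0)
                ;
        return 0;
}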

Meanwhile, I will try your patch too.

Regards,
Richard


ds.tar
Description: Unix tar archive


Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-23 Thread NeilBrown
On Fri, 24 May 2024, Richard Kojedzinszky wrote:
> Dear devs,
> 
> I am attaching a stripped-down version of the little program which 
> triggers the bug very quickly, in a few minutes in my test lab. It 
> turned out that a single NFS mount point is enough. Just start the 
> program with the NFS mount as its first argument. It will chdir there 
> and perform file operations, which will trigger a lockup within a few 
> minutes.

I couldn't get the Go code to run.  But then, it has been a long time
since I played with Go and I didn't try very hard.
If you could provide simple instructions and a list of package
dependencies that I need to install (on Debian), I can give it a try.

Or you could try this patch.  It might help, but I don't have high
hopes.  It adds some memory barriers and fixes a bug which would cause a
problem if memory allocation failed (but memory allocation never fails).

NeilBrown

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ac505671efbd..5bcc0d14d519 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1804,7 +1804,7 @@ __nfs_lookup_revalidate(struct dentry *dentry, unsigned int flags,
 	} else {
 		/* Wait for unlink to complete */
 		wait_var_event(&dentry->d_fsdata,
-			       dentry->d_fsdata != NFS_FSDATA_BLOCKED);
+			       smp_load_acquire(&dentry->d_fsdata) != NFS_FSDATA_BLOCKED);
 		parent = dget_parent(dentry);
 		ret = reval(d_inode(parent), dentry, flags);
 		dput(parent);
@@ -2508,7 +2508,7 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 	error = nfs_safe_remove(dentry);
 	nfs_dentry_remove_handle_error(dir, dentry, error);
-	dentry->d_fsdata = NULL;
+	smp_store_release(&dentry->d_fsdata, NULL);
 	wake_up_var(&dentry->d_fsdata);
 out:
 	trace_nfs_unlink_exit(dir, dentry, error);
@@ -2616,7 +2616,7 @@ nfs_unblock_rename(struct rpc_task *task, struct nfs_renamedata *data)
 {
 	struct dentry *new_dentry = data->new_dentry;
 
-	new_dentry->d_fsdata = NULL;
+	smp_store_release(&new_dentry->d_fsdata, NULL);
 	wake_up_var(&new_dentry->d_fsdata);
 }
 
@@ -2717,6 +2717,10 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
 	task = nfs_async_rename(old_dir, new_dir, old_dentry, new_dentry,
 				must_unblock ? nfs_unblock_rename : NULL);
 	if (IS_ERR(task)) {
+		if (must_unblock) {
+			smp_store_release(&new_dentry->d_fsdata, NULL);
+			wake_up_var(&new_dentry->d_fsdata);
+		}
 		error = PTR_ERR(task);
 		goto out;
 	}
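
To make the error-path fix concrete, here is a hypothetical userspace
model of the d_fsdata protocol, with pthread primitives standing in for
wait_var_event()/wake_up_var().  It is not kernel code; it only shows
why the marker must still be cleared when the async rename cannot be
started:

/* Hypothetical model: nfs_rename() parks the dentry by setting d_fsdata
 * to BLOCKED, and revalidation waits until it is cleared.  If starting
 * the async rename fails, the marker must still be cleared and waiters
 * woken, otherwise the revalidating task sleeps forever. */
#include <pthread.h>
#include <stdio.h>

#define BLOCKED ((void *)1)

static void *d_fsdata;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static int nfs_async_rename_fails(void) { return 1; /* simulate IS_ERR(task) */ }

static void *lookup_revalidate(void *arg)
{
        pthread_mutex_lock(&lock);
        while (d_fsdata == BLOCKED)     /* wait_var_event() analogue */
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        puts("revalidate proceeds");
        return NULL;
}

int main(void)
{
        pthread_t t;

        d_fsdata = BLOCKED;             /* nfs_rename() blocks the dentry */
        pthread_create(&t, NULL, lookup_revalidate, NULL);

        if (nfs_async_rename_fails()) {
                /* the added error path: clear the marker and wake waiters */
                pthread_mutex_lock(&lock);
                d_fsdata = NULL;
                pthread_cond_broadcast(&cond);  /* wake_up_var() analogue */
                pthread_mutex_unlock(&lock);
        }
        pthread_join(t, NULL);
        return 0;
}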



Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-23 Thread Richard Kojedzinszky

Dear devs,

I am attaching a stripped-down version of the little program which 
triggers the bug very quickly, in a few minutes in my test lab. It 
turned out that a single NFS mount point is enough. Just start the 
program with the NFS mount as its first argument. It will chdir there 
and perform file operations, which will trigger a lockup within a few 
minutes.


Please take a look at it.

Thanks in advance,
Richard


ds.tar
Description: Unix tar archive


Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-23 Thread Richard Kojedzinszky

Dear devs,

Bisecting has now shown that 3c59366c207e4c6c6569524af606baf017a55c61 
is the bad commit for me. Strangely, it only affects my dovecot process 
accessing data over NFS.


Can you please confirm that this could be the bad commit?

My earlier attached programs can be used to demonstrate/trigger the 
issue. They could even be stripped down to the minimal set of operations 
that triggers the bug.


Thanks in advance,
Richard






Bug#1071501: Linux NFS client hangs in nfs4_lookup_revalidate

2024-05-23 Thread Richard Kojedzinszky

Dear NFS developers,

I am running multiple pods on a Kubernetes node; they all mount 
different NFS shares from the same NFS server. I started to notice 
hangups in my dovecot process after I switched from an upstream 5.15 
kernel to Debian's kernel. You can find the Debian bug report at 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501.


So, effectively, I am running dovecot in Kubernetes, and dovecot's data 
directory is accessed over NFS. Eventually one dovecot process gets 
stuck in nfs4_lookup_revalidate(). From that point, that process cannot 
be killed; however, other processes can access NFS as normal. Another 
dovecot process running on the very same node and accessing the same 
NFS share keeps working too.


I am still in the process of bisecting; however, I cannot reliably 
trigger the bug. Originally it took a few days before I noticed a 
hanging process. Now I am trying to mimic the file operations dovecot 
does in a faster way, and that seems to trigger the bug in a few hours; 
however, during bisects I can still make mistakes.


I've scheduled many of my applications which use NFS shares onto the 
same node, to put more NFS load on that node.


I am attaching my simple app, which triggers the bug in a few hours, at 
least in my lab. I have two dedicated NFS shares for this test case, and 
I am running 3 instances of the application against both shares. I am 
also running other production applications on the same node which use 
NFS; however, I don't experience lockups with them. They are librenms, 
prometheus, and a private Docker registry. So I don't know whether 
running the attached app alone is enough to trigger the bug.


Once I have a suspect commit from my bisecting process, I will report 
it here.


My NFS server is a TrueNAS, based on FreeBSD 13.3.

Thanks in advance,
Richard

ds.tar
Description: Unix tar archive