Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Vivek Goyal
On Fri, Jan 05, 2024 at 08:57:55PM +, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote:
> > On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
> > > On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> > > > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > > > > From: Hou Tao 
> > > > > 
> > > > > When invoking virtio_fs_enqueue_req() through kworker, both the
> > > > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > > > > Considering the size of both the sg array and the bounce buffer may be
> > > > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower 
> > > > > the
> > > > > possibility of memory allocation failure.
> > > > > 
> > > > 
> > > > What's the practical benefit of this patch? It looks like if memory
> > > > allocation fails, we keep retrying at an interval of 1ms and don't
> > > > return an error to user space.
> > > 
> > > You don't deplete the atomic reserves unnecessarily?
> > 
> > Sounds reasonable. 
> > 
> > With GFP_NOFS specified, can we still get -ENOMEM? Or will this block
> > indefinitely until memory can be allocated?
> 
> If you need the "loop indefinitely" behaviour, that's
> GFP_NOFS | __GFP_NOFAIL.  If you're actually doing something yourself
> which can free up memory, this is a bad choice.  If you're just sleeping
> and retrying, you might as well have the MM do that for you.

I probably don't want to wait indefinitely. There might be some cases
where I would want to return an error to user space. For example, if
the virtiofs device has been hot-unplugged, then there is no point in
waiting indefinitely for memory allocation. Even if memory were allocated,
we would soon return an error to user space with -ENOTCONN.

We are currently not doing that check after a memory allocation failure,
but we probably could, as an optimization.
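For reference, a rough sketch of what that optimization might look like in the
kworker dispatch path (the connected check and the error path here are an
illustrative sketch, not the actual virtio_fs code):

    ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
    if (ret == -ENOMEM || ret == -ENOSPC) {
            spin_lock(&fsvq->lock);
            if (!fsvq->connected) {
                    /* Device is gone: fail the request instead of retrying */
                    spin_unlock(&fsvq->lock);
                    req->out.h.error = -ENOTCONN;
                    fuse_request_end(req);
                    return;
            }
            /* Otherwise requeue and let the worker retry in 1ms, as today */
            list_add_tail(&req->list, &fsvq->queued_reqs);
            schedule_delayed_work(&fsvq->dispatch_work, msecs_to_jiffies(1));
            spin_unlock(&fsvq->lock);
            return;
    }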

So this patch looks good to me as it is. Thanks Hou Tao.

Reviewed-by: Vivek Goyal 

Thanks
Vivek




Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Vivek Goyal
On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > > From: Hou Tao 
> > > 
> > > When invoking virtio_fs_enqueue_req() through kworker, both the
> > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > > Considering the size of both the sg array and the bounce buffer may be
> > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > > possibility of memory allocation failure.
> > > 
> > 
> > What's the practical benefit of this patch? It looks like if memory
> > allocation fails, we keep retrying at an interval of 1ms and don't
> > return an error to user space.
> 
> You don't deplete the atomic reserves unnecessarily?

Sounds reasonable. 

With GFP_NOFS specified, can we still get -ENOMEM? Or will this block
indefinitely until memory can be allocated?

I am trying to figure out whether, with GFP_NOFS, we still need to check
for -ENOMEM, requeue the req, and ask the worker thread to retry after
1ms.
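For context, this is roughly the requeue-and-retry path in question,
paraphrased from virtio_fs_request_dispatch_work() (a sketch; exact details
differ across kernel versions):

    ret = virtio_fs_enqueue_req(fsvq, req, true);
    if (ret < 0) {
            if (ret == -ENOMEM || ret == -ENOSPC) {
                    /* Requeue the request and let the worker retry in 1ms */
                    spin_lock(&fsvq->lock);
                    list_add_tail(&req->list, &fsvq->queued_reqs);
                    schedule_delayed_work(&fsvq->dispatch_work,
                                          msecs_to_jiffies(1));
                    spin_unlock(&fsvq->lock);
                    return;
            }
            /* Other errors are completed back to the originator */
            req->out.h.error = ret;
    }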

Thanks
Vivek




Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Vivek Goyal
On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> From: Hou Tao 
> 
> When invoking virtio_fs_enqueue_req() through kworker, both the
> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> Considering the size of both the sg array and the bounce buffer may be
> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> possibility of memory allocation failure.
> 

What's the practical benefit of this patch? It looks like if memory
allocation fails, we keep retrying at an interval of 1ms and don't
return an error to user space.

Thanks
Vivek

> Signed-off-by: Hou Tao 
> ---
> Change log:
> v2:
>   * pass gfp_t instead of bool to virtio_fs_enqueue_req() (Suggested by 
> Matthew)
> 
> v1: 
> https://lore.kernel.org/linux-fsdevel/20240104015805.2103766-1-hou...@huaweicloud.com
> 
>  fs/fuse/virtio_fs.c | 20 +++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 3aac31d451985..8cf518624ce9e 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -87,7 +87,8 @@ struct virtio_fs_req_work {
>  };
>  
>  static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> -  struct fuse_req *req, bool in_flight);
> +  struct fuse_req *req, bool in_flight,
> +  gfp_t gfp);
>  
>  static const struct constant_table dax_param_enums[] = {
>   {"always",  FUSE_DAX_ALWAYS },
> @@ -383,7 +384,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>   list_del_init(&req->list);
>   spin_unlock(&fsvq->lock);
>  
> - ret = virtio_fs_enqueue_req(fsvq, req, true);
> + ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
>   if (ret < 0) {
>   if (ret == -ENOMEM || ret == -ENOSPC) {
>   spin_lock(&fsvq->lock);
> @@ -488,7 +489,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
>  }
>  
>  /* Allocate and copy args into req->argbuf */
> -static int copy_args_to_argbuf(struct fuse_req *req)
> +static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
>  {
>   struct fuse_args *args = req->args;
>   unsigned int offset = 0;
> @@ -502,7 +503,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
>   len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
> fuse_len_args(num_out, args->out_args);
>  
> - req->argbuf = kmalloc(len, GFP_ATOMIC);
> + req->argbuf = kmalloc(len, gfp);
>   if (!req->argbuf)
>   return -ENOMEM;
>  
> @@ -1119,7 +1120,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg,
>  
>  /* Add a request to a virtqueue and kick the device */
>  static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> -  struct fuse_req *req, bool in_flight)
> +  struct fuse_req *req, bool in_flight,
> +  gfp_t gfp)
>  {
>   /* requests need at least 4 elements */
>   struct scatterlist *stack_sgs[6];
> @@ -1140,8 +1142,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>   /* Does the sglist fit on the stack? */
>   total_sgs = sg_count_fuse_req(req);
>   if (total_sgs > ARRAY_SIZE(stack_sgs)) {
> - sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
> - sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
> + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
> + sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
>   if (!sgs || !sg) {
>   ret = -ENOMEM;
>   goto out;
> @@ -1149,7 +1151,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>   }
>  
>   /* Use a bounce buffer since stack args cannot be mapped */
> - ret = copy_args_to_argbuf(req);
> + ret = copy_args_to_argbuf(req, gfp);
>   if (ret < 0)
>   goto out;
>  
> @@ -1245,7 +1247,7 @@ __releases(fiq->lock)
>fuse_len_args(req->args->out_numargs, req->args->out_args));
>  
>   fsvq = &fs->vqs[queue_id];
> - ret = virtio_fs_enqueue_req(fsvq, req, false);
> + ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
>   if (ret < 0) {
>   if (ret == -ENOMEM || ret == -ENOSPC) {
>   /*
> -- 
> 2.29.2
> 




Re: [PATCH 1/2] virtiofs: Improve three size determinations

2024-01-02 Thread Vivek Goyal
On Fri, Dec 29, 2023 at 09:36:36AM +0100, Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 29 Dec 2023 08:42:04 +0100
> 
> Replace the specification of data structures by pointer dereferences
> as the parameter for the operator “sizeof” to make the corresponding size
> determination a bit safer according to the Linux coding style convention.

I had a look at coding-style.rst and it does say that dereferencing the
pointer is the preferred form. The primary argument seems to be that somebody
might change the pointer variable's type but not the corresponding type
passed to sizeof().
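A minimal illustration of that argument (a generic sketch, not taken from the
patch): the dereference form keeps following the variable if its type ever
changes, while the explicit type has to be updated by hand:

    struct fuse_conn *fc;

    fc = kzalloc(sizeof(*fc), GFP_KERNEL);              /* tracks fc's type */
    fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL); /* silently wrong size
                                                           if fc's type changes */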

There is some value to the argument. I don't feel strongly about it.

Miklos, if you like this change, feel free to apply. 

Thanks
Vivek
  
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---
>  fs/fuse/virtio_fs.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 5f1be1da92ce..2f8ba9254c1e 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1435,11 +1435,11 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>   goto out_err;
> 
>   err = -ENOMEM;
> - fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL);
> + fc = kzalloc(sizeof(*fc), GFP_KERNEL);
>   if (!fc)
>   goto out_err;
> 
> - fm = kzalloc(sizeof(struct fuse_mount), GFP_KERNEL);
> + fm = kzalloc(sizeof(*fm), GFP_KERNEL);
>   if (!fm)
>   goto out_err;
> 
> @@ -1495,7 +1495,7 @@ static int virtio_fs_init_fs_context(struct fs_context *fsc)
>   if (fsc->purpose == FS_CONTEXT_FOR_SUBMOUNT)
>   return fuse_init_fs_context_submount(fsc);
> 
> - ctx = kzalloc(sizeof(struct fuse_fs_context), GFP_KERNEL);
> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>   if (!ctx)
>   return -ENOMEM;
>   fsc->fs_private = ctx;
> --
> 2.43.0
> 
> 




Re: [PATCH 2/2] virtiofs: Improve error handling in virtio_fs_get_tree()

2024-01-02 Thread Vivek Goyal
On Fri, Dec 29, 2023 at 09:38:47AM +0100, Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 29 Dec 2023 09:15:07 +0100
> 
> The kfree() function was called in two cases by
> the virtio_fs_get_tree() function during error handling
> even if the passed variable contained a null pointer.
> This issue was detected by using the Coccinelle software.
> 
> * Thus use another label.
> 
> * Move an error code assignment into an if branch.
> 
> * Delete an initialisation (for the variable “fc”)
>   which became unnecessary with this refactoring.
> 
> Signed-off-by: Markus Elfring 

As Matthew said, kfree(NULL) is perfectly acceptable usage in the kernel,
so I really don't feel that this patch is required. The current code looks
good as it is.
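For reference, kfree(NULL) is defined to be a no-op, which is what makes the
single out_err label in virtio_fs_get_tree() safe; a simplified sketch of that
pattern (mirroring the code quoted below, not a complete function):

    struct fuse_conn *fc = NULL;

    if (WARN_ON(virtqueue_size <= FUSE_HEADER_OVERHEAD))
            goto out_err;                  /* fc not allocated yet */

    fc = kzalloc(sizeof(*fc), GFP_KERNEL);
    if (!fc)
            goto out_err;
    ...
    out_err:
            kfree(fc);                     /* a no-op when fc is NULL */
            return err;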

Thanks
Vivek

> ---
>  fs/fuse/virtio_fs.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 2f8ba9254c1e..0746f54ec743 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1415,10 +1415,10 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>  {
>   struct virtio_fs *fs;
>   struct super_block *sb;
> - struct fuse_conn *fc = NULL;
> + struct fuse_conn *fc;
>   struct fuse_mount *fm;
>   unsigned int virtqueue_size;
> - int err = -EIO;
> + int err;
> 
>   /* This gets a reference on virtio_fs object. This ptr gets installed
>* in fc->iq->priv. Once fuse_conn is going away, it calls ->put()
> @@ -1431,13 +1431,15 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>   }
> 
>   virtqueue_size = virtqueue_get_vring_size(fs->vqs[VQ_REQUEST].vq);
> - if (WARN_ON(virtqueue_size <= FUSE_HEADER_OVERHEAD))
> - goto out_err;
> + if (WARN_ON(virtqueue_size <= FUSE_HEADER_OVERHEAD)) {
> + err = -EIO;
> + goto lock_mutex;
> + }
> 
>   err = -ENOMEM;
>   fc = kzalloc(sizeof(*fc), GFP_KERNEL);
>   if (!fc)
> - goto out_err;
> + goto lock_mutex;
> 
>   fm = kzalloc(sizeof(*fm), GFP_KERNEL);
>   if (!fm)
> @@ -1476,6 +1478,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
> 
>  out_err:
>   kfree(fc);
> +lock_mutex:
>   mutex_lock(&virtio_fs_mutex);
>   virtio_fs_put(fs);
>   mutex_unlock(&virtio_fs_mutex);
> --
> 2.43.0
> 




Re: [Virtio-fs] [PATCH] virtiofs: propagate sync() to file server

2021-04-20 Thread Vivek Goyal
On Mon, Apr 19, 2021 at 05:08:48PM +0200, Greg Kurz wrote:
> Even if POSIX doesn't mandate it, linux users legitimately expect
> sync() to flush all data and metadata to physical storage when it
> is located on the same system. This isn't happening with virtiofs
> though : sync() inside the guest returns right away even though
> data still needs to be flushed from the host page cache.
> 
> This is easily demonstrated by doing the following in the guest:
> 
> $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync
> 5120+0 records in
> 5120+0 records out
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.4 s, 1.0 GB/s
> sync()  = 0 <0.024068>
> +++ exited with 0 +++
> 
> and start the following in the host when the 'dd' command completes
> in the guest:
> 
> $ strace -T -e fsync sync virtiofs/foo
   
That "sync" is not /usr/bin/sync and its your own binary to call fsync()?

> fsync(3)= 0 <10.371640>
> +++ exited with 0 +++
> 
> There are no good reasons not to honor the expected behavior of
> sync() actually : it gives an unrealistic impression that virtiofs
> is super fast and that data has safely landed on HW, which isn't
> the case obviously.
> 
> Implement a ->sync_fs() superblock operation that sends a new
> FUSE_SYNC request type for this purpose. The FUSE_SYNC request
> conveys the 'wait' argument of ->sync_fs() in case the file
> server has a use for it. Like with FUSE_FSYNC and FUSE_FSYNCDIR,
> lack of support for FUSE_SYNC in the file server is treated as
> permanent success.
> 
> Note that such an operation allows the file server to DoS sync().
> Since a typical FUSE file server is an untrusted piece of software
> running in userspace, this is disabled by default.  Only enable it
> with virtiofs for now since virtiofsd is supposedly trusted by the
> guest kernel.
> 
> Reported-by: Robert Krawitz 
> Signed-off-by: Greg Kurz 
> ---
> 
> Can be tested using the following custom QEMU with FUSE_SYNCFS support:
> 
> https://gitlab.com/gkurz/qemu/-/tree/fuse-sync
> 
> ---
>  fs/fuse/fuse_i.h  |  3 +++
>  fs/fuse/inode.c   | 29 +
>  fs/fuse/virtio_fs.c   |  1 +
>  include/uapi/linux/fuse.h | 11 ++-
>  4 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 63d97a15ffde..68e9ae96cbd4 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -755,6 +755,9 @@ struct fuse_conn {
>   /* Auto-mount submounts announced by the server */
>   unsigned int auto_submounts:1;
>  
> + /* Propagate syncfs() to server */
> + unsigned int sync_fs:1;
> +
>   /** The number of requests waiting for completion */
>   atomic_t num_waiting;
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b0e18b470e91..425d567a06c5 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -506,6 +506,34 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
>   return err;
>  }
>  
> +static int fuse_sync_fs(struct super_block *sb, int wait)
> +{
> + struct fuse_mount *fm = get_fuse_mount_super(sb);
> + struct fuse_conn *fc = fm->fc;
> + struct fuse_syncfs_in inarg;
> + FUSE_ARGS(args);
> + int err;
> +
> + if (!fc->sync_fs)
> + return 0;
> +
> + memset(&inarg, 0, sizeof(inarg));
> + inarg.wait = wait;
> + args.in_numargs = 1;
> + args.in_args[0].size = sizeof(inarg);
> + args.in_args[0].value = &inarg;
> + args.opcode = FUSE_SYNCFS;
> + args.out_numargs = 0;
> +
> + err = fuse_simple_request(fm, &args);
> + if (err == -ENOSYS) {
> + fc->sync_fs = 0;
> + err = 0;
> + }

I was wondering what will happen if an older file server does not support
FUSE_SYNCFS. So we will get -ENOSYS and future syncfs commands will not
be sent.

> +
> + return err;

Right now we don't propagate this error code all the way to user space.
I think I should post my patch to fix it again.

https://lore.kernel.org/linux-fsdevel/20201221195055.35295-2-vgo...@redhat.com/

> +}
> +
>  enum {
>   OPT_SOURCE,
>   OPT_SUBTYPE,
> @@ -909,6 +937,7 @@ static const struct super_operations fuse_super_operations = {
>   .put_super  = fuse_put_super,
>   .umount_begin   = fuse_umount_begin,
>   .statfs = fuse_statfs,
> + .sync_fs= fuse_sync_fs,
>   .show_options   = fuse_show_options,
>  };
>  
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 4ee6f734ba83..a3c025308743 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1441,6 +1441,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>   fc->release = fuse_free_conn;
>   fc->delete_stale = true;
>   fc->auto_submounts = true;
> + fc->sync_fs = true;
>  
>   fsc->s_fs_info = fm;
>   sb = sget_fc(fsc, virtio_fs_test_super, set_anon_super_fc);
> diff --git 

Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-20 Thread Vivek Goyal
On Tue, Apr 20, 2021 at 09:34:20AM +0200, Greg Kurz wrote:
> On Mon, 19 Apr 2021 17:36:35 -0400
> Vivek Goyal  wrote:
> 
> > As of now put_unlocked_entry() always wakes up next waiter. In next
> > patches we want to wake up all waiters at one callsite. Hence, add a
> > parameter to the function.
> > 
> > This patch does not introduce any change of behavior.
> > 
> > Suggested-by: Dan Williams 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c | 13 +++--
> >  1 file changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 00978d0838b1..f19d76a6a493 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
> > finish_wait(wq, &ewait.wait);
> >  }
> >  
> > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > +static void put_unlocked_entry(struct xa_state *xas, void *entry,
> > +  enum dax_entry_wake_mode mode)
> >  {
> > /* If we were the only waiter woken, wake the next one */
> 
> With this change, the comment is no longer accurate since the
> function can now wake all waiters if passed mode == WAKE_ALL.
> Also, it paraphrases the code which is simple enough, so I'd
> simply drop it.
> 
> This is minor though and it shouldn't prevent this fix to go
> forward.
> 
> Reviewed-by: Greg Kurz 

Ok, here is the updated patch which drops that comment line.

Vivek

Subject: dax: Add a wakeup mode parameter to put_unlocked_entry()

As of now put_unlocked_entry() always wakes up next waiter. In next
patches we want to wake up all waiters at one callsite. Hence, add a
parameter to the function.

This patch does not introduce any change of behavior.

Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2021-04-20 09:55:45.105069893 -0400
+++ redhat-linux/fs/dax.c   2021-04-20 09:56:27.685822730 -0400
@@ -275,11 +275,11 @@ static void wait_entry_unlocked(struct x
finish_wait(wq, &ewait.wait);
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+  enum dax_entry_wake_mode mode)
 {
-   /* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, WAKE_NEXT);
+   dax_wake_entry(xas, entry, mode);
 }
 
 /*
@@ -633,7 +633,7 @@ struct page *dax_layout_busy_page_range(
entry = get_unlocked_entry(&xas, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, WAKE_NEXT);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -675,7 +675,7 @@ static int __dax_invalidate_entry(struct
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, WAKE_NEXT);
xas_unlock_irq(&xas);
return ret;
 }
@@ -954,7 +954,7 @@ static int dax_writeback_one(struct xa_s
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, WAKE_NEXT);
return ret;
 }
 
@@ -1695,7 +1695,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, WAKE_NEXT);
xas_unlock_irq(&xas);
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);



[PATCH v3 0/3] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-19 Thread Vivek Goyal
Hi,

This is V3 of patches. Posted V2 here.

https://lore.kernel.org/linux-fsdevel/20210419184516.gc1472...@redhat.com/

Changes since v2:

- Broke down patch in to a patch series (Dan)
- Added an enum to communicate wake mode (Dan)

Thanks
Vivek

Vivek Goyal (3):
  dax: Add an enum for specifying dax wakup mode
  dax: Add a wakeup mode parameter to put_unlocked_entry()
  dax: Wake up all waiters after invalidating dax entry

 fs/dax.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

-- 
2.25.4



[PATCH v3 1/3] dax: Add an enum for specifying dax wakup mode

2021-04-19 Thread Vivek Goyal
Dan mentioned that he is not very fond of passing around a boolean true/false
to specify whether only the next waiter should be woken up or all waiters
should be woken up. He instead prefers that we introduce an enum and make it
very explicit at the callsite itself. It makes for easier-to-read code.
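For illustration, the difference at a callsite (both forms appear in the
patch below):

    dax_wake_entry(xas, entry, true);       /* what does "true" mean here? */
    dax_wake_entry(xas, entry, WAKE_ALL);   /* self-describing */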

This patch should not introduce any change of behavior.

Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fdc6775..00978d0838b1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
 };
 
+/**
+ * enum dax_entry_wake_mode: waitqueue wakeup toggle
+ * @WAKE_NEXT: entry was not mutated
+ * @WAKE_ALL: entry was invalidated, or resized
+ */
+enum dax_entry_wake_mode {
+   WAKE_NEXT,
+   WAKE_ALL,
+};
+
 static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
void *entry, struct exceptional_entry_key *key)
 {
@@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
  * The important information it's conveying is whether the entry at
  * this index used to be a PMD entry.
  */
-static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+  enum dax_entry_wake_mode mode)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
@@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
 {
struct exceptional_entry_key key;
wait_queue_head_t *wq;
-   __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
+   __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
 }
 
 /*
@@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void *entry)
old = xas_store(xas, entry);
xas_unlock_irq(xas);
BUG_ON(!dax_is_locked(old));
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 }
 
 /*
@@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL);   /* undo the PMD join */
-   dax_wake_entry(xas, entry, true);
+   dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrexceptional--;
entry = NULL;
xas_set(xas, index);
@@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
xas_lock_irq(xas);
xas_store(xas, entry);
xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, WAKE_NEXT);
 
trace_dax_writeback_one(mapping->host, index, count);
return ret;
-- 
2.25.4



[PATCH v3 3/3] dax: Wake up all waiters after invalidating dax entry

2021-04-19 Thread Vivek Goyal
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
the problem consistently.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake them up.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
    invalidate_exceptional_entry2()
      dax_invalidate_mapping_entry_sync()
        __dax_invalidate_entry() {
                xas_lock_irq(&xas);
                entry = get_unlocked_entry(&xas, 0);
                ...
                ...
                dax_disassociate_entry(entry, mapping, trunc);
                xas_store(&xas, NULL);
                ...
                ...
                put_unlocked_entry(&xas, entry);
                xas_unlock_irq(&xas);
        }

Say a fault is in progress and it has locked the entry at offset, say, "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
the dax entry is locked, all three instances A, B, C will wait in the wait queue.

When the dax fault finishes, say A is woken up. It will store a NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up the next waiter, given
the current code. And that means C continues to wait and is not woken
up.

This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.

Reported-by: Sergio Lopez 
Fixes: ac401cc78242 ("dax: New fault locking")
Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index f19d76a6a493..cc497519be83 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -676,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(&xas, entry, WAKE_NEXT);
+   put_unlocked_entry(&xas, entry, WAKE_ALL);
xas_unlock_irq(&xas);
return ret;
 }
-- 
2.25.4



[PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-19 Thread Vivek Goyal
As of now put_unlocked_entry() always wakes up next waiter. In next
patches we want to wake up all waiters at one callsite. Hence, add a
parameter to the function.

This patch does not introduce any change of behavior.

Suggested-by: Dan Williams 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 00978d0838b1..f19d76a6a493 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
finish_wait(wq, &ewait.wait);
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+  enum dax_entry_wake_mode mode)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, WAKE_NEXT);
+   dax_wake_entry(xas, entry, mode);
 }
 
 /*
@@ -633,7 +634,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
entry = get_unlocked_entry(&xas, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, WAKE_NEXT);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -675,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, WAKE_NEXT);
xas_unlock_irq(&xas);
return ret;
 }
@@ -954,7 +955,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, WAKE_NEXT);
return ret;
 }
 
@@ -1695,7 +1696,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, WAKE_NEXT);
xas_unlock_irq(&xas);
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);
-- 
2.25.4



Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation

2021-04-19 Thread Vivek Goyal
On Mon, Apr 19, 2021 at 04:39:47PM -0400, Vivek Goyal wrote:
> On Mon, Apr 19, 2021 at 12:48:58PM -0700, Dan Williams wrote:
> > On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal  wrote:
> > >
> > > This is V2 of the patch. Posted V1 here.
> > >
> > > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/
> > >
> > > Based on feedback from Dan and Jan, modified the patch to wake up
> > > all waiters when dax entry is invalidated. This solves the issues
> > > of missed wakeups.
> > 
> > Care to send a formal patch with this commentary moved below the --- line?
> > 
> > One style fixup below...
> > 
> > >
> > > I am seeing missed wakeups which ultimately lead to a deadlock when I am
> > > using virtiofs with DAX enabled and running "make -j". I had to mount
> > > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce
> > > the problem consistently.
> > >
> > > So here is the problem. put_unlocked_entry() wakes up waiters only
> > > if entry is not null as well as !dax_is_conflict(entry). But if I
> > > call multiple instances of invalidate_inode_pages2() in parallel,
> > > then I can run into a situation where there are waiters on
> > > this index but nobody will wait these.
> > >
> > > invalidate_inode_pages2()
> > >   invalidate_inode_pages2_range()
> > > invalidate_exceptional_entry2()
> > >   dax_invalidate_mapping_entry_sync()
> > > __dax_invalidate_entry() {
> > > xas_lock_irq(&xas);
> > > entry = get_unlocked_entry(&xas, 0);
> > > ...
> > > ...
> > > dax_disassociate_entry(entry, mapping, trunc);
> > > xas_store(&xas, NULL);
> > > ...
> > > ...
> > > put_unlocked_entry(&xas, entry);
> > > xas_unlock_irq(&xas);
> > > }
> > >
> > > Say a fault in in progress and it has locked entry at offset say "0x1c".
> > > Now say three instances of invalidate_inode_pages2() are in progress
> > > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> > > dax entry is locked, all tree instances A, B, C will wait in wait queue.
> > >
> > > When dax fault finishes, say A is woken up. It will store NULL entry
> > > at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> > > at page offset 0x1c and it will call put_unlocked_entry(, 0). And
> > > this means put_unlocked_entry() will not wake up next waiter, given
> > > the current code. And that means C continues to wait and is not woken
> > > up.
> > >
> > > This patch fixes the issue by waking up all waiters when a dax entry
> > > has been invalidated. This seems to fix the deadlock I am facing
> > > and I can make forward progress.
> > >
> > > Reported-by: Sergio Lopez 
> > > Signed-off-by: Vivek Goyal 
> > > ---
> > >  fs/dax.c |   12 ++--
> > >  1 file changed, 6 insertions(+), 6 deletions(-)
> > >
> > > Index: redhat-linux/fs/dax.c
> > > ===
> > > --- redhat-linux.orig/fs/dax.c  2021-04-16 14:16:44.332140543 -0400
> > > +++ redhat-linux/fs/dax.c   2021-04-19 11:24:11.465213474 -0400
> > > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x
> > > finish_wait(wq, &ewait.wait);
> > >  }
> > >
> > > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool 
> > > wake_all)
> > >  {
> > > /* If we were the only waiter woken, wake the next one */
> > > if (entry && !dax_is_conflict(entry))
> > > -   dax_wake_entry(xas, entry, false);
> > > +   dax_wake_entry(xas, entry, wake_all);
> > >  }
> > >
> > >  /*
> > > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range(
> > > entry = get_unlocked_entry(&xas, 0);
> > > if (entry)
> > > page = dax_busy_page(entry);
> > > -   put_unlocked_entry(&xas, entry);
> > > +   put_unlocked_entry(&xas, entry, false);
> > 
> > I'm not a fan of raw true/false arguments be

Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation

2021-04-19 Thread Vivek Goyal
On Mon, Apr 19, 2021 at 12:48:58PM -0700, Dan Williams wrote:
> On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal  wrote:
> >
> > This is V2 of the patch. Posted V1 here.
> >
> > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/
> >
> > Based on feedback from Dan and Jan, modified the patch to wake up
> > all waiters when dax entry is invalidated. This solves the issues
> > of missed wakeups.
> 
> Care to send a formal patch with this commentary moved below the --- line?
> 
> One style fixup below...
> 
> >
> > I am seeing missed wakeups which ultimately lead to a deadlock when I am
> > using virtiofs with DAX enabled and running "make -j". I had to mount
> > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce
> > the problem consistently.
> >
> > So here is the problem. put_unlocked_entry() wakes up waiters only
> > if entry is not null as well as !dax_is_conflict(entry). But if I
> > call multiple instances of invalidate_inode_pages2() in parallel,
> > then I can run into a situation where there are waiters on
> > this index but nobody will wait these.
> >
> > invalidate_inode_pages2()
> >   invalidate_inode_pages2_range()
> > invalidate_exceptional_entry2()
> >   dax_invalidate_mapping_entry_sync()
> > __dax_invalidate_entry() {
> > xas_lock_irq(&xas);
> > entry = get_unlocked_entry(&xas, 0);
> > ...
> > ...
> > dax_disassociate_entry(entry, mapping, trunc);
> > xas_store(&xas, NULL);
> > ...
> > ...
> > put_unlocked_entry(&xas, entry);
> > xas_unlock_irq(&xas);
> > }
> >
> > Say a fault in in progress and it has locked entry at offset say "0x1c".
> > Now say three instances of invalidate_inode_pages2() are in progress
> > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> > dax entry is locked, all tree instances A, B, C will wait in wait queue.
> >
> > When dax fault finishes, say A is woken up. It will store NULL entry
> > at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> > at page offset 0x1c and it will call put_unlocked_entry(, 0). And
> > this means put_unlocked_entry() will not wake up next waiter, given
> > the current code. And that means C continues to wait and is not woken
> > up.
> >
> > This patch fixes the issue by waking up all waiters when a dax entry
> > has been invalidated. This seems to fix the deadlock I am facing
> > and I can make forward progress.
> >
> > Reported-by: Sergio Lopez 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c |   12 ++--
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > Index: redhat-linux/fs/dax.c
> > ===
> > --- redhat-linux.orig/fs/dax.c  2021-04-16 14:16:44.332140543 -0400
> > +++ redhat-linux/fs/dax.c   2021-04-19 11:24:11.465213474 -0400
> > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x
> > finish_wait(wq, &ewait.wait);
> >  }
> >
> > -static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool 
> > wake_all)
> >  {
> > /* If we were the only waiter woken, wake the next one */
> > if (entry && !dax_is_conflict(entry))
> > -   dax_wake_entry(xas, entry, false);
> > +   dax_wake_entry(xas, entry, wake_all);
> >  }
> >
> >  /*
> > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range(
> > entry = get_unlocked_entry(&xas, 0);
> > if (entry)
> > page = dax_busy_page(entry);
> > -   put_unlocked_entry(&xas, entry);
> > +   put_unlocked_entry(&xas, entry, false);
> 
> I'm not a fan of raw true/false arguments because if you read this
> line in isolation you need to go read put_unlocked_entry() to recall
> what that argument means. So lets add something like:
> 
> /**
>  * enum dax_entry_wake_mode: waitqueue wakeup toggle
>  * @WAKE_NEXT: entry was not mutated
>  * @WAKE_ALL: entry was invalidated, or resized
>  */
> enum dax_entry_wake_mode {
> WAKE_NEXT,
> WAKE_ALL,
> }
> 
> ...and use that as the arg for dax_wake_entry(). So I'd expect this to
> be a 3 patch series, introduce dax_entry_wake_mode for
> dax_wake_entry(), introduce the argument for put_unlocked_entry()
> without changing the logic, and finally this bug fix. Feel free to add
> 'Fixes: ac401cc78242 ("dax: New fault locking")' in case you feel this
> needs to be backported.

Hi Dan,

I will make changes as you suggested and post another version.

I am wondering what to do with dax_wake_entry(). It also has a boolean
parameter wake_all. Should that be converted as well to make use of
enum dax_entry_wake_mode?

Thanks
Vivek



[PATCH][v2] dax: Fix missed wakeup during dax entry invalidation

2021-04-19 Thread Vivek Goyal
This is V2 of the patch. Posted V1 here.

https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/

Based on feedback from Dan and Jan, modified the patch to wake up 
all waiters when dax entry is invalidated. This solves the issues
of missed wakeups.

I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 256M to reproduce
the problem consistently.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake them up.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
    invalidate_exceptional_entry2()
      dax_invalidate_mapping_entry_sync()
        __dax_invalidate_entry() {
                xas_lock_irq(&xas);
                entry = get_unlocked_entry(&xas, 0);
                ...
                ...
                dax_disassociate_entry(entry, mapping, trunc);
                xas_store(&xas, NULL);
                ...
                ...
                put_unlocked_entry(&xas, entry);
                xas_unlock_irq(&xas);
        }

Say a fault is in progress and it has locked the entry at offset, say, "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
the dax entry is locked, all three instances A, B, C will wait in the wait queue.

When the dax fault finishes, say A is woken up. It will store a NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up the next waiter, given
the current code. And that means C continues to wait and is not woken
up.

This patch fixes the issue by waking up all waiters when a dax entry
has been invalidated. This seems to fix the deadlock I am facing
and I can make forward progress.

Reported-by: Sergio Lopez 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2021-04-16 14:16:44.332140543 -0400
+++ redhat-linux/fs/dax.c   2021-04-19 11:24:11.465213474 -0400
@@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x
finish_wait(wq, &ewait.wait);
 }
 
-static void put_unlocked_entry(struct xa_state *xas, void *entry)
+static void put_unlocked_entry(struct xa_state *xas, void *entry, bool wake_all)
 {
/* If we were the only waiter woken, wake the next one */
if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   dax_wake_entry(xas, entry, wake_all);
 }
 
 /*
@@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range(
entry = get_unlocked_entry(&xas, 0);
if (entry)
page = dax_busy_page(entry);
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, false);
if (page)
break;
if (++scanned % XA_CHECK_SCHED)
@@ -664,7 +664,7 @@ static int __dax_invalidate_entry(struct
mapping->nrexceptional--;
ret = 1;
 out:
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, true);
xas_unlock_irq(&xas);
return ret;
 }
@@ -943,7 +943,7 @@ static int dax_writeback_one(struct xa_s
return ret;
 
  put_unlocked:
-   put_unlocked_entry(xas, entry);
+   put_unlocked_entry(xas, entry, false);
return ret;
 }
 
@@ -1684,7 +1684,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
-   put_unlocked_entry(&xas, entry);
+   put_unlocked_entry(&xas, entry, false);
xas_unlock_irq(&xas);
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
  VM_FAULT_NOPAGE);



Re: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-16 Thread Vivek Goyal
On Fri, Apr 16, 2021 at 12:56:05PM -0700, Dan Williams wrote:
> On Fri, Apr 16, 2021 at 10:35 AM Vivek Goyal  wrote:
> >
> > I am seeing missed wakeups which ultimately lead to a deadlock when I am
> > using virtiofs with DAX enabled and running "make -j". I had to mount
> > virtiofs as rootfs and also reduce to dax window size to 32M to reproduce
> > the problem consistently.
> >
> > This is not a complete patch. I am just proposing this partial fix to
> > highlight the issue and trying to figure out how it should be fixed.
> > Should it be fixed in generic dax code or should filesystem (fuse/virtiofs)
> > take care of this.
> >
> > So here is the problem. put_unlocked_entry() wakes up waiters only
> > if entry is not null as well as !dax_is_conflict(entry). But if I
> > call multiple instances of invalidate_inode_pages2() in parallel,
> > then I can run into a situation where there are waiters on
> > this index but nobody will wait these.
> >
> > invalidate_inode_pages2()
> >   invalidate_inode_pages2_range()
> > invalidate_exceptional_entry2()
> >   dax_invalidate_mapping_entry_sync()
> > __dax_invalidate_entry() {
> > xas_lock_irq(&xas);
> > entry = get_unlocked_entry(&xas, 0);
> > ...
> > ...
> > dax_disassociate_entry(entry, mapping, trunc);
> > xas_store(&xas, NULL);
> > ...
> > ...
> > put_unlocked_entry(&xas, entry);
> > xas_unlock_irq(&xas);
> > }
> >
> > Say a fault in in progress and it has locked entry at offset say "0x1c".
> > Now say three instances of invalidate_inode_pages2() are in progress
> > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given
> > dax entry is locked, all tree instances A, B, C will wait in wait queue.
> >
> > When dax fault finishes, say A is woken up. It will store NULL entry
> > at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> > at page offset 0x1c and it will call put_unlocked_entry(, 0). And
> > this means put_unlocked_entry() will not wake up next waiter, given
> > the current code. And that means C continues to wait and is not woken
> > up.
> >
> > In my case I am seeing that dax page fault path itself is waiting
> > on grab_mapping_entry() and also invalidate_inode_page2() is
> > waiting in get_unlocked_entry() but entry has already been cleaned
> > up and nobody woke up these processes. Atleast I think that's what
> > is happening.
> >
> > This patch wakes up a process even if entry=0. And deadlock does not
> > happen. I am running into some OOM issues, that will debug.
> >
> > So my question is that is it a dax issue and should it be fixed in
> > dax layer. Or should it be handled in fuse to make sure that
> > multiple instances of invalidate_inode_pages2() on same inode
> > don't make progress in parallel and introduce enough locking
> > around it.
> >
> > Right now fuse_finish_open() calls invalidate_inode_pages2() without
> > any locking. That allows it to make progress in parallel to dax
> > fault path as well as allows multiple instances of invalidate_inode_pages2()
> > to run in parallel.
> >
> > Not-yet-signed-off-by: Vivek Goyal 
> > ---
> >  fs/dax.c |7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > Index: redhat-linux/fs/dax.c
> > ===
> > --- redhat-linux.orig/fs/dax.c  2021-04-16 12:50:40.141363317 -0400
> > +++ redhat-linux/fs/dax.c   2021-04-16 12:51:42.385926390 -0400
> > @@ -266,9 +266,10 @@ static void wait_entry_unlocked(struct x
> >
> >  static void put_unlocked_entry(struct xa_state *xas, void *entry)
> >  {
> > -   /* If we were the only waiter woken, wake the next one */
> > -   if (entry && !dax_is_conflict(entry))
> > -   dax_wake_entry(xas, entry, false);
> > +   if (dax_is_conflict(entry))
> > +   return;
> > +
> > +   dax_wake_entry(xas, entry, false);
> 

Hi Dan,

> How does this work if entry is NULL? dax_entry_waitqueue() will not
> know if it needs to adjust the index.

We could wake waiters both at the current index as well as the PMD-adjusted
index. It feels a little ugly though.
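For context, the index adjustment in question lives in dax_entry_waitqueue(),
roughly like this (paraphrased from fs/dax.c of that time; details may vary):
it needs the entry to decide whether to align the index to the PMD start,
which is exactly what a NULL entry cannot tell it.

    static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
                    void *entry, struct exceptional_entry_key *key)
    {
            unsigned long hash;
            unsigned long index = xas->xa_index;

            /* For a PMD entry, align the index to the start of the PMD so
             * that every offset in its range maps to the same waitqueue. */
            if (dax_is_pmd_entry(entry))
                    index &= ~PG_PMD_COLOUR;
            key->xas = xas;
            key->entry_start = index;

            hash = hash_long((unsigned long)xas->xa + index, DAX_WAIT_TABLE_BITS);
            return wait_table + hash;
    }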

> I think the fix might be to
> specify that put_unlocked_entry() in the invalidate path needs to do a
> wake_up_all().

Doing a wake_up_all() when we invalidate an entry, sounds good. I will give
it a try.

Thanks
Vivek



[PATCH] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-16 Thread Vivek Goyal
I am seeing missed wakeups which ultimately lead to a deadlock when I am
using virtiofs with DAX enabled and running "make -j". I had to mount
virtiofs as rootfs and also reduce the dax window size to 32M to reproduce
the problem consistently.

This is not a complete patch. I am just proposing this partial fix to
highlight the issue and trying to figure out how it should be fixed.
Should it be fixed in generic dax code or should filesystem (fuse/virtiofs)
take care of this.

So here is the problem. put_unlocked_entry() wakes up waiters only
if entry is not null as well as !dax_is_conflict(entry). But if I
call multiple instances of invalidate_inode_pages2() in parallel,
then I can run into a situation where there are waiters on
this index but nobody will wake them up.

invalidate_inode_pages2()
  invalidate_inode_pages2_range()
    invalidate_exceptional_entry2()
      dax_invalidate_mapping_entry_sync()
        __dax_invalidate_entry() {
                xas_lock_irq(&xas);
                entry = get_unlocked_entry(&xas, 0);
                ...
                ...
                dax_disassociate_entry(entry, mapping, trunc);
                xas_store(&xas, NULL);
                ...
                ...
                put_unlocked_entry(&xas, entry);
                xas_unlock_irq(&xas);
        }

Say a fault is in progress and it has locked the entry at offset, say, "0x1c".
Now say three instances of invalidate_inode_pages2() are in progress
(A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
the dax entry is locked, all three instances A, B, C will wait in the wait queue.

When the dax fault finishes, say A is woken up. It will store a NULL entry
at index "0x1c" and wake up B. When B comes along it will find "entry=0"
at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
this means put_unlocked_entry() will not wake up the next waiter, given
the current code. And that means C continues to wait and is not woken
up.

In my case I am seeing that the dax page fault path itself is waiting
on grab_mapping_entry() and also invalidate_inode_pages2() is
waiting in get_unlocked_entry(), but the entry has already been cleaned
up and nobody woke up these processes. At least I think that's what
is happening.

This patch wakes up a process even if entry=0, and the deadlock does not
happen. I am running into some OOM issues that I will debug.

So my question is: is this a dax issue and should it be fixed in the
dax layer? Or should it be handled in fuse, by introducing enough locking
to make sure that multiple instances of invalidate_inode_pages2() on the
same inode don't make progress in parallel?

Right now fuse_finish_open() calls invalidate_inode_pages2() without
any locking. That allows it to make progress in parallel to dax
fault path as well as allows multiple instances of invalidate_inode_pages2()
to run in parallel.

Not-yet-signed-off-by: Vivek Goyal 
---
 fs/dax.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2021-04-16 12:50:40.141363317 -0400
+++ redhat-linux/fs/dax.c   2021-04-16 12:51:42.385926390 -0400
@@ -266,9 +266,10 @@ static void wait_entry_unlocked(struct x
 
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
-   /* If we were the only waiter woken, wake the next one */
-   if (entry && !dax_is_conflict(entry))
-   dax_wake_entry(xas, entry, false);
+   if (dax_is_conflict(entry))
+   return;
+
+   dax_wake_entry(xas, entry, false);
 }
 
 /*



Re: [PATCH v2 0/2] fuse: Fix clearing SGID when access ACL is set

2021-04-14 Thread Vivek Goyal
On Wed, Apr 14, 2021 at 01:57:01PM +0200, Miklos Szeredi wrote:
> On Thu, Mar 25, 2021 at 4:19 PM Vivek Goyal  wrote:
> >
> >
> > Hi,
> >
> > This is V2 of the patchset. Posted V1 here.
> >
> > https://lore.kernel.org/linux-fsdevel/20210319195547.427371-1-vgo...@redhat.com/
> >
> > Changes since V1:
> >
> > - Dropped the helper to determine if SGID should be cleared and open
> >   coded it instead. I will follow up on helper separately in a different
> >   patch series. There are few places already which open code this, so
> >   for now fuse can do the same. Atleast I can make progress on this
> >   and virtiofs can enable ACL support.
> >
> > Luis reported that xfstests generic/375 fails with virtiofs. Little
> > debugging showed that when posix access acl is set that in some
> > cases SGID needs to be cleared and that does not happen with virtiofs.
> >
> > Setting posix access acl can lead to mode change and it can also lead
> > to clear of SGID. fuse relies on file server taking care of all
> > the mode changes. But file server does not have enough information to
> > determine whether SGID should be cleared or not.
> >
> > Hence this patch series add support to send a flag in SETXATTR message
> > to tell server to clear SGID.
> 
> Changed it to have a single extended structure for the request, which
> is how this has always been handled in the fuse API.
> 
> The ABI is unchanged, but you'll need to update the userspace part
> according to the API change.  Otherwise looks good.

Hi Miklos,

Thanks. Patches look good. I will update userspace part and repost.

Vivek

> 
> Applied and pushed to fuse.git#for-next.
> 
> Thanks,
> Miklos
> 



Re: [Virtio-fs] [PATCH v2 0/2] fuse: Fix clearing SGID when access ACL is set

2021-04-13 Thread Vivek Goyal
Hi Miklos,

Ping for this patch series.

Vivek

On Thu, Mar 25, 2021 at 11:18:21AM -0400, Vivek Goyal wrote:
> 
> Hi,
> 
> This is V2 of the patchset. Posted V1 here.
> 
> https://lore.kernel.org/linux-fsdevel/20210319195547.427371-1-vgo...@redhat.com/
> 
> Changes since V1:
> 
> - Dropped the helper to determine if SGID should be cleared and open
>   coded it instead. I will follow up on helper separately in a different
>   patch series. There are few places already which open code this, so
>   for now fuse can do the same. Atleast I can make progress on this
>   and virtiofs can enable ACL support.
> 
> Luis reported that xfstests generic/375 fails with virtiofs. Little
> debugging showed that when posix access acl is set that in some
> cases SGID needs to be cleared and that does not happen with virtiofs.
> 
> Setting posix access acl can lead to mode change and it can also lead
> to clear of SGID. fuse relies on file server taking care of all
> the mode changes. But file server does not have enough information to
> determine whether SGID should be cleared or not.
> 
> Hence this patch series add support to send a flag in SETXATTR message
> to tell server to clear SGID.
> 
> I have staged corresponding virtiofsd patches here.
> 
> https://github.com/rhvgoyal/qemu/commits/acl-sgid-setxattr-flag
> 
> With these patches applied "./check -g acl" passes now on virtiofs.
> 
> Thanks
> Vivek
> 
> Vivek Goyal (2):
>   fuse: Add support for FUSE_SETXATTR_V2
>   fuse: Add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID
> 
>  fs/fuse/acl.c |  8 +++-
>  fs/fuse/fuse_i.h  |  5 -
>  fs/fuse/inode.c   |  4 +++-
>  fs/fuse/xattr.c   | 21 +++--
>  include/uapi/linux/fuse.h | 17 +
>  5 files changed, 46 insertions(+), 9 deletions(-)
> 
> -- 
> 2.25.4
> 
> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs



Re: [PATCH] fuse: Avoid potential use after free

2021-04-07 Thread Vivek Goyal
On Tue, Apr 06, 2021 at 06:53:32PM -0500, Aditya Pakki wrote:
> In virtio_fs_get_tree, after fm is freed, it is again freed in case
> s_root is NULL and virtio_fs_fill_super() returns an error. To avoid
> a double free, set fm to NULL.
> 
> Signed-off-by: Aditya Pakki 
> ---
>  fs/fuse/virtio_fs.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 4ee6f734ba83..a7484c1539bf 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1447,6 +1447,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>   if (fsc->s_fs_info) {
>   fuse_conn_put(fc);
>   kfree(fm);
> + fm = NULL;

I think both code paths are mutually exclusive, and that's why we
don't double-free it.

sget_fc() can either return an existing super block, which is already
initialized, or it can create a new super block, which needs to be
initialized further.

If we get an existing super block, fsc->s_fs_info will
still be set and we need to free fm (as we did not use it). But in
that case the super block is already initialized, so sb->s_root
should be non-NULL and we will not call virtio_fs_fill_super()
on it. And hence we will not get to kfree(fm) again.

The same applies to the fuse_conn_put(fc) call as well.

So I think this patch is not needed. I think sget_fc() semantics are
not obvious and that confuses the reader of the code.
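To spell out the two mutually exclusive paths (a sketch of the logic around
virtio_fs_get_tree() as described above, paraphrased rather than the exact
code):

    sb = sget_fc(fsc, virtio_fs_test_super, set_anon_super_fc);
    if (fsc->s_fs_info) {
            /* An existing, fully initialized sb was reused: sget_fc() did
             * not consume s_fs_info, so fm/fc are unused and freed here. */
            fuse_conn_put(fc);
            kfree(fm);
    }
    if (IS_ERR(sb))
            return PTR_ERR(sb);

    if (!sb->s_root) {
            /* Brand-new sb: ownership of fm moved to the sb, so the
             * kfree(fm) above did not run; only this path can reach the
             * virtio_fs_fill_super() error handling that frees fm. */
            err = virtio_fs_fill_super(sb, fsc);
            ...
    }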

Thanks
Vivek

>   }
>   if (IS_ERR(sb))
>   return PTR_ERR(sb);
> -- 
> 2.25.1
> 



Re: [PATCH v1] ovl: Fix leaked dentry

2021-04-01 Thread Vivek Goyal
On Mon, Mar 29, 2021 at 06:49:07PM +0200, Mickaël Salaün wrote:
> From: Mickaël Salaün 
> 
> Since commit 6815f479ca90 ("ovl: use only uppermetacopy state in
> ovl_lookup()"), overlayfs doesn't put temporary dentry when there is a
> metacopy error, which leads to dentry leaks when shutting down the
> related superblock:
> 

Hi,

Thanks for finding and fixing this bug. The patch looks correct to me. We
need to drop the reference held by "this".

I am not able to trigger this warning on umount of overlayfs. I copied
up a file with metacopy enabled and then remounted overlay again with
metacopy disabled. That does hit this code and I see the warning.

refusing to follow metacopy origin for (/foo.txt)

This should have led to a leak of the dentry pointed to by "this".

But after that I unmounted the overlay and that succeeded. Is there any
additional step needed to trigger this VFS warning?

Vivek

>   overlayfs: refusing to follow metacopy origin for (/file0)
>   ...
>   BUG: Dentry (ptrval){i=3f33,n=file3}  still in use (1) [unmount of 
> overlay overlay]
>   ...
>   WARNING: CPU: 1 PID: 432 at umount_check.cold+0x107/0x14d
>   CPU: 1 PID: 432 Comm: unmount-overlay Not tainted 5.12.0-rc5 #1
>   ...
>   RIP: 0010:umount_check.cold+0x107/0x14d
>   ...
>   Call Trace:
>d_walk+0x28c/0x950
>? dentry_lru_isolate+0x2b0/0x2b0
>? __kasan_slab_free+0x12/0x20
>do_one_tree+0x33/0x60
>shrink_dcache_for_umount+0x78/0x1d0
>generic_shutdown_super+0x70/0x440
>kill_anon_super+0x3e/0x70
>deactivate_locked_super+0xc4/0x160
>deactivate_super+0xfa/0x140
>cleanup_mnt+0x22e/0x370
>__cleanup_mnt+0x1a/0x30
>task_work_run+0x139/0x210
>do_exit+0xb0c/0x2820
>? __kasan_check_read+0x1d/0x30
>? find_held_lock+0x35/0x160
>? lock_release+0x1b6/0x660
>? mm_update_next_owner+0xa20/0xa20
>? reacquire_held_locks+0x3f0/0x3f0
>? __sanitizer_cov_trace_const_cmp4+0x22/0x30
>do_group_exit+0x135/0x380
>__do_sys_exit_group.isra.0+0x20/0x20
>__x64_sys_exit_group+0x3c/0x50
>do_syscall_64+0x45/0x70
>entry_SYSCALL_64_after_hwframe+0x44/0xae
>   ...
>   VFS: Busy inodes after unmount of overlay. Self-destruct in 5 seconds.  
> Have a nice day...
> 
> This fix has been tested with a syzkaller reproducer.
> 
> Cc: Amir Goldstein 
> Cc: Miklos Szeredi 
> Cc: Vivek Goyal 
> Cc:  # v5.7+
> Reported-by: syzbot 
> Fixes: 6815f479ca90 ("ovl: use only uppermetacopy state in ovl_lookup()")
> Signed-off-by: Mickaël Salaün 
> Link: https://lore.kernel.org/r/20210329164907.2133175-1-...@digikod.net
> ---
>  fs/overlayfs/namei.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
> index 3fe05fb5d145..424c594afd79 100644
> --- a/fs/overlayfs/namei.c
> +++ b/fs/overlayfs/namei.c
> @@ -921,6 +921,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct 
> dentry *dentry,
>   if ((uppermetacopy || d.metacopy) && !ofs->config.metacopy) {
>   err = -EPERM;
>   pr_warn_ratelimited("refusing to follow metacopy origin 
> for (%pd2)\n", dentry);
> + dput(this);
>   goto out_put;
>   }
>  
> 
> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> -- 
> 2.30.2
> 



Re: [PATCH v1] ovl: Fix leaked dentry

2021-04-01 Thread Vivek Goyal
On Mon, Mar 29, 2021 at 06:49:07PM +0200, Mickaël Salaün wrote:
> From: Mickaël Salaün 
> 
> Since commit 6815f479ca90 ("ovl: use only uppermetacopy state in
> ovl_lookup()"), overlayfs doesn't put temporary dentry when there is a
> metacopy error, which leads to dentry leaks when shutting down the
> related superblock:
> 
>   overlayfs: refusing to follow metacopy origin for (/file0)
>   ...
>   BUG: Dentry (ptrval){i=3f33,n=file3}  still in use (1) [unmount of 
> overlay overlay]
>   ...
>   WARNING: CPU: 1 PID: 432 at umount_check.cold+0x107/0x14d
>   CPU: 1 PID: 432 Comm: unmount-overlay Not tainted 5.12.0-rc5 #1
>   ...
>   RIP: 0010:umount_check.cold+0x107/0x14d
>   ...
>   Call Trace:
>d_walk+0x28c/0x950
>? dentry_lru_isolate+0x2b0/0x2b0
>? __kasan_slab_free+0x12/0x20
>do_one_tree+0x33/0x60
>shrink_dcache_for_umount+0x78/0x1d0
>generic_shutdown_super+0x70/0x440
>kill_anon_super+0x3e/0x70
>deactivate_locked_super+0xc4/0x160
>deactivate_super+0xfa/0x140
>cleanup_mnt+0x22e/0x370
>__cleanup_mnt+0x1a/0x30
>task_work_run+0x139/0x210
>do_exit+0xb0c/0x2820
>? __kasan_check_read+0x1d/0x30
>? find_held_lock+0x35/0x160
>? lock_release+0x1b6/0x660
>? mm_update_next_owner+0xa20/0xa20
>? reacquire_held_locks+0x3f0/0x3f0
>? __sanitizer_cov_trace_const_cmp4+0x22/0x30
>do_group_exit+0x135/0x380
>__do_sys_exit_group.isra.0+0x20/0x20
>__x64_sys_exit_group+0x3c/0x50
>do_syscall_64+0x45/0x70
>entry_SYSCALL_64_after_hwframe+0x44/0xae
>   ...
>   VFS: Busy inodes after unmount of overlay. Self-destruct in 5 seconds.  
> Have a nice day...
> 
> This fix has been tested with a syzkaller reproducer.
> 

Looks good to me. I realized that the dentry leak happens on the underlying
filesystem, so unmounting the underlying filesystem gives this warning. I
created a nested overlayfs configuration, reproduced this error, and
verified that this patch fixes it.

Reviewed-by: Vivek Goyal 

Vivek

> Cc: Amir Goldstein 
> Cc: Miklos Szeredi 
> Cc: Vivek Goyal 
> Cc:  # v5.7+
> Reported-by: syzbot 
> Fixes: 6815f479ca90 ("ovl: use only uppermetacopy state in ovl_lookup()")
> Signed-off-by: Mickaël Salaün 
> Link: https://lore.kernel.org/r/20210329164907.2133175-1-...@digikod.net
> ---
>  fs/overlayfs/namei.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
> index 3fe05fb5d145..424c594afd79 100644
> --- a/fs/overlayfs/namei.c
> +++ b/fs/overlayfs/namei.c
> @@ -921,6 +921,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct 
> dentry *dentry,
>   if ((uppermetacopy || d.metacopy) && !ofs->config.metacopy) {
>   err = -EPERM;
>   pr_warn_ratelimited("refusing to follow metacopy origin 
> for (%pd2)\n", dentry);
> + dput(this);
>   goto out_put;
>   }
>  
> 
> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> -- 
> 2.30.2
> 



Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2

2021-03-29 Thread Vivek Goyal
On Mon, Mar 29, 2021 at 03:54:03PM +0100, Luis Henriques wrote:
> On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote:
> > Fuse client needs to send additional information to file server when
> > it calls SETXATTR(system.posix_acl_access). Right now there is no extra
> > space in fuse_setxattr_in. So introduce a v2 of the structure which has
> > more space in it and can be used to send extra flags.
> > 
> > "struct fuse_setxattr_in_v2" is only used if file server opts-in for it 
> > using
> > flag FUSE_SETXATTR_V2 during feature negotiations.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/fuse/acl.c |  2 +-
> >  fs/fuse/fuse_i.h  |  5 -
> >  fs/fuse/inode.c   |  4 +++-
> >  fs/fuse/xattr.c   | 21 +++--
> >  include/uapi/linux/fuse.h | 10 ++
> >  5 files changed, 33 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
> > index e9c0f916349d..d31260a139d4 100644
> > --- a/fs/fuse/acl.c
> > +++ b/fs/fuse/acl.c
> > @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, 
> > struct inode *inode,
> > return ret;
> > }
> >  
> > -   ret = fuse_setxattr(inode, name, value, size, 0);
> > +   ret = fuse_setxattr(inode, name, value, size, 0, 0);
> > kfree(value);
> > } else {
> > ret = fuse_removexattr(inode, name);
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 63d97a15ffde..d00bf0b9a38c 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -668,6 +668,9 @@ struct fuse_conn {
> > /** Is setxattr not implemented by fs? */
> > unsigned no_setxattr:1;
> >  
> > +   /** Does file server support setxattr_v2 */
> > +   unsigned setxattr_v2:1;
> > +
> > /** Is getxattr not implemented by fs? */
> > unsigned no_getxattr:1;
> >  
> > @@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool 
> > locked);
> >  bool fuse_lock_inode(struct inode *inode);
> >  
> >  int fuse_setxattr(struct inode *inode, const char *name, const void *value,
> > - size_t size, int flags);
> > + size_t size, int flags, unsigned extra_flags);
> >  ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
> >   size_t size);
> >  ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index b0e18b470e91..1c726df13f80 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount *fm, 
> > struct fuse_args *args,
> > fc->handle_killpriv_v2 = 1;
> > fm->sb->s_flags |= SB_NOSEC;
> > }
> > +   if (arg->flags & FUSE_SETXATTR_V2)
> > +   fc->setxattr_v2 = 1;
> > } else {
> > ra_pages = fc->max_read / PAGE_SIZE;
> > fc->no_lock = 1;
> > @@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
> > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > -   FUSE_HANDLE_KILLPRIV_V2;
> > +   FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2;
> >  #ifdef CONFIG_FUSE_DAX
> > if (fm->fc->dax)
> > ia->in.flags |= FUSE_MAP_ALIGNMENT;
> > diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
> > index 1a7d7ace54e1..f2aae72653dc 100644
> > --- a/fs/fuse/xattr.c
> > +++ b/fs/fuse/xattr.c
> > @@ -12,24 +12,33 @@
> >  #include 
> >  
> >  int fuse_setxattr(struct inode *inode, const char *name, const void *value,
> > - size_t size, int flags)
> > + size_t size, int flags, unsigned extra_flags)
> >  {
> > struct fuse_mount *fm = get_fuse_mount(inode);
> > FUSE_ARGS(args);
> > struct fuse_setxattr_in inarg;
> > +   struct fuse_setxattr_in_v2 inarg_v2;
> > +   bool setxattr_v2 = fm->fc->setxattr_v2;
> > int err;
> >  
> > if (fm->fc->no_setxattr)
> > return -EOPNOTSUPP;
> >  
> > memset(, 0, sizeof(inarg));
> > -   inarg.size = size;
> > -   inarg.fla

Re: [PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2

2021-03-29 Thread Vivek Goyal
On Mon, Mar 29, 2021 at 03:50:37PM +0100, Luis Henriques wrote:
> On Thu, Mar 25, 2021 at 11:18:22AM -0400, Vivek Goyal wrote:
> > Fuse client needs to send additional information to file server when
> > it calls SETXATTR(system.posix_acl_access). Right now there is no extra
> > space in fuse_setxattr_in. So introduce a v2 of the structure which has
> > more space in it and can be used to send extra flags.
> > 
> > "struct fuse_setxattr_in_v2" is only used if file server opts-in for it 
> > using
> > flag FUSE_SETXATTR_V2 during feature negotiations.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/fuse/acl.c |  2 +-
> >  fs/fuse/fuse_i.h  |  5 -
> >  fs/fuse/inode.c   |  4 +++-
> >  fs/fuse/xattr.c   | 21 +++--
> >  include/uapi/linux/fuse.h | 10 ++
> >  5 files changed, 33 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
> > index e9c0f916349d..d31260a139d4 100644
> > --- a/fs/fuse/acl.c
> > +++ b/fs/fuse/acl.c
> > @@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, 
> > struct inode *inode,
> > return ret;
> > }
> >  
> > -   ret = fuse_setxattr(inode, name, value, size, 0);
> > +   ret = fuse_setxattr(inode, name, value, size, 0, 0);
> > kfree(value);
> > } else {
> > ret = fuse_removexattr(inode, name);
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 63d97a15ffde..d00bf0b9a38c 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -668,6 +668,9 @@ struct fuse_conn {
> > /** Is setxattr not implemented by fs? */
> > unsigned no_setxattr:1;
> >  
> > +   /** Does file server support setxattr_v2 */
> > +   unsigned setxattr_v2:1;
> > +
> 
> Minor (pedantic!) comment: most of the fields here start with 'no_*', so
> maybe it's worth setting the logic to use 'no_setxattr_v2' instead?

Hi Luis,

"setxattr_v2" kind of makes more sense to me because it is disabled
by default untile and unless client opts in. If I use no_setxattr_v2,
then it means by default I will have to initialize it to 1. Right
now, following automatically takes care of it.

fc = kzalloc(sizeof(struct fuse_conn), GFP_KERNEL);

Also, there are other examples which don't use the "no_" prefix:

auto_inval_data, explicit_inval_data, do_readdirplus, readdirplus_auto,
async_dio, and the list goes on.
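
To illustrate the point (a minimal sketch, not actual fuse code; the
no_setxattr_v2 field here is purely hypothetical):

struct fuse_conn_sketch {
	unsigned setxattr_v2:1;		/* stays 0 (disabled) after kzalloc() */
	unsigned no_setxattr_v2:1;	/* would need an explicit "= 1" in
					 * fuse_conn_init() to mean disabled */
};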

Vivek



[PATCH v2 2/2] fuse: Add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID

2021-03-25 Thread Vivek Goyal
When a POSIX access ACL is set, it can have an effect on the file mode, and
SGID may also need to be cleared if:

- None of caller's group/supplementary groups match file owner group.
AND
- Caller is not privileged (no CAP_FSETID).

As of now the fuse server is responsible for changing the file mode as well. But
it does not know whether to clear SGID or not.

So add a flag FUSE_SETXATTR_ACL_KILL_SGID and send this info with
SETXATTR to let the file server know that SGID needs to be cleared as well.

Reported-by: Luis Henriques 
Signed-off-by: Vivek Goyal 
---
 fs/fuse/acl.c | 8 +++-
 include/uapi/linux/fuse.h | 7 +++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index d31260a139d4..8819ceb0a4e5 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -71,6 +71,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
inode *inode,
return -EINVAL;
 
if (acl) {
+   unsigned extra_flags = 0;
/*
 * Fuse userspace is responsible for updating access
 * permissions in the inode, if needed. fuse_setxattr
@@ -94,7 +95,12 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
inode *inode,
return ret;
}
 
-   ret = fuse_setxattr(inode, name, value, size, 0, 0);
+   if (fc->setxattr_v2 &&
+   !in_group_p(i_gid_into_mnt(&init_user_ns, inode)) &&
+   !capable_wrt_inode_uidgid(&init_user_ns, inode, CAP_FSETID))
+   extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID;
+
+   ret = fuse_setxattr(inode, name, value, size, 0, extra_flags);
kfree(value);
} else {
ret = fuse_removexattr(inode, name);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1bb555c1c117..08c11a7beaa7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -180,6 +180,7 @@
  *  - add FUSE_HANDLE_KILLPRIV_V2, FUSE_WRITE_KILL_SUIDGID, FATTR_KILL_SUIDGID
  *  - add FUSE_OPEN_KILL_SUIDGID
  *  - add FUSE_SETXATTR_V2
+ *  - add FUSE_SETXATTR_ACL_KILL_SGID
  */
 
 #ifndef _LINUX_FUSE_H
@@ -454,6 +455,12 @@ struct fuse_file_lock {
  */
 #define FUSE_OPEN_KILL_SUIDGID (1 << 0)
 
+/**
+ * setxattr flags
+ * FUSE_SETXATTR_ACL_KILL_SGID: Clear SGID when system.posix_acl_access is set
+ */
+#define FUSE_SETXATTR_ACL_KILL_SGID (1 << 0)
+
 enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2,  /* no reply */
-- 
2.25.4



[PATCH v2 1/2] fuse: Add support for FUSE_SETXATTR_V2

2021-03-25 Thread Vivek Goyal
The fuse client needs to send additional information to the file server when
it calls SETXATTR(system.posix_acl_access). Right now there is no extra
space in fuse_setxattr_in. So introduce a v2 of the structure which has
more space in it and can be used to send extra flags.

"struct fuse_setxattr_in_v2" is only used if file server opts-in for it using
flag FUSE_SETXATTR_V2 during feature negotiations.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/acl.c |  2 +-
 fs/fuse/fuse_i.h  |  5 -
 fs/fuse/inode.c   |  4 +++-
 fs/fuse/xattr.c   | 21 +++--
 include/uapi/linux/fuse.h | 10 ++
 5 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index e9c0f916349d..d31260a139d4 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
inode *inode,
return ret;
}
 
-   ret = fuse_setxattr(inode, name, value, size, 0);
+   ret = fuse_setxattr(inode, name, value, size, 0, 0);
kfree(value);
} else {
ret = fuse_removexattr(inode, name);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 63d97a15ffde..d00bf0b9a38c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -668,6 +668,9 @@ struct fuse_conn {
/** Is setxattr not implemented by fs? */
unsigned no_setxattr:1;
 
+   /** Does file server support setxattr_v2 */
+   unsigned setxattr_v2:1;
+
/** Is getxattr not implemented by fs? */
unsigned no_getxattr:1;
 
@@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool locked);
 bool fuse_lock_inode(struct inode *inode);
 
 int fuse_setxattr(struct inode *inode, const char *name, const void *value,
- size_t size, int flags);
+ size_t size, int flags, unsigned extra_flags);
 ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
  size_t size);
 ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b0e18b470e91..1c726df13f80 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount *fm, 
struct fuse_args *args,
fc->handle_killpriv_v2 = 1;
fm->sb->s_flags |= SB_NOSEC;
}
+   if (arg->flags & FUSE_SETXATTR_V2)
+   fc->setxattr_v2 = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm)
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
-   FUSE_HANDLE_KILLPRIV_V2;
+   FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2;
 #ifdef CONFIG_FUSE_DAX
if (fm->fc->dax)
ia->in.flags |= FUSE_MAP_ALIGNMENT;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 1a7d7ace54e1..f2aae72653dc 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -12,24 +12,33 @@
 #include 
 
 int fuse_setxattr(struct inode *inode, const char *name, const void *value,
- size_t size, int flags)
+ size_t size, int flags, unsigned extra_flags)
 {
struct fuse_mount *fm = get_fuse_mount(inode);
FUSE_ARGS(args);
struct fuse_setxattr_in inarg;
+   struct fuse_setxattr_in_v2 inarg_v2;
+   bool setxattr_v2 = fm->fc->setxattr_v2;
int err;
 
if (fm->fc->no_setxattr)
return -EOPNOTSUPP;
 
memset(&inarg, 0, sizeof(inarg));
-   inarg.size = size;
-   inarg.flags = flags;
+   memset(&inarg_v2, 0, sizeof(inarg_v2));
+   if (setxattr_v2) {
+   inarg_v2.size = size;
+   inarg_v2.flags = flags;
+   inarg_v2.setxattr_flags = extra_flags;
+   } else {
+   inarg.size = size;
+   inarg.flags = flags;
+   }
args.opcode = FUSE_SETXATTR;
args.nodeid = get_node_id(inode);
args.in_numargs = 3;
-   args.in_args[0].size = sizeof(inarg);
-   args.in_args[0].value = &inarg;
+   args.in_args[0].size = setxattr_v2 ? sizeof(inarg_v2) : sizeof(inarg);
+   args.in_args[0].value = setxattr_v2 ? &inarg_v2 : (void *)&inarg;
args.in_args[1].size = strlen(name) + 1;
args.in_args[1].value = name;
args.in_args[2].size = size;
@@ -199,7 +208,7 @@ static int fuse_xattr_set(const struct xattr_handler 
*handler,
if (!value)
return fuse_removexattr(inode, name);
 
-   return fuse_setxattr(inode, name, value, size, flags

[PATCH v2 0/2] fuse: Fix clearing SGID when access ACL is set

2021-03-25 Thread Vivek Goyal


Hi,

This is V2 of the patchset. Posted V1 here.

https://lore.kernel.org/linux-fsdevel/20210319195547.427371-1-vgo...@redhat.com/

Changes since V1:

- Dropped the helper to determine if SGID should be cleared and open
  coded it instead. I will follow up on the helper separately in a different
  patch series. There are a few places already which open code this, so
  for now fuse can do the same. At least I can make progress on this
  and virtiofs can enable ACL support.

Luis reported that xfstests generic/375 fails with virtiofs. A little
debugging showed that when a POSIX access ACL is set, in some
cases SGID needs to be cleared, and that does not happen with virtiofs.

Setting a POSIX access ACL can lead to a mode change, and it can also lead
to clearing of SGID. fuse relies on the file server taking care of all
the mode changes. But the file server does not have enough information to
determine whether SGID should be cleared or not.

Hence this patch series adds support to send a flag in the SETXATTR message
to tell the server to clear SGID.

I have staged corresponding virtiofsd patches here.

https://github.com/rhvgoyal/qemu/commits/acl-sgid-setxattr-flag

With these patches applied "./check -g acl" passes now on virtiofs.

Thanks
Vivek

Vivek Goyal (2):
  fuse: Add support for FUSE_SETXATTR_V2
  fuse: Add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID

 fs/fuse/acl.c |  8 +++-
 fs/fuse/fuse_i.h  |  5 -
 fs/fuse/inode.c   |  4 +++-
 fs/fuse/xattr.c   | 21 +++--
 include/uapi/linux/fuse.h | 17 +
 5 files changed, 46 insertions(+), 9 deletions(-)

-- 
2.25.4



Re: [PATCH 1/3] posix_acl: Add a helper to determine if SGID should be cleared

2021-03-23 Thread Vivek Goyal
On Tue, Mar 23, 2021 at 10:32:33AM +0100, Christian Brauner wrote:
> On Mon, Mar 22, 2021 at 01:01:11PM -0400, Vivek Goyal wrote:
> > On Sat, Mar 20, 2021 at 11:03:22AM +0100, Christian Brauner wrote:
> > > On Fri, Mar 19, 2021 at 11:42:48PM +0100, Andreas Grünbacher wrote:
> > > > Hi,
> > > > 
> > > > Am Fr., 19. März 2021 um 20:58 Uhr schrieb Vivek Goyal 
> > > > :
> > > > > posix_acl_update_mode() determines what's the equivalent mode and if 
> > > > > SGID
> > > > > needs to be cleared or not. I need to make use of this code in fuse
> > > > > as well. Fuse will send this information to virtiofs file server and
> > > > > file server will take care of clearing SGID if it needs to be done.
> > > > >
> > > > > Hence move this code in a separate helper so that more than one place
> > > > > can call into it.
> > > > >
> > > > > Cc: Jan Kara 
> > > > > Cc: Andreas Gruenbacher 
> > > > > Cc: Alexander Viro 
> > > > > Signed-off-by: Vivek Goyal 
> > > > > ---
> > > > >  fs/posix_acl.c|  3 +--
> > > > >  include/linux/posix_acl.h | 11 +++
> > > > >  2 files changed, 12 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> > > > > index f3309a7edb49..2d62494c4a5b 100644
> > > > > --- a/fs/posix_acl.c
> > > > > +++ b/fs/posix_acl.c
> > > > > @@ -684,8 +684,7 @@ int posix_acl_update_mode(struct user_namespace 
> > > > > *mnt_userns,
> > > > > return error;
> > > > > if (error == 0)
> > > > > *acl = NULL;
> > > > > -   if (!in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
> > > > > -   !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
> > > > > +   if (posix_acl_mode_clear_sgid(mnt_userns, inode))
> > > > > mode &= ~S_ISGID;
> > > > > *mode_p = mode;
> > > > > return 0;
> > > > > diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
> > > > > index 307094ebb88c..073c5e546de3 100644
> > > > > --- a/include/linux/posix_acl.h
> > > > > +++ b/include/linux/posix_acl.h
> > > > > @@ -59,6 +59,17 @@ posix_acl_release(struct posix_acl *acl)
> > > > >  }
> > > > >
> > > > >
> > > > > +static inline bool
> > > > > +posix_acl_mode_clear_sgid(struct user_namespace *mnt_userns,
> > > > > + struct inode *inode)
> > > > > +{
> > > > > +   if (!in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
> > > > > +   !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
> > > > > +   return true;
> > > > > +
> > > > > +   return false;
> > > > 
> > > > That's just
> > > > 
> > > > return !in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
> > > > !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID);
> > > > 
> > > > The same pattern we have in posix_acl_update_mode also exists in
> > > > setattr_copy and inode_init_owner, and almost the same pattern exists
> > > > in setattr_prepare, so can this be cleaned up as well? The function
> > > > also isn't POSIX ACL specific, so the function name is misleading.
> > > 
> > > Good idea but that should probably be spun into a separate patchset that
> > > only touches the vfs parts.
> > 
> > IIUC, suggestion is that I should write a VFS helper (and not posix
> > acl helper) and use that helper at other places too in the code. 
> 
> If there are other callers outside of acls (which should be iirc) then
> yes.
> 
> > 
> > I will do that and post in a separate patch series.
> 
> Yeah, I think that makes more sense to have this be a separate change
> instead of putting it together with the fuse change if it touches more
> than one place.

I do see that there are a few places where this pattern is used and at least
some of them should be a straightforward conversion.

I will follow this up in a separate patch series. I suspect that this
might take a little bit of back and forth, so I will follow up with fuse
changes in parallel and open code it there. Once this series gets merged,
I will send another patch for fuse.

Thanks
Vivek



Re: [PATCH] fuse: Fix a potential double free in virtio_fs_get_tree

2021-03-23 Thread Vivek Goyal
On Mon, Mar 22, 2021 at 10:18:31PM -0700, Lv Yunlong wrote:
> In virtio_fs_get_tree, fm is allocated by kzalloc() and
> assigned to fsc->s_fs_info by fsc->s_fs_info=fm statement.
> If the kzalloc() failed, it will goto err directly, so that
> fsc->s_fs_info must be non-NULL and fm will be freed.

sget_fc() will either consume fsc->s_fs_info (resetting it to NULL) in
case a new super block is allocated. In that case we don't
free fc or fm.

Or, sget_fc() will return with fsc->s_fs_info set in case we already
found a super block. In that case we need to free fc and fm.

In case of error from sget_fc(), fc/fm need to be freed first and
then the error needs to be returned to the caller.

if (IS_ERR(sb))
return PTR_ERR(sb);


If we allocated a new super block in sget_fc(), then the next step is
to initialize it.

if (!sb->s_root) {
err = virtio_fs_fill_super(sb, fsc);
}

If we run into errors here, then fc/fm need to be freed.

So current code looks fine to me.
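
Condensed, the ownership rules described above look like this (a sketch
of the existing flow, not a complete function):

	fsc->s_fs_info = fm;
	sb = sget_fc(fsc, virtio_fs_test_super, set_anon_super_fc);
	if (fsc->s_fs_info) {
		/* existing sb found: sget_fc() did not consume fm, free it */
		fuse_conn_put(fc);
		kfree(fm);
	}
	if (IS_ERR(sb))
		return PTR_ERR(sb);	/* fm was already freed above if needed */

	if (!sb->s_root) {
		/* new sb: fm was consumed into sb->s_fs_info */
		err = virtio_fs_fill_super(sb, fsc);
		/* if this fails, fc/fm need to be freed in this error path */
	}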

Vivek

> 
> But later fm is freed again when virtio_fs_fill_super() fialed.
> I think the statement if (fsc->s_fs_info) {kfree(fm);} is
> misplaced.
> 
> My patch puts this statement in the correct palce to avoid
> double free.
> 
> Signed-off-by: Lv Yunlong 
> ---
>  fs/fuse/virtio_fs.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 8868ac31a3c0..727cf436828f 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1437,10 +1437,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>  
>   fsc->s_fs_info = fm;
>   sb = sget_fc(fsc, virtio_fs_test_super, set_anon_super_fc);
> - if (fsc->s_fs_info) {
> - fuse_conn_put(fc);
> - kfree(fm);
> - }
> +
>   if (IS_ERR(sb))
>   return PTR_ERR(sb);
>  
> @@ -1457,6 +1454,11 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>   sb->s_flags |= SB_ACTIVE;
>   }
>  
> + if (fsc->s_fs_info) {
> + fuse_conn_put(fc);
> + kfree(fm);
> + }
> +
>   WARN_ON(fsc->root);
>   fsc->root = dget(sb->s_root);
>   return 0;
> -- 
> 2.25.1
> 
> 



Re: [PATCH 2/3] virtiofs: split requests that exceed virtqueue size

2021-03-22 Thread Vivek Goyal
On Thu, Mar 18, 2021 at 04:17:51PM +0100, Miklos Szeredi wrote:
> On Thu, Mar 18, 2021 at 08:52:22AM -0500, Connor Kuehl wrote:
> > If an incoming FUSE request can't fit on the virtqueue, the request is
> > placed onto a workqueue so a worker can try to resubmit it later where
> > there will (hopefully) be space for it next time.
> > 
> > This is fine for requests that aren't larger than a virtqueue's maximum
> > capacity. However, if a request's size exceeds the maximum capacity of
> > the virtqueue (even if the virtqueue is empty), it will be doomed to a
> > life of being placed on the workqueue, removed, discovered it won't fit,
> > and placed on the workqueue yet again.
> > 
> > Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
> > Descriptors) of the virtio spec:
> > 
> >   "A driver MUST NOT create a descriptor chain longer than the Queue
> >   Size of the device."
> > 
> > To fix this, limit the number of pages FUSE will use for an overall
> > request. This way, each request can realistically fit on the virtqueue
> > when it is decomposed into a scattergather list and avoid violating
> > section 2.6.5.3.1 of the virtio spec.
> 
> I removed the conditional compilation and renamed the limit.  Also made
> virtio_fs_get_tree() bail out if it hit the WARN_ON().  Updated patch below.
> 
> The virtio_ring patch in this series should probably go through the respective
> subsystem tree.
> 
> 
> Thanks,
> Miklos
> 
> ---
> From: Connor Kuehl 
> Subject: virtiofs: split requests that exceed virtqueue size
> Date: Thu, 18 Mar 2021 08:52:22 -0500
> 
> If an incoming FUSE request can't fit on the virtqueue, the request is
> placed onto a workqueue so a worker can try to resubmit it later where
> there will (hopefully) be space for it next time.
> 
> This is fine for requests that aren't larger than a virtqueue's maximum
> capacity.  However, if a request's size exceeds the maximum capacity of the
> virtqueue (even if the virtqueue is empty), it will be doomed to a life of
> being placed on the workqueue, removed, discovered it won't fit, and placed
> on the workqueue yet again.
> 
> Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
> Descriptors) of the virtio spec:
> 
>   "A driver MUST NOT create a descriptor chain longer than the Queue
>   Size of the device."
> 
> To fix this, limit the number of pages FUSE will use for an overall
> request.  This way, each request can realistically fit on the virtqueue
> when it is decomposed into a scattergather list and avoid violating section
> 2.6.5.3.1 of the virtio spec.
> 
> Signed-off-by: Connor Kuehl 
> Signed-off-by: Miklos Szeredi 
> ---

Looks good to me.

Reviewed-by: Vivek Goyal 

Vivek

>  fs/fuse/fuse_i.h|3 +++
>  fs/fuse/inode.c |3 ++-
>  fs/fuse/virtio_fs.c |   19 +--
>  3 files changed, 22 insertions(+), 3 deletions(-)
> 
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -555,6 +555,9 @@ struct fuse_conn {
>   /** Maxmum number of pages that can be used in a single request */
>   unsigned int max_pages;
>  
> + /** Constrain ->max_pages to this value during feature negotiation */
> + unsigned int max_pages_limit;
> +
>   /** Input queue */
>   struct fuse_iqueue iq;
>  
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -712,6 +712,7 @@ void fuse_conn_init(struct fuse_conn *fc
>   fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
>   fc->user_ns = get_user_ns(user_ns);
>   fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
> + fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
>  
>   INIT_LIST_HEAD(>mounts);
>   list_add(>fc_entry, >mounts);
> @@ -1040,7 +1041,7 @@ static void process_init_reply(struct fu
>   fc->abort_err = 1;
>   if (arg->flags & FUSE_MAX_PAGES) {
>   fc->max_pages =
> - min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> + min_t(unsigned int, fc->max_pages_limit,
>   max_t(unsigned int, arg->max_pages, 1));
>   }
>   if (IS_ENABLED(CONFIG_FUSE_DAX) &&
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -18,6 +18,12 @@
>  #include 
>  #include "fuse_i.h"
>  
> +/* Used to help calculate the FUSE connection's max_pages limit for a 
> request's
> + * size. Parts of the struct fuse_req are sliced into scattergather lists in
> + * addition to the pa

Re: [PATCH 1/3] posix_acl: Add a helper to determine if SGID should be cleared

2021-03-22 Thread Vivek Goyal
On Sat, Mar 20, 2021 at 11:03:22AM +0100, Christian Brauner wrote:
> On Fri, Mar 19, 2021 at 11:42:48PM +0100, Andreas Grünbacher wrote:
> > Hi,
> > 
> > Am Fr., 19. März 2021 um 20:58 Uhr schrieb Vivek Goyal :
> > > posix_acl_update_mode() determines what's the equivalent mode and if SGID
> > > needs to be cleared or not. I need to make use of this code in fuse
> > > as well. Fuse will send this information to virtiofs file server and
> > > file server will take care of clearing SGID if it needs to be done.
> > >
> > > Hence move this code in a separate helper so that more than one place
> > > can call into it.
> > >
> > > Cc: Jan Kara 
> > > Cc: Andreas Gruenbacher 
> > > Cc: Alexander Viro 
> > > Signed-off-by: Vivek Goyal 
> > > ---
> > >  fs/posix_acl.c|  3 +--
> > >  include/linux/posix_acl.h | 11 +++
> > >  2 files changed, 12 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> > > index f3309a7edb49..2d62494c4a5b 100644
> > > --- a/fs/posix_acl.c
> > > +++ b/fs/posix_acl.c
> > > @@ -684,8 +684,7 @@ int posix_acl_update_mode(struct user_namespace 
> > > *mnt_userns,
> > > return error;
> > > if (error == 0)
> > > *acl = NULL;
> > > -   if (!in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
> > > -   !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
> > > +   if (posix_acl_mode_clear_sgid(mnt_userns, inode))
> > > mode &= ~S_ISGID;
> > > *mode_p = mode;
> > > return 0;
> > > diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
> > > index 307094ebb88c..073c5e546de3 100644
> > > --- a/include/linux/posix_acl.h
> > > +++ b/include/linux/posix_acl.h
> > > @@ -59,6 +59,17 @@ posix_acl_release(struct posix_acl *acl)
> > >  }
> > >
> > >
> > > +static inline bool
> > > +posix_acl_mode_clear_sgid(struct user_namespace *mnt_userns,
> > > + struct inode *inode)
> > > +{
> > > +   if (!in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
> > > +   !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
> > > +   return true;
> > > +
> > > +   return false;
> > 
> > That's just
> > 
> > return !in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
> > !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID);
> > 
> > The same pattern we have in posix_acl_update_mode also exists in
> > setattr_copy and inode_init_owner, and almost the same pattern exists
> > in setattr_prepare, so can this be cleaned up as well? The function
> > also isn't POSIX ACL specific, so the function name is misleading.
> 
> Good idea but that should probably be spun into a separate patchset that
> only touches the vfs parts.

IIUC, the suggestion is that I should write a VFS helper (and not a posix
acl helper) and use that helper in other places in the code too.

I will do that and post in a separate patch series.
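
For reference, a rough sketch of the kind of generic helper I have in mind
(the name and placement are just placeholders, not what will actually get
posted):

static inline bool
inode_should_clear_sgid(struct user_namespace *mnt_userns,
			const struct inode *inode)
{
	return !in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
	       !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID);
}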

Thanks
Vivek



[PATCH 2/3] fuse: Add support for FUSE_SETXATTR_V2

2021-03-19 Thread Vivek Goyal
The fuse client needs to send additional information to the file server when
it calls SETXATTR(system.posix_acl_access). Right now there is no extra
space in fuse_setxattr_in. So introduce a v2 of the structure which has
more space in it and can be used to send extra flags.

"struct fuse_setxattr_in_v2" is only used if file server opts-in for it using
flag FUSE_SETXATTR_V2 during feature negotiations.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/acl.c |  2 +-
 fs/fuse/fuse_i.h  |  5 -
 fs/fuse/inode.c   |  4 +++-
 fs/fuse/xattr.c   | 21 +++--
 include/uapi/linux/fuse.h | 10 ++
 5 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index e9c0f916349d..d31260a139d4 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -94,7 +94,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
inode *inode,
return ret;
}
 
-   ret = fuse_setxattr(inode, name, value, size, 0);
+   ret = fuse_setxattr(inode, name, value, size, 0, 0);
kfree(value);
} else {
ret = fuse_removexattr(inode, name);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 63d97a15ffde..d00bf0b9a38c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -668,6 +668,9 @@ struct fuse_conn {
/** Is setxattr not implemented by fs? */
unsigned no_setxattr:1;
 
+   /** Does file server support setxattr_v2 */
+   unsigned setxattr_v2:1;
+
/** Is getxattr not implemented by fs? */
unsigned no_getxattr:1;
 
@@ -1170,7 +1173,7 @@ void fuse_unlock_inode(struct inode *inode, bool locked);
 bool fuse_lock_inode(struct inode *inode);
 
 int fuse_setxattr(struct inode *inode, const char *name, const void *value,
- size_t size, int flags);
+ size_t size, int flags, unsigned extra_flags);
 ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
  size_t size);
 ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b0e18b470e91..1c726df13f80 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1052,6 +1052,8 @@ static void process_init_reply(struct fuse_mount *fm, 
struct fuse_args *args,
fc->handle_killpriv_v2 = 1;
fm->sb->s_flags |= SB_NOSEC;
}
+   if (arg->flags & FUSE_SETXATTR_V2)
+   fc->setxattr_v2 = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1095,7 +1097,7 @@ void fuse_send_init(struct fuse_mount *fm)
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
-   FUSE_HANDLE_KILLPRIV_V2;
+   FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_V2;
 #ifdef CONFIG_FUSE_DAX
if (fm->fc->dax)
ia->in.flags |= FUSE_MAP_ALIGNMENT;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 1a7d7ace54e1..f2aae72653dc 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -12,24 +12,33 @@
 #include 
 
 int fuse_setxattr(struct inode *inode, const char *name, const void *value,
- size_t size, int flags)
+ size_t size, int flags, unsigned extra_flags)
 {
struct fuse_mount *fm = get_fuse_mount(inode);
FUSE_ARGS(args);
struct fuse_setxattr_in inarg;
+   struct fuse_setxattr_in_v2 inarg_v2;
+   bool setxattr_v2 = fm->fc->setxattr_v2;
int err;
 
if (fm->fc->no_setxattr)
return -EOPNOTSUPP;
 
memset(&inarg, 0, sizeof(inarg));
-   inarg.size = size;
-   inarg.flags = flags;
+   memset(&inarg_v2, 0, sizeof(inarg_v2));
+   if (setxattr_v2) {
+   inarg_v2.size = size;
+   inarg_v2.flags = flags;
+   inarg_v2.setxattr_flags = extra_flags;
+   } else {
+   inarg.size = size;
+   inarg.flags = flags;
+   }
args.opcode = FUSE_SETXATTR;
args.nodeid = get_node_id(inode);
args.in_numargs = 3;
-   args.in_args[0].size = sizeof(inarg);
-   args.in_args[0].value = &inarg;
+   args.in_args[0].size = setxattr_v2 ? sizeof(inarg_v2) : sizeof(inarg);
+   args.in_args[0].value = setxattr_v2 ? &inarg_v2 : (void *)&inarg;
args.in_args[1].size = strlen(name) + 1;
args.in_args[1].value = name;
args.in_args[2].size = size;
@@ -199,7 +208,7 @@ static int fuse_xattr_set(const struct xattr_handler 
*handler,
if (!value)
return fuse_removexattr(inode, name);
 
-   return fuse_setxattr(inode, name, value, size, flags

[PATCH 1/3] posix_acl: Add a helper to determine if SGID should be cleared

2021-03-19 Thread Vivek Goyal
posix_acl_update_mode() determines what's the equivalent mode and if SGID
needs to be cleared or not. I need to make use of this code in fuse
as well. Fuse will send this information to virtiofs file server and
file server will take care of clearing SGID if it needs to be done.

Hence move this code in a separate helper so that more than one place
can call into it.

Cc: Jan Kara 
Cc: Andreas Gruenbacher 
Cc: Alexander Viro 
Signed-off-by: Vivek Goyal 
---
 fs/posix_acl.c|  3 +--
 include/linux/posix_acl.h | 11 +++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index f3309a7edb49..2d62494c4a5b 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -684,8 +684,7 @@ int posix_acl_update_mode(struct user_namespace *mnt_userns,
return error;
if (error == 0)
*acl = NULL;
-   if (!in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
-   !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
+   if (posix_acl_mode_clear_sgid(mnt_userns, inode))
mode &= ~S_ISGID;
*mode_p = mode;
return 0;
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 307094ebb88c..073c5e546de3 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -59,6 +59,17 @@ posix_acl_release(struct posix_acl *acl)
 }
 
 
+static inline bool
+posix_acl_mode_clear_sgid(struct user_namespace *mnt_userns,
+ struct inode *inode)
+{
+   if (!in_group_p(i_gid_into_mnt(mnt_userns, inode)) &&
+   !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
+   return true;
+
+   return false;
+}
+
 /* posix_acl.c */
 
 extern void posix_acl_init(struct posix_acl *, int);
-- 
2.25.4



[PATCH 0/3] fuse: Fix clearing SGID when access ACL is set

2021-03-19 Thread Vivek Goyal
Hi,

Luis reported that xfstests generic/375 fails with virtiofs. A little
debugging showed that when a POSIX access ACL is set, in some
cases SGID needs to be cleared, and that does not happen with virtiofs.

Setting a POSIX access ACL can lead to a mode change, and it can also lead
to clearing of SGID. fuse relies on the file server taking care of all
the mode changes. But the file server does not have enough information to
determine whether SGID should be cleared or not.

Hence this patch series adds support to send a flag in the SETXATTR message
to tell the server to clear SGID.

I have staged corresponding virtiofsd patches here.

https://github.com/rhvgoyal/qemu/commits/acl-sgid-setxattr-flag

With these patches applied "./check -g acl" passes now on virtiofs.

Vivek Goyal (3):
  posix_acl: Add a helper to determine if SGID should be cleared
  fuse: Add support for FUSE_SETXATTR_V2
  fuse: Add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID

 fs/fuse/acl.c |  7 ++-
 fs/fuse/fuse_i.h  |  5 -
 fs/fuse/inode.c   |  4 +++-
 fs/fuse/xattr.c   | 21 +++--
 fs/posix_acl.c|  3 +--
 include/linux/posix_acl.h | 11 +++
 include/uapi/linux/fuse.h | 17 +
 7 files changed, 57 insertions(+), 11 deletions(-)

-- 
2.25.4



[PATCH 3/3] fuse: Add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID

2021-03-19 Thread Vivek Goyal
When a POSIX access ACL is set, it can have an effect on the file mode, and
SGID may also need to be cleared if:

- None of caller's group/supplementary groups match file owner group.
AND
- Caller is not privileged (no CAP_FSETID).

As of now the fuse server is responsible for changing the file mode as well. But
it does not know whether to clear SGID or not.

So add a flag FUSE_SETXATTR_ACL_KILL_SGID and send this info with
SETXATTR to let the file server know that SGID needs to be cleared as well.

Reported-by: Luis Henriques 
Signed-off-by: Vivek Goyal 
---
 fs/fuse/acl.c | 7 ++-
 include/uapi/linux/fuse.h | 7 +++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index d31260a139d4..45358124181a 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -71,6 +71,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
inode *inode,
return -EINVAL;
 
if (acl) {
+   unsigned extra_flags = 0;
/*
 * Fuse userspace is responsible for updating access
 * permissions in the inode, if needed. fuse_setxattr
@@ -94,7 +95,11 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct 
inode *inode,
return ret;
}
 
-   ret = fuse_setxattr(inode, name, value, size, 0, 0);
+   if (fc->setxattr_v2 &&
+   posix_acl_mode_clear_sgid(&init_user_ns, inode))
+   extra_flags |= FUSE_SETXATTR_ACL_KILL_SGID;
+
+   ret = fuse_setxattr(inode, name, value, size, 0, extra_flags);
kfree(value);
} else {
ret = fuse_removexattr(inode, name);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1bb555c1c117..08c11a7beaa7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -180,6 +180,7 @@
  *  - add FUSE_HANDLE_KILLPRIV_V2, FUSE_WRITE_KILL_SUIDGID, FATTR_KILL_SUIDGID
  *  - add FUSE_OPEN_KILL_SUIDGID
  *  - add FUSE_SETXATTR_V2
+ *  - add FUSE_SETXATTR_ACL_KILL_SGID
  */
 
 #ifndef _LINUX_FUSE_H
@@ -454,6 +455,12 @@ struct fuse_file_lock {
  */
 #define FUSE_OPEN_KILL_SUIDGID (1 << 0)
 
+/**
+ * setxattr flags
+ * FUSE_SETXATTR_ACL_KILL_SGID: Clear SGID when system.posix_acl_access is set
+ */
+#define FUSE_SETXATTR_ACL_KILL_SGID (1 << 0)
+
 enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2,  /* no reply */
-- 
2.25.4



Re: [PATCH 2/3] virtiofs: split requests that exceed virtqueue size

2021-03-19 Thread Vivek Goyal
On Thu, Mar 18, 2021 at 08:52:22AM -0500, Connor Kuehl wrote:
> If an incoming FUSE request can't fit on the virtqueue, the request is
> placed onto a workqueue so a worker can try to resubmit it later where
> there will (hopefully) be space for it next time.
> 
> This is fine for requests that aren't larger than a virtqueue's maximum
> capacity. However, if a request's size exceeds the maximum capacity of
> the virtqueue (even if the virtqueue is empty), it will be doomed to a
> life of being placed on the workqueue, removed, discovered it won't fit,
> and placed on the workqueue yet again.
> 
> Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
> Descriptors) of the virtio spec:
> 
>   "A driver MUST NOT create a descriptor chain longer than the Queue
>   Size of the device."
> 
> To fix this, limit the number of pages FUSE will use for an overall
> request. This way, each request can realistically fit on the virtqueue
> when it is decomposed into a scattergather list and avoid violating
> section 2.6.5.3.1 of the virtio spec.

Hi Connor,

So as of now, if a request is bigger than what the virtqueue can support,
it never gets dispatched and the caller waits indefinitely? So this patch
will fix it by forcing fuse to split the request. That sounds good.


[..]
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 8868ac31a3c0..a6ffba85d59a 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -18,6 +18,12 @@
>  #include 
>  #include "fuse_i.h"
>  
> +/* Used to help calculate the FUSE connection's max_pages limit for a 
> request's
> + * size. Parts of the struct fuse_req are sliced into scattergather lists in
> + * addition to the pages used, so this can help account for that overhead.
> + */
> +#define FUSE_HEADER_OVERHEAD4

How did you arrive at this overhead? Is it the following:

- One sg element for fuse_in_header.
- One sg element for input arguments.
- One sg element for fuse_out_header.
- One sg element for output args.

Thanks
Vivek



Re: Question about sg_count_fuse_req() in linux/fs/fuse/virtio_fs.c

2021-03-18 Thread Vivek Goyal
On Wed, Mar 17, 2021 at 01:12:01PM -0500, Connor Kuehl wrote:
> Hi,
> 
> I've been familiarizing myself with the virtiofs guest kernel module and I'm
> trying to better understand how virtiofs maps a FUSE request into
> scattergather lists.
> 
> sg_count_fuse_req() starts knowing that there will be at least one in
> header, as shown here (which makes sense):
> 
> unsigned int size, total_sgs = 1 /* fuse_in_header */;
> 
> However, I'm confused about this snippet right beneath it:
> 
> if (args->in_numargs - args->in_pages)
> total_sgs += 1;
> 
> What is the significance of the sg that is needed in the cases where this
> branch is taken? I'm not sure what its relationship is with args->in_numargs
> since it will increment total_sgs regardless args->in_numargs is 3, 2, or
> even 1 if args->in_pages is false.

Hi Connor,

I think all the in args are being mapped into a single scatter gather
element and that's why it does not matter whether in_numargs is 3, 2 or 1.
They will be mapped in a single element.

sg_init_fuse_args()
{
len = fuse_len_args(numargs - argpages, args);
if (len)
sg_init_one(&sg[total_sgs++], argbuf, len);
}

out_sgs += sg_init_fuse_args(&sg[out_sgs], req,
 (struct fuse_arg *)args->in_args,
 args->in_numargs, args->in_pages,
 req->argbuf, &argbuf_used);

When we are sending some data in pages, we set args->in_pages
to true. In that case, the last element of args->in_args[] contains the
total size in bytes of the additional pages we are sending, and it is not
part of the in_args being mapped to that scattergather element. That's why
this check:

if (args->in_numargs - args->in_pages)
total_sgs += 1;
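
As a sanity check, here is a small userspace sketch of how the counts add
up for a typical FUSE_WRITE request (the numbers are only an illustration,
not taken from a trace):

#include <stdio.h>

/* mirrors the counting logic discussed above, in plain C */
static unsigned int count_sgs(unsigned int numargs, int has_pages,
			      unsigned int npages)
{
	unsigned int sgs = 1;		/* fuse_in_header / fuse_out_header */

	if (numargs - has_pages)	/* packed non-page args */
		sgs += 1;
	if (has_pages)			/* roughly one sg element per page */
		sgs += npages;
	return sgs;
}

int main(void)
{
	/* FUSE_WRITE: in = {fuse_write_in, data in 8 pages}, out = {fuse_write_out} */
	unsigned int total = count_sgs(2, 1, 8) + count_sgs(1, 0, 0);

	printf("total sg elements: %u\n", total);	/* 1 + 1 + 8 + 1 + 1 = 12 */
	return 0;
}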

Not sure when we will have a case where args->in_numargs = 1 and
args->in_pages=true. Do we ever hit that.

Thanks
Vivek

> 
> Especially since the block right below it counts pages if args->in_pages is
> true:
> 
> if (args->in_pages) {
> size = args->in_args[args->in_numargs - 1].size;
> total_sgs += sg_count_fuse_pages(ap->descs, ap->num_pages,
>  size);
> }
> 
> The rest of the routine goes on similarly but for the 'out' components.
> 
> I doubt incrementing 'total_sgs' in the first if-statement I showed above is
> vestigial, I just think my mental model of what is happening here is
> incomplete.
> 
> Any clarification is much appreciated!

> 
> Connor
> 



Re: [PATCH v2] virtiofs: fix memory leak in virtio_fs_probe()

2021-03-17 Thread Vivek Goyal
On Wed, Mar 17, 2021 at 08:44:43AM +, Luis Henriques wrote:
> When accidentally passing twice the same tag to qemu, kmemleak ended up
> reporting a memory leak in virtiofs.  Also, looking at the log I saw the
> following error (that's when I realised the duplicated tag):
> 
>   virtiofs: probe of virtio5 failed with error -17
> 
> Here's the kmemleak log for reference:
> 
> unreferenced object 0x888103d47800 (size 1024):
>   comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
>   hex dump (first 32 bytes):
> 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
> ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff  
>   backtrace:
> [<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
> [<f8aca419>] virtio_dev_probe+0x15f/0x210
> [<4d6baf3c>] really_probe+0xea/0x430
> [<a6ceeac8>] device_driver_attach+0xa8/0xb0
> [<196f47a7>] __driver_attach+0x98/0x140
> [<0b20601d>] bus_for_each_dev+0x7b/0xc0
> [<399c7b7f>] bus_add_driver+0x11b/0x1f0
> [<32b09ba7>] driver_register+0x8f/0xe0
> [<cdd55998>] 0xa002c013
> [<0ea196a2>] do_one_initcall+0x64/0x2e0
> [<08f727ce>] do_init_module+0x5c/0x260
> [<3cdedab6>] __do_sys_finit_module+0xb5/0x120
> [<ad2f48c6>] do_syscall_64+0x33/0x40
> [<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Luis Henriques 

Reviewed-by: Vivek Goyal 

Thanks
Vivek

> ---
> Changes since v1:
> - Use kfree() to free fs->vqs instead of calling virtio_fs_put()
> 
>  fs/fuse/virtio_fs.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 8868ac31a3c0..989ef4f88636 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -896,6 +896,7 @@ static int virtio_fs_probe(struct virtio_device *vdev)
>  out_vqs:
>   vdev->config->reset(vdev);
>   virtio_fs_cleanup_vqs(vdev, fs);
> + kfree(fs->vqs);
>  
>  out:
>   vdev->priv = NULL;
> 



Re: [PATCH] virtiofs: fix memory leak in virtio_fs_probe()

2021-03-16 Thread Vivek Goyal
On Tue, Mar 16, 2021 at 05:02:34PM +, Luis Henriques wrote:
> When accidentally passing twice the same tag to qemu, kmemleak ended up
> reporting a memory leak in virtiofs.  Also, looking at the log I saw the
> following error (that's when I realised the duplicated tag):
> 
>   virtiofs: probe of virtio5 failed with error -17
> 
> Here's the kmemleak log for reference:
> 
> unreferenced object 0x888103d47800 (size 1024):
>   comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
>   hex dump (first 32 bytes):
> 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
> ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff  
>   backtrace:
> [<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
> [] virtio_dev_probe+0x15f/0x210
> [<4d6baf3c>] really_probe+0xea/0x430
> [] device_driver_attach+0xa8/0xb0
> [<196f47a7>] __driver_attach+0x98/0x140
> [<0b20601d>] bus_for_each_dev+0x7b/0xc0
> [<399c7b7f>] bus_add_driver+0x11b/0x1f0
> [<32b09ba7>] driver_register+0x8f/0xe0
> [] 0xa002c013
> [<0ea196a2>] do_one_initcall+0x64/0x2e0
> [<08f727ce>] do_init_module+0x5c/0x260
> [<3cdedab6>] __do_sys_finit_module+0xb5/0x120
> [] do_syscall_64+0x33/0x40
> [<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Luis Henriques 

Hi Luis,

Thanks for the report and the fix. So it looks like the leak is happening
because we are not doing kfree(fs->vqs) in the error path.

> ---
>  fs/fuse/virtio_fs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 8868ac31a3c0..4e6ef9f24e84 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -899,7 +899,7 @@ static int virtio_fs_probe(struct virtio_device *vdev)
>  
>  out:
>   vdev->priv = NULL;
> - kfree(fs);
> + virtio_fs_put(fs);

[ CC virtio-fs list ]

The fs object is not fully formed. So calling virtio_fs_put() is a little odd.
I would expect it to be called if somebody takes a reference using _get(),
or in the final virtio_fs_remove() when the creation reference goes
away.

How about open coding it and freeing fs->vqs explicitly? Something like
the following:

@@ -896,7 +896,7 @@ static int virtio_fs_probe(struct virtio
 out_vqs:
vdev->config->reset(vdev);
virtio_fs_cleanup_vqs(vdev, fs);
-
+   kfree(fs->vqs);
 out:
vdev->priv = NULL;
kfree(fs);

Thanks
Vivek



Re: [RFC PATCH] fuse: Clear SGID bit when setting mode in setacl

2021-03-02 Thread Vivek Goyal
On Tue, Mar 02, 2021 at 11:00:33AM -0500, Vivek Goyal wrote:
> On Mon, Mar 01, 2021 at 06:20:30PM +, Luis Henriques wrote:
> > On Mon, Mar 01, 2021 at 11:33:24AM -0500, Vivek Goyal wrote:
> > > On Fri, Feb 26, 2021 at 06:33:57PM +, Luis Henriques wrote:
> > > > Setting file permissions with POSIX ACLs (setxattr) isn't clearing the
> > > > setgid bit.  This seems to be CVE-2016-7097, detected by running fstest
> > > > generic/375 in virtiofs.  Unfortunately, when the fix for this CVE 
> > > > landed
> > > > in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when
> > > > setting file permissions"), FUSE didn't had ACLs support yet.
> > > 
> > > Hi Luis,
> > > 
> > > Interesting. I did not know that "chmod" can lead to clearing of SGID
> > > as well. Recently we implemented FUSE_HANDLE_KILLPRIV_V2 flag which
> > > means that file server is responsible for clearing of SUID/SGID/caps
> > > as per following rules.
> > > 
> > > - caps are always cleared on chown/write/truncate
> > > - suid is always cleared on chown, while for truncate/write it is 
> > > cleared
> > >   only if caller does not have CAP_FSETID.
> > > - sgid is always cleared on chown, while for truncate/write it is 
> > > cleared
> > >   only if caller does not have CAP_FSETID as well as file has group 
> > > execute
> > >   permission.
> > > 
> > > And we don't have anything about "chmod" in this list. Well, I will test
> > > this and come back to this little later.
> > > 
> > > I see following comment in fuse_set_acl().
> > > 
> > > /*
> > >  * Fuse userspace is responsible for updating access
> > >  * permissions in the inode, if needed. fuse_setxattr
> > >  * invalidates the inode attributes, which will force
> > >  * them to be refreshed the next time they are used,
> > >  * and it also updates i_ctime.
> > >  */
> > > 
> > > So looks like that original code has been written with intent that
> > > file server is responsible for updating inode permissions. I am
> > > assuming this will include clearing of S_ISGID if needed.
> > > 
> > > But question is, does file server has enough information to be able
> > > to handle proper clearing of S_ISGID info. IIUC, file server will need
> > > two pieces of information atleast.
> > > 
> > > - gid of the caller.
> > > - Whether caller has CAP_FSETID or not.
> > > 
> > > I think we have first piece of information but not the second one. May
> > > be we need to send this in fuse_setxattr_in->flags. And file server
> > > can drop CAP_FSETID while doing setxattr().
> > > 
> > > What about "gid" info. We don't change to caller's uid/gid while doing
> > > setxattr(). So host might not clear S_ISGID or clear it when it should
> > > not. I am wondering that can we switch to caller's uid/gid in setxattr(),
> > > atleast while setting acls.
> > 
> > Thank for looking into this.  To be honest, initially I thought that the
> > fix should be done in the server too, but when I looked into the code I
> > couldn't find an easy way to get that done (without modifying the data
> > being passed from the kernel in setxattr).
> > 
> > So, what I've done was to look at what other filesystems were doing in the
> > ACL code, and that's where I found out about this CVE.  The CVE fix for
> > the other filesystems looked easy enough to be included in FUSE too.
> 
> Hi Luis,
> 
> I still feel that it should probably be fixed in virtiofsd, given fuse client
> is expecting file server to take care of any change of mode (file
> permission bits).

Having said that, there is one disadvantage of relying on the server to
do this. Now idmapped mount patches have been merged. If virtiofs
were to ever support idmapped mounts, this would become an issue.
The server does not know about idmapped mounts, and it does not have
the information on how to shift the inode gid to determine if SGID should
be cleared or not.

So if we keep possible future support of idmapped mounts in mind,
then solving it in the client makes more sense. (/me is afraid that there
might be other dependencies like this elsewhere.)

Miklos, WDYT.

Thanks
Vivek

> 
> I wrote a proof of concept patch and this should fix this. But it
> drop CAP_FSETID always. So I will nee

Re: [RFC PATCH] fuse: Clear SGID bit when setting mode in setacl

2021-03-02 Thread Vivek Goyal
On Mon, Mar 01, 2021 at 06:20:30PM +, Luis Henriques wrote:
> On Mon, Mar 01, 2021 at 11:33:24AM -0500, Vivek Goyal wrote:
> > On Fri, Feb 26, 2021 at 06:33:57PM +, Luis Henriques wrote:
> > > Setting file permissions with POSIX ACLs (setxattr) isn't clearing the
> > > setgid bit.  This seems to be CVE-2016-7097, detected by running fstest
> > > generic/375 in virtiofs.  Unfortunately, when the fix for this CVE landed
> > > in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when
> > > setting file permissions"), FUSE didn't had ACLs support yet.
> > 
> > Hi Luis,
> > 
> > Interesting. I did not know that "chmod" can lead to clearing of SGID
> > as well. Recently we implemented FUSE_HANDLE_KILLPRIV_V2 flag which
> > means that file server is responsible for clearing of SUID/SGID/caps
> > as per following rules.
> > 
> > - caps are always cleared on chown/write/truncate
> > - suid is always cleared on chown, while for truncate/write it is 
> > cleared
> >   only if caller does not have CAP_FSETID.
> > - sgid is always cleared on chown, while for truncate/write it is 
> > cleared
> >   only if caller does not have CAP_FSETID as well as file has group 
> > execute
> >   permission.
> > 
> > And we don't have anything about "chmod" in this list. Well, I will test
> > this and come back to this little later.
> > 
> > I see following comment in fuse_set_acl().
> > 
> > /*
> >  * Fuse userspace is responsible for updating access
> >  * permissions in the inode, if needed. fuse_setxattr
> >  * invalidates the inode attributes, which will force
> >  * them to be refreshed the next time they are used,
> >  * and it also updates i_ctime.
> >  */
> > 
> > So looks like that original code has been written with intent that
> > file server is responsible for updating inode permissions. I am
> > assuming this will include clearing of S_ISGID if needed.
> > 
> > But question is, does file server has enough information to be able
> > to handle proper clearing of S_ISGID info. IIUC, file server will need
> > two pieces of information atleast.
> > 
> > - gid of the caller.
> > - Whether caller has CAP_FSETID or not.
> > 
> > I think we have first piece of information but not the second one. May
> > be we need to send this in fuse_setxattr_in->flags. And file server
> > can drop CAP_FSETID while doing setxattr().
> > 
> > What about "gid" info. We don't change to caller's uid/gid while doing
> > setxattr(). So host might not clear S_ISGID or clear it when it should
> > not. I am wondering that can we switch to caller's uid/gid in setxattr(),
> > atleast while setting acls.
> 
> Thank for looking into this.  To be honest, initially I thought that the
> fix should be done in the server too, but when I looked into the code I
> couldn't find an easy way to get that done (without modifying the data
> being passed from the kernel in setxattr).
> 
> So, what I've done was to look at what other filesystems were doing in the
> ACL code, and that's where I found out about this CVE.  The CVE fix for
> the other filesystems looked easy enough to be included in FUSE too.

Hi Luis,

I still feel that it should probably be fixed in virtiofsd, given that the
fuse client expects the file server to take care of any change of mode (file
permission bits).

I wrote a proof of concept patch and this should fix this. But it
drops CAP_FSETID always. So I will need to modify the kernel to pass
this information to the file server, and that should properly fix
generic/375.

Please have a look. This applies on top of fuse acl support V4 patches
I had posted. I have pushed all the patches on a temporary git branch
as well.

https://github.com/rhvgoyal/qemu/commits/acl-sgid

Vivek


Subject: virtiofsd: Switch creds, drop FSETID for system.posix_acl_access xattr

When POSIX access ACLs are set on a file, it can lead to adjusting the file
permissions (mode) as well. If the caller does not have CAP_FSETID and it
also is not a member of the owner group, this will lead to clearing
the SGID bit in the mode.

Current fuse code is written in such a way that it expects file server
to take care of chaning file mode (permission), if there is a need.
Right now, host kernel does not clear SGID bit because virtiofsd is
running as root and has CAP_FSETID. For host kernel to clear SGID,
virtiofsd need to switch to gid of caller in guest and also drop
CAP_FSETID (if caller did not have it to begin with).

This is a p
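
(A minimal userspace sketch of the credential switch described above, assuming
libcap and a raw setresgid syscall; the helper name and the error handling are
illustrative only, and this is not the actual virtiofsd change.)

#include <sys/capability.h>
#include <sys/syscall.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Hypothetical helper: do setxattr() with the guest caller's egid and
 * without CAP_FSETID, so the host kernel applies its own SGID-clearing
 * rules. */
static int setxattr_as_caller(const char *path, const char *name,
                              const void *value, size_t size, gid_t caller_gid)
{
        cap_t caps = cap_get_proc();
        cap_value_t fsetid = CAP_FSETID;
        int ret;

        /* Drop CAP_FSETID from this thread's effective set. */
        cap_set_flag(caps, CAP_EFFECTIVE, 1, &fsetid, CAP_CLEAR);
        cap_set_proc(caps);

        /* Raw syscall so only this thread's egid changes; glibc setegid()
         * would change it for every thread in the process. */
        syscall(SYS_setresgid, -1, caller_gid, -1);

        ret = setxattr(path, name, value, size, 0);

        /* Restore egid and CAP_FSETID; a real implementation must check
         * all of the return values above and below. */
        syscall(SYS_setresgid, -1, 0, -1);
        cap_set_flag(caps, CAP_EFFECTIVE, 1, &fsetid, CAP_SET);
        cap_set_proc(caps);
        cap_free(caps);
        return ret;
}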

Re: [RFC PATCH] fuse: Clear SGID bit when setting mode in setacl

2021-03-02 Thread Vivek Goyal
On Mon, Mar 01, 2021 at 11:33:24AM -0500, Vivek Goyal wrote:
> On Fri, Feb 26, 2021 at 06:33:57PM +, Luis Henriques wrote:
> > Setting file permissions with POSIX ACLs (setxattr) isn't clearing the
> > setgid bit.  This seems to be CVE-2016-7097, detected by running fstest
> > generic/375 in virtiofs.  Unfortunately, when the fix for this CVE landed
> > in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when
> > setting file permissions"), FUSE didn't have ACL support yet.
> 
> Hi Luis,
> 
> Interesting. I did not know that "chmod" can lead to clearing of SGID
> as well. Recently we implemented FUSE_HANDLE_KILLPRIV_V2 flag which
> means that file server is responsible for clearing of SUID/SGID/caps
> as per following rules.
> 
> - caps are always cleared on chown/write/truncate
> - suid is always cleared on chown, while for truncate/write it is cleared
>   only if caller does not have CAP_FSETID.
> - sgid is always cleared on chown, while for truncate/write it is cleared
>   only if caller does not have CAP_FSETID as well as file has group 
> execute
>   permission.
> 
> And we don't have anything about "chmod" in this list. Well, I will test
> this and come back to this little later.

Looks like I did not notice the setattr_prepare() call in
fuse_do_setattr(), which clears SGID in the client itself, so the server does
not have to do anything extra. So it works.

IOW, FUSE_HANDLE_KILLPRIV_V2 will not handle this particular case; the
fuse client will clear SGID on chmod, if need be.
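
(For reference, a simplified sketch of the rule setattr_prepare() applies on a
mode change; this paraphrases fs/attr.c, it is not a verbatim copy, and the
helper name is made up.)

static void chmod_kill_sgid(struct inode *inode, struct iattr *attr)
{
        /* Clear SGID unless the caller is in the file's group or holds
         * CAP_FSETID; this mirrors what setattr_prepare() does for ATTR_MODE. */
        if (!in_group_p(inode->i_gid) &&
            !capable_wrt_inode_uidgid(inode, CAP_FSETID))
                attr->ia_mode &= ~S_ISGID;
}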

Vivek



Re: [RFC PATCH] fuse: Clear SGID bit when setting mode in setacl

2021-03-01 Thread Vivek Goyal
On Fri, Feb 26, 2021 at 06:33:57PM +, Luis Henriques wrote:
> Setting file permissions with POSIX ACLs (setxattr) isn't clearing the
> setgid bit.  This seems to be CVE-2016-7097, detected by running fstest
> generic/375 in virtiofs.  Unfortunately, when the fix for this CVE landed
> in the kernel with commit 073931017b49 ("posix_acl: Clear SGID bit when
> setting file permissions"), FUSE didn't have ACL support yet.

Hi Luis,

Interesting. I did not know that "chmod" can lead to clearing of SGID
as well. Recently we implemented the FUSE_HANDLE_KILLPRIV_V2 flag, which
means that the file server is responsible for clearing SUID/SGID/caps
as per the following rules.

- caps are always cleared on chown/write/truncate
- suid is always cleared on chown, while for truncate/write it is cleared
  only if the caller does not have CAP_FSETID.
- sgid is always cleared on chown, while for truncate/write it is cleared
  only if the caller does not have CAP_FSETID and the file has group execute
  permission.

And we don't have anything about "chmod" in this list. Well, I will test
this and come back to it a little later.
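
(A condensed, illustration-only view of the rules above as a server-side
helper; the function below is made up for this sketch and is not part of the
FUSE API.  Handling of the security.capability xattr is left out.)

#include <sys/stat.h>

static mode_t kill_priv_on_change(mode_t mode, int is_chown,
                                  int caller_has_fsetid)
{
        if (is_chown)                      /* chown always kills both bits */
                return mode & ~(S_ISUID | S_ISGID);

        if (!caller_has_fsetid) {          /* write/truncate path */
                mode &= ~S_ISUID;
                if (mode & S_IXGRP)        /* sgid only for group-exec files */
                        mode &= ~S_ISGID;
        }
        return mode;
}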

I see the following comment in fuse_set_acl().

/*
 * Fuse userspace is responsible for updating access
 * permissions in the inode, if needed. fuse_setxattr
 * invalidates the inode attributes, which will force
 * them to be refreshed the next time they are used,
 * and it also updates i_ctime.
 */

So it looks like the original code was written with the intent that the
file server is responsible for updating inode permissions. I am
assuming this includes clearing S_ISGID if needed.

But the question is, does the file server have enough information to be able
to handle proper clearing of S_ISGID? IIUC, the file server will need
at least two pieces of information:

- gid of the caller.
- Whether caller has CAP_FSETID or not.

I think we have the first piece of information but not the second one. Maybe
we need to send this in fuse_setxattr_in->flags. And the file server
can drop CAP_FSETID while doing setxattr().

What about "gid" info. We don't change to caller's uid/gid while doing
setxattr(). So host might not clear S_ISGID or clear it when it should
not. I am wondering that can we switch to caller's uid/gid in setxattr(),
atleast while setting acls.
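
(Purely as a sketch of that idea: fuse_setxattr_in already carries the xattr
flags (XATTR_CREATE/XATTR_REPLACE), and a spare bit could in principle tell
the server that the caller lacked CAP_FSETID.  The bit name below is invented
for this illustration and is not part of the FUSE ABI.)

/* Hypothetical, for illustration only. */
#define FUSE_SETXATTR_NO_FSETID   (1u << 31)

/* Guest side (fuse client), while building the setxattr request: */
if (!capable(CAP_FSETID))
        inarg.flags |= FUSE_SETXATTR_NO_FSETID;

/* The host side (virtiofsd) would then drop CAP_FSETID and switch to the
 * caller's gid before calling setxattr(), as sketched earlier in this
 * thread. */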

Thanks
Vivek

> 
> Signed-off-by: Luis Henriques 
> ---
>  fs/fuse/acl.c | 29 ++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
> index f529075a2ce8..1b273277c1c9 100644
> --- a/fs/fuse/acl.c
> +++ b/fs/fuse/acl.c
> @@ -54,7 +54,9 @@ int fuse_set_acl(struct inode *inode, struct posix_acl 
> *acl, int type)
>  {
>   struct fuse_conn *fc = get_fuse_conn(inode);
>   const char *name;
> + umode_t mode = inode->i_mode;
>   int ret;
> + bool update_mode = false;
>  
>   if (fuse_is_bad(inode))
>   return -EIO;
> @@ -62,11 +64,18 @@ int fuse_set_acl(struct inode *inode, struct posix_acl 
> *acl, int type)
>   if (!fc->posix_acl || fc->no_setxattr)
>   return -EOPNOTSUPP;
>  
> - if (type == ACL_TYPE_ACCESS)
> + if (type == ACL_TYPE_ACCESS) {
>   name = XATTR_NAME_POSIX_ACL_ACCESS;
> - else if (type == ACL_TYPE_DEFAULT)
> + if (acl) {
> + ret = posix_acl_update_mode(inode, &mode, &acl);
> + if (ret)
> + return ret;
> + if (inode->i_mode != mode)
> + update_mode = true;
> + }
> + } else if (type == ACL_TYPE_DEFAULT) {
>   name = XATTR_NAME_POSIX_ACL_DEFAULT;
> - else
> + } else
>   return -EINVAL;
>  
>   if (acl) {
> @@ -98,6 +107,20 @@ int fuse_set_acl(struct inode *inode, struct posix_acl 
> *acl, int type)
>   } else {
>   ret = fuse_removexattr(inode, name);
>   }
> + if (!ret && update_mode) {
> + struct dentry *entry;
> + struct iattr attr;
> +
> + entry = d_find_alias(inode);
> + if (entry) {
> + memset(&attr, 0, sizeof(attr));
> + attr.ia_valid = ATTR_MODE | ATTR_CTIME;
> + attr.ia_mode = mode;
> + attr.ia_ctime = current_time(inode);
> + ret = fuse_do_setattr(entry, &attr, NULL);
> + dput(entry);
> + }
> + }
>   forget_all_cached_acls(inode);
>   fuse_invalidate_attr(inode);
>  
> 



Re: [PATCH v3 1/1] kernel/crash_core: Add crashkernel=auto for vmcore creation

2021-02-17 Thread Vivek Goyal
On Wed, Feb 17, 2021 at 02:26:53PM -0500, Steven Rostedt wrote:
> On Wed, 17 Feb 2021 12:40:43 -0600
> john.p.donne...@oracle.com wrote:
> 
> > Hello.
> > 
> > Ping.
> > 
> > Can we get this reviewed and staged ?
> > 
> > Thank you.
> 
> Andrew,
> 
> Seems you are the only one pushing patches in for kexec/crash. Is this
> maintained by anyone?

Dave Young and Baoquan He still maintain the kexec/kdump stuff, AFAIK. I
don't get time to look into this stuff nowadays.

Vivek



Re: [PATCH 03/18] ovl: stack miscattr

2021-02-04 Thread Vivek Goyal
On Wed, Feb 03, 2021 at 01:40:57PM +0100, Miklos Szeredi wrote:
> Add stacking for the miscattr operations.
> 
> Signed-off-by: Miklos Szeredi 
> ---
>  fs/overlayfs/dir.c   |  2 ++
>  fs/overlayfs/inode.c | 43 
>  fs/overlayfs/overlayfs.h |  2 ++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 28a075b5f5b2..77c6b44f8d83 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -1300,4 +1300,6 @@ const struct inode_operations ovl_dir_inode_operations 
> = {
>   .listxattr  = ovl_listxattr,
>   .get_acl= ovl_get_acl,
>   .update_time= ovl_update_time,
> + .miscattr_get   = ovl_miscattr_get,
> + .miscattr_set   = ovl_miscattr_set,
>  };
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index d739e14c6814..97d36d1f28c3 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "overlayfs.h"
>  
>  
> @@ -495,6 +496,46 @@ static int ovl_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fieinfo,
>   return err;
>  }
>  
> +int ovl_miscattr_set(struct dentry *dentry, struct miscattr *ma)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct dentry *upperdentry;
> + const struct cred *old_cred;
> + int err;
> +
> + err = ovl_want_write(dentry);
> + if (err)
> + goto out;
> +
> + err = ovl_copy_up(dentry);
> + if (!err) {
> + upperdentry = ovl_dentry_upper(dentry);
> +
> + old_cred = ovl_override_creds(inode->i_sb);
> + /* err = security_file_ioctl(real.file, cmd, arg); */

Is this commented-out call intentional?

Vivek

> + err = vfs_miscattr_set(upperdentry, ma);
> + revert_creds(old_cred);
> + ovl_copyflags(ovl_inode_real(inode), inode);
> + }
> + ovl_drop_write(dentry);
> +out:
> + return err;
> +}
> +
> +int ovl_miscattr_get(struct dentry *dentry, struct miscattr *ma)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct dentry *realdentry = ovl_dentry_real(dentry);
> + const struct cred *old_cred;
> + int err;
> +
> + old_cred = ovl_override_creds(inode->i_sb);
> + err = vfs_miscattr_get(realdentry, ma);
> + revert_creds(old_cred);
> +
> + return err;
> +}
> +
>  static const struct inode_operations ovl_file_inode_operations = {
>   .setattr= ovl_setattr,
>   .permission = ovl_permission,
> @@ -503,6 +544,8 @@ static const struct inode_operations 
> ovl_file_inode_operations = {
>   .get_acl= ovl_get_acl,
>   .update_time= ovl_update_time,
>   .fiemap = ovl_fiemap,
> + .miscattr_get   = ovl_miscattr_get,
> + .miscattr_set   = ovl_miscattr_set,
>  };
>  
>  static const struct inode_operations ovl_symlink_inode_operations = {
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index b487e48c7fd4..d3ad02c34cca 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -509,6 +509,8 @@ int __init ovl_aio_request_cache_init(void);
>  void ovl_aio_request_cache_destroy(void);
>  long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
>  long ovl_compat_ioctl(struct file *file, unsigned int cmd, unsigned long 
> arg);
> +int ovl_miscattr_get(struct dentry *dentry, struct miscattr *ma);
> +int ovl_miscattr_set(struct dentry *dentry, struct miscattr *ma);
>  
>  /* copy_up.c */
>  int ovl_copy_up(struct dentry *dentry);
> -- 
> 2.26.2
> 



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-05 Thread Vivek Goyal
On Tue, Jan 05, 2021 at 09:11:23AM +0200, Amir Goldstein wrote:
> > >
> > > What I would rather see is:
> > > - Non-volatile: first syncfs in every container gets an error (nice to 
> > > have)
> >
> > I am not sure why we are making this behavior per container. This should
> > be no different from the current semantics we have for syncfs() on a regular
> > filesystem. And that will provide what you are looking for. If you
> > want a single error to be reported in all overlay mounts, then make
> > sure you have one fd open in each mount after mounting, then call syncfs()
> > on that fd.
> >
> 
> Ok.
> 
> > Not sure why overlayfs behavior/semantics should be any different
> > than what regular filesystems like ext4/xfs are offering. Once we
> > get page cache sharing sorted out with xfs reflink, then people
> > will not even need overlayfs and will be able to launch containers
> > just using xfs reflink and share the base image. In that case also
> > they will need to keep an fd open per container they want to
> > see an error in.
> >
> > So my patches provide exactly that. syncfs() behavior is the same with
> > overlayfs as applications get on other filesystems. And to me
> > it is important to keep the behavior the same.
> >
> > > - Volatile: every syncfs and every fsync in every container gets an error
> > >   (important IMO)
> >
> > For volatile mounts, I agree that we need to fail overlayfs instance
> > as soon as first error is detected since mount. And this applies to
> > not only syncfs()/fsync() but to read/write and other operations too.
> >
> > For that we will need additional patches which are floating around
> > to keep errseq sample in overlay and check for errors in all
> > paths syncfs/fsync/read/write/ and fail fs.
> 
> > But these patches build on top of my patches.
> 
> Here we disagree.
> 
> I don't see how Jeff's patch is "building on top of your patches"
> seeing that it is perfectly well contained and does not in fact depend
> on your patches.

Jeff's patches solve the problem only for volatile mounts, and they
propagate the error to the overlayfs sb.

My patches solve the issue both for volatile and non-volatile mounts,
and solve it using the same method, so there is no
confusion.

So there are multiple pieces to this puzzle and IMHO, it probably
should be fixed in this order.

A. First fix the syncfs() path to return an error both for volatile as
   well as non-volatile mounts.

B. Then add patches to fail the filesystem for volatile mounts as soon
   as the first error is detected (either in the syncfs path or in other paths
   like read/write/...). This will probably require saving an errseq sample
   in ovl_fs, then comparing it with upper_sb in critical paths and failing
   the filesystem as soon as an error is detected.

C. Finally fix the issues related to mount/remount error detection which
   Sargun wants to fix. This will be largely solved by B, except for
   saving errseq on disk.

My patches should fix the first problem. And more patches can be
applied on top to fix issues B and C.

Now if we agree with this, in this context I see that fixing problems
B and C builds on top of my patches, which fix problem A.
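
(A rough kernel-style sketch of item B, assuming ovl_fs grows an errseq_t
sample taken from the upper sb at mount time; the field and helper names are
made up here and this is not the actual patch set.)

static int ovl_check_upper_wb_err(struct super_block *ovl_sb)
{
        struct ovl_fs *ofs = ovl_sb->s_fs_info;
        struct super_block *upper_sb = ovl_upper_mnt(ofs)->mnt_sb;

        /* errseq_check() only compares against the mount-time sample,
         * it does not mark the error as seen. */
        return errseq_check(&upper_sb->s_wb_err, ofs->mount_errseq);
}

The volatile-mount read/write/syncfs paths would call this and fail the
overlay instance (for example with -EIO) once the upper fs has hit a
writeback error since the mount.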

> 
> And I do insist that the fix for volatile mounts syncfs/fsync error
> reporting should be applied before your patches or at the very least
> not heavily depend on them.

I still don't understand why volatile syncfs() error reporting
is more important than non-volatile syncfs(). But I will stop harping
on this point now.

My issue with Jeff's patches is that syncfs() error reporting should
be dealt with in the same way for both volatile and non-volatile mounts. That
is, compare file->f_sb_err and upper_sb->s_wb_err to figure out if
there is an error to report to user space. Currently these patches
only solve the problem for volatile mounts and use propagation to the
overlay sb, which conflicts with the non-volatile case.

IIUC, your primary concern with volatile mounts is that you want to
detect a writeback error as soon as it happens and flag it to the container
manager, so that the container manager can stop the container, throw away
the upper layer and restart from scratch. If yes, what you want can
be solved by solving problem B and backporting it to the LTS kernel.
I think the patches for that will be well contained within overlayfs
(no VFS changes) and should be relatively easy to backport.

IOW, backportability to the LTS kernel should not be a concern/blocker
for my patch series, which fixes the syncfs() issue for overlayfs.

Thanks
Vivek

> 
> volatile mount was introduced in fresh new v5.10, which is also an
> LTS kernel. It would be inconsiderate of volatile mount users and developers
> to make backporting that fix to v5.10.y any harder than it should be.

> 
> > My patches don't solve this problem of failing overlay mount for
> > the volatile mount case.
> >
> 
> Here we agree.
> 
> > >
> > > This is why I prefer to sample upper sb error on mount and propagate
> > > new errors to overlayfs sb (Jeff's patch).
> >
> > Ok, I think 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Jan 04, 2021 at 11:42:51PM +0200, Amir Goldstein wrote:
> On Mon, Jan 4, 2021 at 5:40 PM Vivek Goyal  wrote:
> >
> > On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > > > Since Jeff's patch is minimal, I think that it should be the fix 
> > > > > applied
> > > > > first and proposed for stable (with adaptations for non-volatile 
> > > > > overlay).
> > > >
> > > > Does the stable fix have to be the same as the mainline fix? IOW, I think at least in
> > > > mainline we should first fix it the right way and then think how to fix
> > > > it for stable. If fixes taken in mainline are not realistic for stable,
> > > > can we push a different small fix just for stable?
> > >
> > > We can do a lot of things.
> > > But if we are able to create a series with minimal (and most critical) 
> > > fixes
> > > followed by other fixes, it would be easier for everyone involved.
> >
> > I am not sure this is really critical. Writeback error reporting for
> > overlayfs has been broken since the beginning for regular mounts. There is no
> > notion of these errors being reported to user space. If that did not
> > create a major issue, then why do volatile mounts suddenly make it
> > a critical issue?
> >
> 
> Volatile mounts didn't make this a critical issue.
> But this discussion made us notice a mildly serious issue.
> It is not surprising to me that users did not report this issue.
> Do you know what it takes for a user to notice that writeback had failed,
> but an application did fsync and error did not get reported?
> Filesystem durability guaranties are hard to prove especially with so
> many subsystem layers and with fsync that does return an error correctly.
> I once found a durability bug in fsync of xfs that existed for 12 years.
> That fact does not at all make it any less critical.
> 
> > To me we should fix the issue properly, in a way that is easy to maintain
> > down the line, and then worry about doing a stable fix if need be.
> >
> > >
> > > >
> > > > IOW, because we have to push a fix in stable, should not determine
> > > > what should be problem solution for mainline, IMHO.
> > > >
> > >
> > > I find in this case there is a correlation between the simplest fix and 
> > > the
> > > most relevant fix for stable.
> > >
> > > > The problem I have with Jeff's fix is that it only works for volatile
> > > > mounts. While I prefer a solution where syncfs() is fixed both for
> > > > volatile as well as non-volatile mounts and then there is less confusion.
> > > >
> > >
> > > I proposed a variation on Jeff's patch that covers both cases.
> > > Sargun is going to work on it.
> >
> > What's the problem with my patches, which fix syncfs() error reporting
> > for overlayfs for both volatile and non-volatile mounts?
> >
> 
> - mount 1000 overlays
> - 1 writeback error recorded in upper sb
> - syncfs (new fd) inside each of the 1000 containers
> 
> With your patch 3/3 only one syncfs will report an error for
> both volatile and non-volatile cases. Right?

Right. If you don't have an old fd open in each container, then only
one container will see the error. If you want to see the error in each
container, then one fd needs to be kept open in each container
before the error happens, and syncfs() called on that fd; then each
container should see the error.
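
(A minimal userspace illustration of the "keep one fd open per mount" pattern
described above; the mount point path is only an example.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Open the fd right after the overlay is mounted, before any
         * writeback error can happen. */
        int fd = open("/merged", O_RDONLY | O_DIRECTORY);

        if (fd < 0)
                return 1;

        /* ... container workload runs, writeback may fail ... */

        /* Every mount that kept such an fd open gets the error reported
         * once on its own syncfs() call. */
        if (syncfs(fd) < 0)
                perror("syncfs");

        close(fd);
        return 0;
}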

> 
> What I would rather see is:
> - Non-volatile: first syncfs in every container gets an error (nice to have)

I am not sure why we are making this behavior per container. This should
be no different from the current semantics we have for syncfs() on a regular
filesystem. And that will provide what you are looking for. If you
want a single error to be reported in all overlay mounts, then make
sure you have one fd open in each mount after mounting, then call syncfs()
on that fd.

Not sure why overlayfs behavior/semantics should be any different
than what regular filesystems like ext4/xfs are offering. Once we
get page cache sharing sorted out with xfs reflink, then people
will not even need overlayfs and will be able to launch containers
just using xfs reflink and share the base image. In that case also
they will need to keep an fd open per container they want to
see an error in.

So my patches provide exactly that. syncfs() behavior is the same with
overlayfs as applications get on other filesystems. And to me
it is important to keep the behavior the same.

> - Volatile: every syncfs and every fsync in every container gets an error
>   (important IMO)

For volatile mounts, I agree that we need to fail overlayfs instance
as soon as first error is detected since mount. And this applies to
not only syncfs()/fsync() but to read/write and other operations too.

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > Currently syncfs() and fsync() seem to be two interfaces which check and
> > return writeback errors on superblock to user space. fsync() should
> > work fine with overlayfs as it relies on underlying filesystem to
> > do the check and return error. For example, if ext4 is on upper filesystem,
> > then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
> > upper file and returns error. So overlayfs does not have to do anything
> > special.
> > 
> > But with syncfs(), error check happens in vfs in syncfs() w.r.t
> > overlay_sb->s_wb_err. Given overlayfs is stacked filesystem, it
> > does not do actual writeback and all writeback errors are recorded
> > on underlying filesystem. So sb->s_wb_err is never updated hence
> > syncfs() does not work with overlay.
> > 
> > Jeff suggested that instead of trying to propagate errors to overlay
> > super block, why not simply check for errors against upper filesystem
> > super block. I implemented this idea.
> > 
> > Overlay file has "since" value which needs to be initialized at open
> > time. Overlay overrides VFS initialization and re-initializes
> > f->f_sb_err w.r.t upper super block. Later when
> > ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as
> > since value to figure out if any error on upper sb has happened since
> > then.
> > 
> > Note, Right now this patch only deals with regular file and directories.
> > Yet to deal with special files like device inodes, socket, fifo etc.
> > 
> > Suggested-by: Jeff Layton 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/overlayfs/file.c  |  1 +
> >  fs/overlayfs/overlayfs.h |  1 +
> >  fs/overlayfs/readdir.c   |  1 +
> >  fs/overlayfs/super.c | 23 +++
> >  fs/overlayfs/util.c  | 13 +
> >  5 files changed, 39 insertions(+)
> > 
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index efccb7c1f9bc..7b58a44dcb71 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file 
> > *file)
> > return PTR_ERR(realfile);
> >  
> > file->private_data = realfile;
> > +   ovl_init_file_errseq(file);
> >  
> > return 0;
> >  }
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index f8880aa2ba0e..47838abbfb3d 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct 
> > dentry *dentry);
> >  bool ovl_is_metacopy_dentry(struct dentry *dentry);
> >  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
> >  int padding);
> > +void ovl_init_file_errseq(struct file *file);
> >  
> >  static inline bool ovl_is_impuredir(struct super_block *sb,
> > struct dentry *dentry)
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 01620ebae1bd..0c48f1545483 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct 
> > file *file)
> > od->is_real = ovl_dir_is_real(file->f_path.dentry);
> > od->is_upper = OVL_TYPE_UPPER(type);
> > file->private_data = od;
> > +   ovl_init_file_errseq(file);
> >  
> > return 0;
> >  }
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..d99867983722 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int 
> > *flags, char *data)
> > return ret;
> >  }
> >  
> > +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> > *file)
> > +{
> > +   struct ovl_fs *ofs = sb->s_fs_info;
> > +   struct super_block *upper_sb;
> > +   int ret;
> > +
> > +   if (!ovl_upper_mnt(ofs))
> > +   return 0;
> > +
> > +   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +
> > +   if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
> > +   return 0;
> > +
> > +   /* Something changed, must use slow path */
> > +   spin_lock(&file->f_lock);
> > +   ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);

Re: [PATCH 2/3] vfs: Add a super block operation to check for writeback errors

2021-01-04 Thread Vivek Goyal
On Wed, Dec 23, 2020 at 07:48:52AM -0500, Jeff Layton wrote:
> On Mon, 2020-12-21 at 14:50 -0500, Vivek Goyal wrote:
> > Right now we check for errors on super block in syncfs().
> > 
> > ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
> > 
> > overlayfs does not update sb->s_wb_err and it is tracked on upper 
> > filesystem.
> > So provide a superblock operation to check errors so that filesystem
> > can provide override generic method and provide its own method to
> > check for writeback errors.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/sync.c  | 5 -
> >  include/linux/fs.h | 1 +
> >  2 files changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/sync.c b/fs/sync.c
> > index b5fb83a734cd..57e43a16dfca 100644
> > --- a/fs/sync.c
> > +++ b/fs/sync.c
> > @@ -176,7 +176,10 @@ SYSCALL_DEFINE1(syncfs, int, fd)
> >     ret = sync_filesystem(sb);
> > up_read(&sb->s_umount);
> >  
> > 
> > -   ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
> > +   if (sb->s_op->errseq_check_advance)
> > +   ret2 = sb->s_op->errseq_check_advance(sb, f.file);
> > +   else
> > +   ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
> >  
> > 
> >     fdput(f);
> >     return ret ? ret : ret2;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8667d0cdc71e..4297b6127adf 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1965,6 +1965,7 @@ struct super_operations {
> >   struct shrink_control *);
> >     long (*free_cached_objects)(struct super_block *,
> >     struct shrink_control *);
> > +   int (*errseq_check_advance)(struct super_block *, struct file *);
> >  };
> >  
> > 
> >  /*
> 
> Also, the other super_operations generally don't take a superblock
> pointer when you pass in a different fs object pointer. This should
> probably just take a struct file * and then the operation can chase
> pointers to the superblock from there.

Ok, I will drop super_block * argument and just pass in "struct file *".

Vivek

>  
> -- 
> Jeff Layton 
> 



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Dec 28, 2020 at 03:56:18PM +, Matthew Wilcox wrote:
> On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > overlayfs mount using the same upper sb before your open occurred.
> > 
> > IOW, consider two overlayfs mounts using the same upper layer sb:
> > 
> > ovlfs1  ovlfs2
> > --
> > mount
> > open fd1
> > write to fd1
> > 
> > mount (upper errseq_t SEEN flag marked)
> > open fd2
> > syncfs(fd2)
> > syncfs(fd1)
> > 
> > 
> > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > calls. The first one has a sample from before the error occurred, and
> > the second one has a sample of 0, due to the fact that the error was
> > unseen at open time.
> > 
> > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > two, then we can ensure that they both still get an error in this
> > situation.
> 
> But do we need to?  If the inode has been evicted we also lose the errno.

That's for the case of fsync(), right? For the case of syncfs() we will
not lose the error, as it's stored in the super_block.

Even for the case of fsync(), the inode can be evicted only if no other
fd is open for the file. So in the above example, fd1 is open so the
inode can't be evicted, which means we will see the error on syncfs(fd2)
and not lose it.

So if we start consuming the upper fs error on overlay mount(), it will be a
change of behavior for applications using the same upper fs. So far
overlay mount() does not consume the unseen error, and even if an fd
is opened after the error, the application will see the error on the super
block. If we consume the error on mount(), we change behavior.

I am not saying that's necessarily bad, I am just trying to point
out that it's a user-space-visible behavior change, and I'm worried
that somebody might start calling it a regression.
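
(To make that concrete, a kernel-style sketch of the two sampling strategies;
the errseq helpers are the real ones, the surrounding variables are assumed.)

        errseq_t since;
        int err;

        /* Plain sample at overlay mount time: if the upper sb holds an
         * unseen error, errseq_sample() returns 0, so a later
         * errseq_check_and_advance(&upper_sb->s_wb_err, &since) still
         * reports it, and other openers of the upper fs keep seeing it. */
        since = errseq_sample(&upper_sb->s_wb_err);

        /* Check-and-advance at mount time instead: this marks the error
         * as seen, so an application opening a file on the upper fs
         * *after* the overlay mount no longer sees the old error -- the
         * user-visible behavior change described above. */
        err = errseq_check_and_advance(&upper_sb->s_wb_err, &since);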

Anyway, it looks like two problems got mixed into the same thread. One
problem we need to solve is that syncfs() on overlayfs should
report writeback errors (as well as other errors) back to applications.
And that's what this patch series is solving.

And then the second issue is detecting writeback errors across remount
for volatile mounts. And that's where the question comes up of whether
we should split the SEEN flag or simply consume the error on
mount. So this can be discussed further when patches for these
changes are posted again.

For now, I will focus on trying to fix the first issue and post patches
for that again after more testing.

Vivek


> The guarantee we provide is that a fd that was open before the error
> occurred will see the error.  An fd that's opened after the error occurred
> may or may not see the error.



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Dec 28, 2020 at 05:51:06PM +0200, Amir Goldstein wrote:
> On Mon, Dec 28, 2020 at 3:25 PM Jeff Layton  wrote:
> >
> > On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:
> > > On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > > > > In current master, syncfs() on any file by any container user will
> > > > > result in full syncfs() of the upperfs, which is very bad for 
> > > > > container
> > > > > isolation. This has been partly fixed by Chengguang Xu [1] and I 
> > > > > expect
> > > > > his work will be merged soon. Overlayfs still does not do the 
> > > > > writeback
> > > > > and syncfs() in overlay still waits for all upper fs writeback to 
> > > > > complete,
> > > > > but at least syncfs() in overlay only kicks writeback for upper fs 
> > > > > files
> > > > > dirtied by this overlay.
> > > > >
> > > > > [1] 
> > > > > https://lore.kernel.org/linux-unionfs/cajfpegsbb4itxw8zyurfvnc63zg7ku7vzpsnuzhasyzh-d5...@mail.gmail.com/
> > > > >
> > > > > Sharing the same SEEN flag among thousands of containers is also
> > > > > far from ideal, because effectively this means that any given workload
> > > > > in any single container has very little chance of observing the SEEN 
> > > > > flag.
> > > >
> > > > Perhaps you misunderstand how errseq works.  If each container samples
> > > > the errseq at startup, then they will all see any error which occurs
> > > > during their lifespan
> > >
> > > Meant to say "...very little chance of NOT observing the SEEN flag",
> > > but We are not in disagreement.
> > > My argument against sharing the SEEN flag refers to Vivek's patch of
> > > stacked errseq_sample()/errseq_check_and_advance() which does NOT
> > > sample errseq at overlayfs mount time. That is why my next sentence is:
> > > "I do agree with Matthew that overlayfs should sample errseq...".
> > >
> > > > (and possibly an error which occurred before they started up).
> > > >
> > >
> > > Right. And this is where the discussion of splitting the SEEN flag 
> > > started.
> > > Some of us want to treat overlayfs mount time as a true epoc for errseq.
> > > The new container didn't write any files yet, so it should not care about
> > > writeback errors from the past.
> > >
> > > I agree that it may not be very critical, but as I wrote before, I think 
> > > we
> > > should do our best to try and isolate container workloads.
> > >
> > > > > To this end, I do agree with Matthew that overlayfs should sample 
> > > > > errseq
> > > > > and the best patchset to implement it so far IMO is Jeff's patchset 
> > > > > [2].
> > > > > This patch set was written to cater only "volatile" overlayfs mount, 
> > > > > but
> > > > > there is no reason not to use the same mechanism for regular overlay
> > > > > mount. The only difference being that "volatile" overlay only checks 
> > > > > for
> > > > > error since mount on syncfs() (because "volatile" overlay does NOT
> > > > > syncfs upper fs) and regular overlay checks and advances the overlay's
> > > > > errseq sample on syncfs (and does syncfs upper fs).
> > > > >
> > > > > Matthew, I hope that my explanation of the use case and Jeff's answer
> > > > > is sufficient to understand why the split of the SEEN flag is needed.
> > > > >
> > > > > [2] 
> > > > > https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlay...@kernel.org/
> > > >
> > > > No, it still feels weird and wrong.
> > > >
> > >
> > > All right. Considering your reservations, I think perhaps the split of the
> > > SEEN flag can wait for a later time after more discussions and maybe
> > > not as suitable for stable as we thought.
> > >
> > > I think that for stable, it would be sufficient to adapt Sargun's original
> > > syncfs for volatile mount patch [1] to cover the non-volatile case:
> > > on mount:
> > > - errseq_sample() upper fs
> > > - on volatile mount, errseq_check() upper fs and fail mount on un-SEEN 
> > > error
> > > on syncfs:
> > > - errseq_check() for volatile mount
> > > - errseq_check_and_advance() for non-volatile mount
> > > - errseq_set() overlay sb on upper fs error
> > >
> > > Now errseq_set() is not only a hack around __sync_filesystem ignoring
> > > return value of ->sync_fs(). It is really needed for per-overlay SEEN
> > > error isolation in the non-volatile case.
> > >
> > > Unless I am missing something, I think we do not strictly need Vivek's
> > > 1/3 patch [2] for stable, but not sure.
> > >
> > > Sargun,
> > >
> > > Do you agree with the above proposal?
> > > Will you make it into a patch?
> > >
> > > Vivek, Jefff,
> > >
> > > Do you agree that overlay syncfs observing writeback errors that predate
> > > overlay mount time is an issue that can be deferred (maybe forever)?
> > >
> >
> > That's very application dependent.
> >
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > Since Jeff's patch is minimal, I think that it should be the fix applied
> > > first and proposed for stable (with adaptations for non-volatile overlay).
> >
> > Does the stable fix have to be the same as the mainline fix? IOW, I think at least in
> > mainline we should first fix it the right way and then think how to fix
> > it for stable. If fixes taken in mainline are not realistic for stable,
> > can we push a different small fix just for stable?
> 
> We can do a lot of things.
> But if we are able to create a series with minimal (and most critical) fixes
> followed by other fixes, it would be easier for everyone involved.

I am not sure this is really critical. Writeback error reporting for
overlayfs has been broken since the beginning for regular mounts. There is no
notion of these errors being reported to user space. If that did not
create a major issue, then why do volatile mounts suddenly make it
a critical issue?

To me we should fix the issue properly, in a way that is easy to maintain
down the line, and then worry about doing a stable fix if need be.

> 
> >
> > IOW, because we have to push a fix in stable, should not determine
> > what should be problem solution for mainline, IMHO.
> >
> 
> I find in this case there is a correlation between the simplest fix and the
> most relevant fix for stable.
> 
> > The problem I have with Jeff's fix is that it only works for volatile
> > mounts. While I prefer a solution where syncfs() is fixed both for
> > volatile as well as non-volatile mounts and then there is less confusion.
> >
> 
> I proposed a variation on Jeff's patch that covers both cases.
> Sargun is going to work on it.

What's the problem with my patches, which fix syncfs() error reporting
for overlayfs for both volatile and non-volatile mounts?

Thanks
Vivek



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox  wrote:
> >
> > On Wed, Dec 23, 2020 at 08:21:41PM +, Sargun Dhillon wrote:
> > > On Wed, Dec 23, 2020 at 08:07:46PM +, Matthew Wilcox wrote:
> > > > On Wed, Dec 23, 2020 at 07:29:41PM +, Sargun Dhillon wrote:
> > > > > On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> > > > > > On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > > > > > > I fail to see why this is neccessary if you incorporate error 
> > > > > > > reporting into the
> > > > > > > sync_fs callback. Why is this separate from that callback? If you 
> > > > > > > pickup Jeff's
> > > > > > > patch that adds the 2nd flag to errseq for "observed", you should 
> > > > > > > be able to
> > > > > > > stash the first errseq seen in the ovl_fs struct, and do the 
> > > > > > > check-and-return
> > > > > > > in there instead instead of adding this new infrastructure.
> > > > > >
> > > > > > You still haven't explained why you want to add the "observed" flag.
> > > > >
> > > > >
> > > > > In the overlayfs model, many users may be using the same filesystem 
> > > > > (super block)
> > > > > for their upperdir. Let's say you have something like this:
> > > > >
> > > > > /workdir [Mounted FS]
> > > > > /workdir/upperdir1 [overlayfs upperdir]
> > > > > /workdir/upperdir2 [overlayfs upperdir]
> > > > > /workdir/userscratchspace
> > > > >
> > > > > The user needs to be able to do something like:
> > > > > sync -f ${overlayfs1}/file
> > > > >
> > > > > which in turn will call sync on the the underlying filesystem (the 
> > > > > one mounted
> > > > > on /workdir), and can check if the errseq has changed since the 
> > > > > overlayfs was
> > > > > mounted, and use that to return an error to the user.
> > > >
> > > > OK, but I don't see why the current scheme doesn't work for this.  If
> > > > (each instance of) overlayfs samples the errseq at mount time and then
> > > > check_and_advances it at sync time, it will see any error that has 
> > > > occurred
> > > > since the mount happened (and possibly also an error which occurred 
> > > > before
> > > > the mount happened, but hadn't been reported to anybody before).
> > > >
> > >
> > > If there is an outstanding error at mount time, and the SEEN flag is 
> > > unset,
> > > subsequent errors will not increment the counter, until the user calls 
> > > sync on
> > > the upperdir's filesystem. If overlayfs calls check_and_advance on the 
> > > upperdir's
> > > super block at any point, it will then set the seen block, and if the 
> > > user calls
> > > syncfs on the upperdir, it will not return that there is an outstanding 
> > > error,
> > > since overlayfs just cleared it.
> >
> > Your concern is this case:
> >
> > fs is mounted on /workdir
> > /workdir/A is written to and then closed.
> > writeback happens and -EIO happens, but there's nobody around to care.
> > /workdir/upperdir1 becomes part of an overlayfs mount
> > overlayfs samples the error
> > a user writes to /workdir/B, another -EIO occurs, but nothing happens
> > someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> > a user opens /workdir/B and calls syncfs, but sees no error
> >
> > do i have that right?  or is it something else?
> 
> IMO it is something else. Others may disagree.
> IMO the level of interference between users accessing overlay and users
> accessing upper fs directly is not well defined and it can stay this way.
> 
> Concurrent access to  /workdir/upperdir/A via overlay and underlying fs
> is explicitly warranted against in Documentation/filesystems/overlayfs.rst#
> Changes to underlying filesystems:
> "Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."

I think people use the same underlying filesystem both as the upper for
multiple overlayfs mounts and as the root filesystem. For example, when you
run podman (or docker), all containers share the same filesystem, and
other non-containerized apps use that same filesystem too.

IIUC, what we meant to say is that the lowerdir/workdir/upperdir being
used for an overlayfs mount should be left untouched. Right?

What I am trying to say is that while discussing this problem and its
solution, we should assume that a regular application might be using the
same upper fs as overlayfs. That seems to be a very common
operating model.

> 
> The question is whether syncfs(open(/workdir/B)) is considered
> "Changes to the underlying filesystems". Regardless of the answer,
> this is not an interesting case IMO.
> 
> The real issue is with interference between overlays that share the
> same upper fs, because this is by far and large the common use case
> that is creating real problems for a lot of container users.
> 
> Workloads running inside containers 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-22 Thread Vivek Goyal
On Tue, Dec 22, 2020 at 05:46:37PM +, Matthew Wilcox wrote:
> On Tue, Dec 22, 2020 at 11:29:25AM -0500, Vivek Goyal wrote:
> > On Tue, Dec 22, 2020 at 04:20:27PM +, Matthew Wilcox wrote:
> > > On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > > > +static int ovl_errseq_check_advance(struct super_block *sb, struct 
> > > > file *file)
> > > > +{
> > > > +   struct ovl_fs *ofs = sb->s_fs_info;
> > > > +   struct super_block *upper_sb;
> > > > +   int ret;
> > > > +
> > > > +   if (!ovl_upper_mnt(ofs))
> > > > +   return 0;
> > > > +
> > > > +   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > +
> > > > +   if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
> > > > +   return 0;
> > > > +
> > > > +   /* Something changed, must use slow path */
> > > > +   spin_lock(&file->f_lock);
> > > > +   ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
> > > > +   spin_unlock(&file->f_lock);
> > > 
> > > Why are you microoptimising syncfs()?  Are there really applications which
> > > call syncfs() in a massively parallel manner on the same file descriptor?
> > 
> > This is at least a theoretical race. I am not aware which application can
> > trigger this race. So to me it makes sense to fix the race.
> > 
> > Jeff Layton also posted a fix for syncfs().
> > 
> > https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlay...@kernel.org/
> > 
> > To me it makes sense to fix the race irrespective of whether somebody
> > hits it or not. People end up copying code into other parts of the kernel,
> > and they will at least copy race-free code.
> 
> Let me try again.  "Why are you trying to avoid taking the spinlock?"

Aha, sorry, I misunderstood your question. I don't have a good answer.
I just copied the code from Jeff Layton's patch.

Agreed that the cost of taking the spin lock will not be significant unless
syncfs() is called at high frequency. Having said that, most of the
time taking the spin lock will not be needed, so avoiding it with
a simple call to errseq_check() sounds reasonable too.

I don't have any strong opinions here. I am fine with whichever
implementation people like.

Vivek



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-22 Thread Vivek Goyal
On Tue, Dec 22, 2020 at 04:20:27PM +, Matthew Wilcox wrote:
> On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> > *file)
> > +{
> > +   struct ovl_fs *ofs = sb->s_fs_info;
> > +   struct super_block *upper_sb;
> > +   int ret;
> > +
> > +   if (!ovl_upper_mnt(ofs))
> > +   return 0;
> > +
> > +   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +
> > +   if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
> > +   return 0;
> > +
> > +   /* Something changed, must use slow path */
> > +   spin_lock(&file->f_lock);
> > +   ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
> > +   spin_unlock(&file->f_lock);
> 
> Why are you microoptimising syncfs()?  Are there really applications which
> call syncfs() in a massively parallel manner on the same file descriptor?

This is at least a theoretical race. I am not aware which application can
trigger this race. So to me it makes sense to fix the race.

Jeff Layton also posted a fix for syncfs().

https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlay...@kernel.org/

To me it makes sense to fix the race irrespective of whether somebody
hits it or not. People end up copying code into other parts of the kernel,
and they will at least copy race-free code.

Vivek



Re: [PATCH 2/3] vfs: Add a super block operation to check for writeback errors

2020-12-22 Thread Vivek Goyal
On Tue, Dec 22, 2020 at 04:19:00PM +, Matthew Wilcox wrote:
> On Mon, Dec 21, 2020 at 02:50:54PM -0500, Vivek Goyal wrote:
> > -   ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
> > +   if (sb->s_op->errseq_check_advance)
> > +   ret2 = sb->s_op->errseq_check_advance(sb, f.file);
> 
> What a terrible name for an fs operation.  You don't seem to be able
> to distinguish between semantics and implementation.  How about
> check_error()?

check_error() sounds better. I was not very happy with the name either;
I just wanted to start with something.

Vivek



Re: [PATCH 1/3] vfs: Do not ignore return code from s_op->sync_fs

2020-12-22 Thread Vivek Goyal
On Tue, Dec 22, 2020 at 12:23:11PM +1100, NeilBrown wrote:

[...]
> > diff --git a/fs/sync.c b/fs/sync.c
> > index 1373a610dc78..b5fb83a734cd 100644
> > --- a/fs/sync.c
> > +++ b/fs/sync.c
> > @@ -30,14 +30,18 @@
> >   */
> >  static int __sync_filesystem(struct super_block *sb, int wait)
> >  {
> > +   int ret, ret2;
> > +
> > if (wait)
> > sync_inodes_sb(sb);
> > else
> > writeback_inodes_sb(sb, WB_REASON_SYNC);
> >  
> > if (sb->s_op->sync_fs)
> > -   sb->s_op->sync_fs(sb, wait);
> > -   return __sync_blockdev(sb->s_bdev, wait);
> > +   ret = sb->s_op->sync_fs(sb, wait);
> > +   ret2 = __sync_blockdev(sb->s_bdev, wait);
> > +
> > +   return ret ? ret : ret2;
> 
> I'm surprised that the compiler didn't complain that 'ret' might be used
> uninitialized.

Indeed. That "ret" can be used uninitialized. Here is the fixed patch.


Subject: vfs: Do not ignore return code from s_op->sync_fs

Current implementation of __sync_filesystem() ignores the
return code from ->sync_fs(). I am not sure why that's the case.

Ignoring ->sync_fs() return code is problematic for overlayfs where
it can return error if sync_filesystem() on upper super block failed.
That error will simply be lost and syncfs(overlay_fd) will return
success (despite the fact that it failed).

Al Viro noticed that there are other filesystems which can sometimes
return error in ->sync_fs() and these errors will be ignored too.

fs/btrfs/super.c:2412:  .sync_fs= btrfs_sync_fs,
fs/exfat/super.c:204:   .sync_fs= exfat_sync_fs,
fs/ext4/super.c:1674:   .sync_fs= ext4_sync_fs,
fs/f2fs/super.c:2480:   .sync_fs= f2fs_sync_fs,
fs/gfs2/super.c:1600:   .sync_fs= gfs2_sync_fs,
fs/hfsplus/super.c:368: .sync_fs= hfsplus_sync_fs,
fs/nilfs2/super.c:689:  .sync_fs= nilfs_sync_fs,
fs/ocfs2/super.c:139:   .sync_fs= ocfs2_sync_fs,
fs/overlayfs/super.c:399:   .sync_fs= ovl_sync_fs,
fs/ubifs/super.c:2052:  .sync_fs   = ubifs_sync_fs,

Hence, this patch tries to fix it by capturing the error returned
by ->sync_fs() and returning it to the caller. I am specifically interested
in the syncfs() path and returning the error to the user.

I am assuming that we want to continue to call __sync_blockdev()
despite the fact that there have been errors reported from
->sync_fs(). So this patch continues to call __sync_blockdev()
even if ->sync_fs() returns an error.

Al noticed that there are a few other callsites where the ->sync_fs() error
code is being ignored. 

sync_fs_one_sb(): For this it seems desirable to ignore the return code.

dquot_disable(): Jan Kara mentioned that ignoring return code here is fine
 because we don't want to fail dquot_disable() just because
 caches might be incoherent.

dquot_quota_sync(): Jan thinks that it might make some sense to capture
return code here. But I am leaving it untouched for
   now. When somebody needs it, they can easily fix it.

Signed-off-by: Vivek Goyal 
---
 fs/sync.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Index: redhat-linux/fs/sync.c
===
--- redhat-linux.orig/fs/sync.c 2020-12-22 09:56:04.543483440 -0500
+++ redhat-linux/fs/sync.c  2020-12-22 10:01:28.560483440 -0500
@@ -30,14 +30,18 @@
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
+   int ret = 0, ret2;
+
if (wait)
sync_inodes_sb(sb);
else
writeback_inodes_sb(sb, WB_REASON_SYNC);
 
if (sb->s_op->sync_fs)
-   sb->s_op->sync_fs(sb, wait);
-   return __sync_blockdev(sb->s_bdev, wait);
+   ret = sb->s_op->sync_fs(sb, wait);
+   ret2 = __sync_blockdev(sb->s_bdev, wait);
+
+   return ret ? ret : ret2;
 }
 
 /*



[PATCH 2/3] vfs: Add a super block operation to check for writeback errors

2020-12-21 Thread Vivek Goyal
Right now we check for errors on super block in syncfs().

ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);

overlayfs does not update sb->s_wb_err; errors are tracked on the upper filesystem.
So provide a superblock operation to check errors, so that a filesystem
can override the generic method and provide its own way to
check for writeback errors.

Signed-off-by: Vivek Goyal 
---
 fs/sync.c  | 5 -
 include/linux/fs.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/sync.c b/fs/sync.c
index b5fb83a734cd..57e43a16dfca 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -176,7 +176,10 @@ SYSCALL_DEFINE1(syncfs, int, fd)
ret = sync_filesystem(sb);
up_read(&sb->s_umount);
 
-   ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
+   if (sb->s_op->errseq_check_advance)
+   ret2 = sb->s_op->errseq_check_advance(sb, f.file);
+   else
+   ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
 
fdput(f);
return ret ? ret : ret2;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8667d0cdc71e..4297b6127adf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1965,6 +1965,7 @@ struct super_operations {
  struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
+   int (*errseq_check_advance)(struct super_block *, struct file *);
 };
 
 /*
-- 
2.25.4



[RFC PATCH 0/3][v3] vfs, overlayfs: Fix syncfs() to return correct errors

2020-12-21 Thread Vivek Goyal
Hi,

This is v3 of patches which try to fix syncfs() error handling issues
w.r.t overlayfs and other filesystems.

Previous version of patches are here.
v2: 
https://lore.kernel.org/linux-fsdevel/20201216233149.39025-1-vgo...@redhat.com/
v1:
https://lore.kernel.org/linux-fsdevel/20201216143802.ga10...@redhat.com/

This series basically is trying to fix two problems.

- The first problem is that we ignore the error code returned by ->sync_fs().
  The overlayfs filesystem can return an error, and there are other
  filesystems which can return an error in certain cases. So to fix this issue,
  the first patch captures the return code from ->sync_fs() and returns it to
  user space.

- The second problem is that the current syncfs() writeback error detection
  logic does not work for overlayfs. The current logic relies on
  sb->s_wb_err being updated when errors occur, but that's not true for
  overlayfs. Real errors happen on the underlying filesystem and overlayfs
  has no clue about them. To fix this issue, it has been proposed
  that for filesystems like overlayfs, this check should be moved into
  the filesystem, which can then check for errors w.r.t. the upper super
  block.

  There seem to be multiple ways of how this can be done.

  A. Add a "struct file" argument to ->sync_fs() and modify all helpers.
  B. Add a separate file operation say "f_op->syncfs()" and call that
 in syncfs().
  C. Add a separate super block operation to check and advance errors.

Option A involves a lot of changes all across the code. It is also a little
problematic in the sense that for filesystems having a block device, it
looks like we want to check for errors after __sync_blockdev() has
returned. But ->sync_fs() is called before that. That means
__sync_blockdev() will have to be pushed inside the filesystem code as
well. Jeff Layton gave something like this a try here.

https://lore.kernel.org/linux-fsdevel/20180518123415.28181-1-jlay...@kernel.org/

I posted patches for option B in V2. 

https://lore.kernel.org/linux-fsdevel/20201216233149.39025-1-vgo...@redhat.com/

Now this is V3 of the patches, which implements option C. I think this is
the simplest in terms of implementation, at least.

These patches are only compile tested. I will do more testing once I get
a sense of which option has a chance to fly.

I think patch 1 should be applied irrespective of what option we end
up choosing for fixing the writeback error issue.

Thanks
Vivek

Vivek Goyal (3):
  vfs: Do not ignore return code from s_op->sync_fs
  vfs: Add a super block operation to check for writeback errors
  overlayfs: Report writeback errors on upper

 fs/overlayfs/file.c  |  1 +
 fs/overlayfs/overlayfs.h |  1 +
 fs/overlayfs/readdir.c   |  1 +
 fs/overlayfs/super.c | 23 +++
 fs/overlayfs/util.c  | 13 +
 fs/sync.c| 13 ++---
 include/linux/fs.h   |  1 +
 7 files changed, 50 insertions(+), 3 deletions(-)

-- 
2.25.4



[PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-21 Thread Vivek Goyal
Currently syncfs() and fsync() seem to be two interfaces which check and
return writeback errors on superblock to user space. fsync() should
work fine with overlayfs as it relies on underlying filesystem to
do the check and return error. For example, if ext4 is on upper filesystem,
then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
upper file and returns error. So overlayfs does not have to do anything
special.

But with syncfs(), the error check happens in the VFS w.r.t.
overlay_sb->s_wb_err. Given that overlayfs is a stacked filesystem, it
does not do the actual writeback, and all writeback errors are recorded
on the underlying filesystem. So sb->s_wb_err is never updated, hence
syncfs() does not work with overlayfs.

Jeff suggested that instead of trying to propagate errors to overlay
super block, why not simply check for errors against upper filesystem
super block. I implemented this idea.

An overlay file has a "since" value which needs to be initialized at open
time. Overlayfs overrides the VFS initialization and re-initializes
f->f_sb_err w.r.t. the upper super block. Later, when
ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as the
since value to figure out whether any error has happened on the upper sb
since then.

Note: right now this patch only deals with regular files and directories.
It does not yet deal with special files like device inodes, sockets, fifos, etc.

Suggested-by: Jeff Layton 
Signed-off-by: Vivek Goyal 
---
 fs/overlayfs/file.c  |  1 +
 fs/overlayfs/overlayfs.h |  1 +
 fs/overlayfs/readdir.c   |  1 +
 fs/overlayfs/super.c | 23 +++
 fs/overlayfs/util.c  | 13 +
 5 files changed, 39 insertions(+)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..7b58a44dcb71 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file *file)
return PTR_ERR(realfile);
 
file->private_data = realfile;
+   ovl_init_file_errseq(file);
 
return 0;
 }
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..47838abbfb3d 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct 
dentry *dentry);
 bool ovl_is_metacopy_dentry(struct dentry *dentry);
 char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 int padding);
+void ovl_init_file_errseq(struct file *file);
 
 static inline bool ovl_is_impuredir(struct super_block *sb,
struct dentry *dentry)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..0c48f1545483 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct file 
*file)
od->is_real = ovl_dir_is_real(file->f_path.dentry);
od->is_upper = OVL_TYPE_UPPER(type);
file->private_data = od;
+   ovl_init_file_errseq(file);
 
return 0;
 }
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..d99867983722 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int *flags, 
char *data)
return ret;
 }
 
+static int ovl_errseq_check_advance(struct super_block *sb, struct file *file)
+{
+   struct ovl_fs *ofs = sb->s_fs_info;
+   struct super_block *upper_sb;
+   int ret;
+
+   if (!ovl_upper_mnt(ofs))
+   return 0;
+
+   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
+
+   if (!errseq_check(&upper_sb->s_wb_err, file->f_sb_err))
+   return 0;
+
+   /* Something changed, must use slow path */
+   spin_lock(&file->f_lock);
+   ret = errseq_check_and_advance(&upper_sb->s_wb_err, &file->f_sb_err);
+   spin_unlock(&file->f_lock);
+
+   return ret;
+}
+
 static const struct super_operations ovl_super_operations = {
.alloc_inode= ovl_alloc_inode,
.free_inode = ovl_free_inode,
@@ -400,6 +422,7 @@ static const struct super_operations ovl_super_operations = 
{
.statfs = ovl_statfs,
.show_options   = ovl_show_options,
.remount_fs = ovl_remount,
+   .errseq_check_advance   = ovl_errseq_check_advance,
 };
 
 enum {
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..a1742847f3a8 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -950,3 +950,16 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct 
dentry *dentry,
kfree(buf);
return ERR_PTR(res);
 }
+
+void ovl_init_file_errseq(struct file *file)
+{
+   struct super_block *sb = file_dentry(file)->d_sb;
+   struct ovl_fs *ofs = sb->s_fs_info;
+   struct super_block *upper_sb;
+
+   if (!ovl_upper_mnt(ofs))
+   return;
+
+   upper_sb = ovl_upper_

[PATCH 1/3] vfs: Do not ignore return code from s_op->sync_fs

2020-12-21 Thread Vivek Goyal
Current implementation of __sync_filesystem() ignores the
return code from ->sync_fs(). I am not sure why that's the case.

Ignoring ->sync_fs() return code is problematic for overlayfs where
it can return error if sync_filesystem() on upper super block failed.
That error will simply be lost and syncfs(overlay_fd) will get
success (despite the fact it failed).

Al Viro noticed that there are other filesystems which can sometimes
return error in ->sync_fs() and these errors will be ignored too.

fs/btrfs/super.c:2412:  .sync_fs= btrfs_sync_fs,
fs/exfat/super.c:204:   .sync_fs= exfat_sync_fs,
fs/ext4/super.c:1674:   .sync_fs= ext4_sync_fs,
fs/f2fs/super.c:2480:   .sync_fs= f2fs_sync_fs,
fs/gfs2/super.c:1600:   .sync_fs= gfs2_sync_fs,
fs/hfsplus/super.c:368: .sync_fs= hfsplus_sync_fs,
fs/nilfs2/super.c:689:  .sync_fs= nilfs_sync_fs,
fs/ocfs2/super.c:139:   .sync_fs= ocfs2_sync_fs,
fs/overlayfs/super.c:399:   .sync_fs= ovl_sync_fs,
fs/ubifs/super.c:2052:  .sync_fs   = ubifs_sync_fs,

Hence, this patch tries to fix it: capture the error returned
by ->sync_fs() and return it to the caller. I am specifically interested
in the syncfs() path, to return the error to user space.

I am assuming that we want to continue to call __sync_blockdev()
despite the fact that there have been errors reported from
->sync_fs(). So this patch continues to call __sync_blockdev()
even if ->sync_fs() returns an error.

Al noticed that there are a few other callsites where the ->sync_fs() error
code is being ignored.

sync_fs_one_sb(): For this it seems desirable to ignore the return code.

dquot_disable(): Jan Kara mentioned that ignoring return code here is fine
 because we don't want to fail dquot_disable() just because
 caches might be incoherent.

dquot_quota_sync(): Jan thinks that it might make some sense to capture
return code here. But I am leaving it untouched for
   now. When somebody needs it, they can easily fix it.

Signed-off-by: Vivek Goyal 
---
 fs/sync.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 1373a610dc78..b5fb83a734cd 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -30,14 +30,18 @@
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
+   int ret = 0, ret2;
+
if (wait)
sync_inodes_sb(sb);
else
writeback_inodes_sb(sb, WB_REASON_SYNC);
 
if (sb->s_op->sync_fs)
-   sb->s_op->sync_fs(sb, wait);
-   return __sync_blockdev(sb->s_bdev, wait);
+   ret = sb->s_op->sync_fs(sb, wait);
+   ret2 = __sync_blockdev(sb->s_bdev, wait);
+
+   return ret ? ret : ret2;
 }
 
 /*
-- 
2.25.4



Re: [PATCH 3/3] overlayfs: Check writeback errors w.r.t upper in ->syncfs()

2020-12-18 Thread Vivek Goyal
On Fri, Dec 18, 2020 at 10:02:58AM -0500, Jeff Layton wrote:
> On Fri, Dec 18, 2020 at 09:44:18AM -0500, Vivek Goyal wrote:
> > On Thu, Dec 17, 2020 at 03:08:56PM -0500, Jeffrey Layton wrote:
> > > On Wed, Dec 16, 2020 at 06:31:49PM -0500, Vivek Goyal wrote:
> > > > Check for writeback error on overlay super block w.r.t "struct file"
> > > > passed in ->syncfs().
> > > > 
> > > > As of now real error happens on upper sb. So this patch first propagates
> > > > error from upper sb to overlay sb and then checks error w.r.t struct
> > > > file passed in.
> > > > 
> > > > Jeff, I know you prefer that I should rather file upper file and check
> > > > error directly on on upper sb w.r.t this real upper file.  While I was
> > > > implementing that I thought what if file is on lower (and has not been
> > > > copied up yet). In that case shall we not check writeback errors and
> > > > return back to user space? That does not sound right though because,
> > > > we are not checking for writeback errors on this file. Rather we
> > > > are checking for any error on superblock. Upper might have an error
> > > > and we should report it to user even if file in question is a lower
> > > > file. And that's why I fell back to this approach. But I am open to
> > > > change it if there are issues in this method.
> > > > 
> > > > Signed-off-by: Vivek Goyal 
> > > > ---
> > > >  fs/overlayfs/ovl_entry.h |  2 ++
> > > >  fs/overlayfs/super.c | 15 ---
> > > >  2 files changed, 14 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> > > > index 1b5a2094df8e..a08fd719ee7b 100644
> > > > --- a/fs/overlayfs/ovl_entry.h
> > > > +++ b/fs/overlayfs/ovl_entry.h
> > > > @@ -79,6 +79,8 @@ struct ovl_fs {
> > > > atomic_long_t last_ino;
> > > > /* Whiteout dentry cache */
> > > > struct dentry *whiteout;
> > > > +   /* Protects multiple sb->s_wb_err update from upper_sb . */
> > > > +   spinlock_t errseq_lock;
> > > >  };
> > > >  
> > > >  static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
> > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > index b4d92e6fa5ce..e7bc4492205e 100644
> > > > --- a/fs/overlayfs/super.c
> > > > +++ b/fs/overlayfs/super.c
> > > > @@ -291,7 +291,7 @@ int ovl_syncfs(struct file *file)
> > > > struct super_block *sb = file->f_path.dentry->d_sb;
> > > > struct ovl_fs *ofs = sb->s_fs_info;
> > > > struct super_block *upper_sb;
> > > > -   int ret;
> > > > +   int ret, ret2;
> > > >  
> > > > ret = 0;
> > > > down_read(&sb->s_umount);
> > > > @@ -310,10 +310,18 @@ int ovl_syncfs(struct file *file)
> > > > ret = sync_filesystem(upper_sb);
> > > > up_read(&upper_sb->s_umount);
> > > >  
> > > > +   /* Update overlay sb->s_wb_err */
> > > > +   if (errseq_check(&upper_sb->s_wb_err, sb->s_wb_err)) {
> > > > +   /* Upper sb has errors since last time */
> > > > +   spin_lock(&ofs->errseq_lock);
> > > > +   errseq_check_and_advance(&upper_sb->s_wb_err,
> > > > &sb->s_wb_err);
> > > > +   spin_unlock(&ofs->errseq_lock);
> > > > +   }
> > > 
> > > So, the problem here is that the resulting value in sb->s_wb_err is
> > > going to end up with the REPORTED flag set (using the naming in my
> > > latest set). So, a later opener of a file on sb->s_wb_err won't see it.
> > > 
> > > For instance, suppose you call sync() on the box and does the above
> > > check and advance. Then, you open the file and call syncfs() and get
> > > back no error because REPORTED flag was set when you opened. That error
> > > will then be lost.
> > 
> > Hi Jeff,
> > 
> > In this patch, I am doing this only in ->syncfs() path and not in
> > ->sync_fs() path. IOW, errseq_check_and_advance() will take place
> > only if there is a valid "struct file" passed in. That means there
> > is a consumer of the error and that means it should be fine to
> > set the sb->s_wb_err as SEEN/

Re: [PATCH 3/3] overlayfs: Check writeback errors w.r.t upper in ->syncfs()

2020-12-18 Thread Vivek Goyal
On Thu, Dec 17, 2020 at 03:08:56PM -0500, Jeffrey Layton wrote:
> On Wed, Dec 16, 2020 at 06:31:49PM -0500, Vivek Goyal wrote:
> > Check for writeback error on overlay super block w.r.t "struct file"
> > passed in ->syncfs().
> > 
> > As of now real error happens on upper sb. So this patch first propagates
> > error from upper sb to overlay sb and then checks error w.r.t struct
> > file passed in.
> > 
> > Jeff, I know you prefer that I should rather file upper file and check
> > error directly on on upper sb w.r.t this real upper file.  While I was
> > implementing that I thought what if file is on lower (and has not been
> > copied up yet). In that case shall we not check writeback errors and
> > return back to user space? That does not sound right though because,
> > we are not checking for writeback errors on this file. Rather we
> > are checking for any error on superblock. Upper might have an error
> > and we should report it to user even if file in question is a lower
> > file. And that's why I fell back to this approach. But I am open to
> > change it if there are issues in this method.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/overlayfs/ovl_entry.h |  2 ++
> >  fs/overlayfs/super.c | 15 ---
> >  2 files changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> > index 1b5a2094df8e..a08fd719ee7b 100644
> > --- a/fs/overlayfs/ovl_entry.h
> > +++ b/fs/overlayfs/ovl_entry.h
> > @@ -79,6 +79,8 @@ struct ovl_fs {
> > atomic_long_t last_ino;
> > /* Whiteout dentry cache */
> > struct dentry *whiteout;
> > +   /* Protects multiple sb->s_wb_err update from upper_sb . */
> > +   spinlock_t errseq_lock;
> >  };
> >  
> >  static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index b4d92e6fa5ce..e7bc4492205e 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -291,7 +291,7 @@ int ovl_syncfs(struct file *file)
> > struct super_block *sb = file->f_path.dentry->d_sb;
> > struct ovl_fs *ofs = sb->s_fs_info;
> > struct super_block *upper_sb;
> > -   int ret;
> > +   int ret, ret2;
> >  
> > ret = 0;
> > down_read(&sb->s_umount);
> > @@ -310,10 +310,18 @@ int ovl_syncfs(struct file *file)
> > ret = sync_filesystem(upper_sb);
> > up_read(&upper_sb->s_umount);
> >  
> > +   /* Update overlay sb->s_wb_err */
> > +   if (errseq_check(&upper_sb->s_wb_err, sb->s_wb_err)) {
> > +   /* Upper sb has errors since last time */
> > +   spin_lock(&ofs->errseq_lock);
> > +   errseq_check_and_advance(&upper_sb->s_wb_err, &sb->s_wb_err);
> > +   spin_unlock(&ofs->errseq_lock);
> > +   }
> 
> So, the problem here is that the resulting value in sb->s_wb_err is
> going to end up with the REPORTED flag set (using the naming in my
> latest set). So, a later opener of a file on sb->s_wb_err won't see it.
> 
> For instance, suppose you call sync() on the box and does the above
> check and advance. Then, you open the file and call syncfs() and get
> back no error because REPORTED flag was set when you opened. That error
> will then be lost.

Hi Jeff,

In this patch, I am doing this only in ->syncfs() path and not in
->sync_fs() path. IOW, errseq_check_and_advance() will take place
only if there is a valid "struct file" passed in. That means there
is a consumer of the error and that means it should be fine to
set the sb->s_wb_err as SEEN/REPORTED, right?

If we end up plumbing "struct file" into the existing ->sync_fs() routine,
then I will call this only if a non-NULL struct file has been
passed in. Otherwise skip this step.

IOW, a sync() call will not result in errseq_check_and_advance();
a syncfs() call will.
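
To make the intent concrete, here is a tiny user-space stand-in (compilable,
but purely a toy model; the real errseq_t semantics live in lib/errseq.c)
of "report the error only when there is a consumer, i.e. a struct file
coming from syncfs()":

#include <stdio.h>

/* cursor-style error reporting: each consumer remembers what it has seen */
struct toy_sb   { int err_seq; };       /* bumped on each writeback error */
struct toy_file { int since; };         /* sampled at open time */

static int toy_check_and_advance(const struct toy_sb *sb, int *since)
{
        int ret = (sb->err_seq != *since) ? -5 /* -EIO */ : 0;

        *since = sb->err_seq;           /* advance this consumer's cursor */
        return ret;
}

static int toy_syncfs(const struct toy_sb *upper, struct toy_file *file)
{
        if (!file)                      /* sync(): no consumer to report to */
                return 0;
        return toy_check_and_advance(upper, &file->since);
}

int main(void)
{
        struct toy_sb upper = { .err_seq = 0 };
        struct toy_file f = { .since = 0 };     /* sampled at open */

        upper.err_seq++;                        /* a writeback error happens */
        printf("sync():   %d\n", toy_syncfs(&upper, NULL)); /*  0, nothing consumed */
        printf("syncfs(): %d\n", toy_syncfs(&upper, &f));   /* -5, reported once    */
        printf("syncfs(): %d\n", toy_syncfs(&upper, &f));   /*  0, already reported */
        return 0;
}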

> 
> >  
> > +   ret2 = errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
> >  out:
> > up_read(&sb->s_umount);
> > -   return ret;
> > +   return ret ? ret : ret2;
> >  }
> >  
> >  /**
> > @@ -1903,6 +1911,7 @@ static int ovl_fill_super(struct super_block *sb, 
> > void *data, int silent)
> > if (!cred)
> > goto out_err;
> >  
> > +   spin_lock_init(&ofs->errseq_lock);
> > /* Is there a reason anyone would want not to share whiteouts? */
> > ofs->share_whiteout = true;
> >  
> > @@ -1975,7 +1984,7 @@ static int ovl_fill_super(struct super_block *sb, 
> > void *data, in

Re: [PATCH 1/3] vfs: add new f_op->syncfs vector

2020-12-17 Thread Vivek Goyal
On Thu, Dec 17, 2020 at 10:57:28AM +0100, Jan Kara wrote:
> On Thu 17-12-20 00:49:35, Al Viro wrote:
> > [Christoph added to Cc...]
> > On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> > > Current implementation of __sync_filesystem() ignores the return code
> > > from ->sync_fs(). I am not sure why that's the case. There must have
> > > been some historical reason for this.
> > > 
> > > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > > it can return error if sync_filesystem() on upper super block failed.
> > > That error will simply be lost and sycnfs(overlay_fd), will get
> > > success (despite the fact it failed).
> > > 
> > > If we modify existing implementation, there is a concern that it will
> > > lead to user space visible behavior changes and break things. So
> > > instead implement a new file_operations->syncfs() call which will
> > > be called in syncfs() syscall path. Return code from this new
> > > call will be captured. And all the writeback error detection
> > > logic can go in there as well. Only filesystems which implement
> > > this call get affected by this change. Others continue to fallback
> > > to existing mechanism.
> > 
> > That smells like a massive source of confusion down the road.  I'd just
> > looked through the existing instances; many always return 0, but quite
> > a few sometimes try to return an error:
> > fs/btrfs/super.c:2412:  .sync_fs= btrfs_sync_fs,
> > fs/exfat/super.c:204:   .sync_fs= exfat_sync_fs,
> > fs/ext4/super.c:1674:   .sync_fs= ext4_sync_fs,
> > fs/f2fs/super.c:2480:   .sync_fs= f2fs_sync_fs,
> > fs/gfs2/super.c:1600:   .sync_fs= gfs2_sync_fs,
> > fs/hfsplus/super.c:368: .sync_fs= hfsplus_sync_fs,
> > fs/nilfs2/super.c:689:  .sync_fs= nilfs_sync_fs,
> > fs/ocfs2/super.c:139:   .sync_fs= ocfs2_sync_fs,
> > fs/overlayfs/super.c:399:   .sync_fs= ovl_sync_fs,
> > fs/ubifs/super.c:2052:  .sync_fs   = ubifs_sync_fs,
> > is the list of such.  There are 4 method callers:
> > dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
> > sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
> > return value; for __sync_filesystem() we almost certainly
> > do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
> > after all.  The question for that one is whether we want
> > __sync_blockdev() called even in case of ->sync_fs() reporting
> > a failure, and I suspect that it's safer to call it anyway and
> > return the first error value we'd got.  No idea about quota
> > situation.
> 
> WRT quota situation: All the ->sync_fs() calls there are due to cache
> coherency reasons (we need to get quota changes to disk, then prune quota
> files's page cache, and then userspace can read current quota structures
> from the disk). We don't want to fail dquot_disable() just because caches
> might be incoherent so ignoring ->sync_fs() return value there is fine.
> With dquot_quota_sync() it might make some sense to return the error -
> that's just a backend for Q_SYNC quotactl(2). OTOH I'm not sure anybody
> really cares - Q_SYNC is rarely used.

Thanks Jan. Maybe I will leave dquot_quota_sync() untouched for now. When
somebody has a need to capture the return code from ->sync_fs() there, it
can easily be added.

Vivek



Re: [PATCH 1/3] vfs: add new f_op->syncfs vector

2020-12-17 Thread Vivek Goyal
On Thu, Dec 17, 2020 at 12:49:35AM +, Al Viro wrote:
> [Christoph added to Cc...]
> On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> > Current implementation of __sync_filesystem() ignores the return code
> > from ->sync_fs(). I am not sure why that's the case. There must have
> > been some historical reason for this.
> > 
> > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > it can return error if sync_filesystem() on upper super block failed.
> > That error will simply be lost and sycnfs(overlay_fd), will get
> > success (despite the fact it failed).
> > 
> > If we modify existing implementation, there is a concern that it will
> > lead to user space visible behavior changes and break things. So
> > instead implement a new file_operations->syncfs() call which will
> > be called in syncfs() syscall path. Return code from this new
> > call will be captured. And all the writeback error detection
> > logic can go in there as well. Only filesystems which implement
> > this call get affected by this change. Others continue to fallback
> > to existing mechanism.
> 
> That smells like a massive source of confusion down the road.  I'd just
> looked through the existing instances; many always return 0, but quite
> a few sometimes try to return an error:
> fs/btrfs/super.c:2412:  .sync_fs= btrfs_sync_fs,
> fs/exfat/super.c:204:   .sync_fs= exfat_sync_fs,
> fs/ext4/super.c:1674:   .sync_fs= ext4_sync_fs,
> fs/f2fs/super.c:2480:   .sync_fs= f2fs_sync_fs,
> fs/gfs2/super.c:1600:   .sync_fs= gfs2_sync_fs,
> fs/hfsplus/super.c:368: .sync_fs= hfsplus_sync_fs,
> fs/nilfs2/super.c:689:  .sync_fs= nilfs_sync_fs,
> fs/ocfs2/super.c:139:   .sync_fs= ocfs2_sync_fs,
> fs/overlayfs/super.c:399:   .sync_fs= ovl_sync_fs,
> fs/ubifs/super.c:2052:  .sync_fs   = ubifs_sync_fs,
> is the list of such.  There are 4 method callers:
> dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
> sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
> return value; for __sync_filesystem() we almost certainly
> do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
> after all.  The question for that one is whether we want
> __sync_blockdev() called even in case of ->sync_fs() reporting
> a failure, and I suspect that it's safer to call it anyway and
> return the first error value we'd got.

I posted a V1 patch to do exactly the above. In __sync_filesystem(), capture
the return code from ->sync_fs() but continue to call __sync_blockdev(),
and return the error code from ->sync_fs() if there is one, otherwise
return the error code from __sync_blockdev().

https://lore.kernel.org/linux-fsdevel/20201216143802.ga10...@redhat.com/

Thanks
Vivek

> No idea about quota situation.
> 



[PATCH 1/3] vfs: add new f_op->syncfs vector

2020-12-16 Thread Vivek Goyal
Current implementation of __sync_filesystem() ignores the return code
from ->sync_fs(). I am not sure why that's the case. There must have
been some historical reason for this.

Ignoring ->sync_fs() return code is problematic for overlayfs where
it can return error if sync_filesystem() on upper super block failed.
That error will simply be lost and syncfs(overlay_fd) will get
success (despite the fact it failed).

If we modify existing implementation, there is a concern that it will
lead to user space visible behavior changes and break things. So
instead implement a new file_operations->syncfs() call which will
be called in syncfs() syscall path. Return code from this new
call will be captured. And all the writeback error detection
logic can go in there as well. Only filesystems which implement
this call get affected by this change. Others continue to fall back
to the existing mechanism.

To be clear, I mean something like this (draft, untested) patch. You'd
also need to add a new ->syncfs op for overlayfs, and that could just do
a check_and_advance against the upper layer sb's errseq_t after calling
sync_filesystem.

Vivek, fixed a couple of minor compile errors in the original patch.

Signed-off-by: Jeff Layton 
---
 fs/sync.c  | 29 -
 include/linux/fs.h |  1 +
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 1373a610dc78..06caa9758d93 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -155,27 +155,38 @@ void emergency_sync(void)
}
 }
 
+static int generic_syncfs(struct file *file)
+{
+   int ret, ret2;
+   struct super_block *sb = file->f_path.dentry->d_sb;
+
+   down_read(&sb->s_umount);
+   ret = sync_filesystem(sb);
+   up_read(&sb->s_umount);
+
+   ret2 = errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
+
+   return ret ? ret : ret2;
+}
+
 /*
  * sync a single super
  */
 SYSCALL_DEFINE1(syncfs, int, fd)
 {
struct fd f = fdget(fd);
-   struct super_block *sb;
-   int ret, ret2;
+   int ret;
 
if (!f.file)
return -EBADF;
-   sb = f.file->f_path.dentry->d_sb;
-
-   down_read(&sb->s_umount);
-   ret = sync_filesystem(sb);
-   up_read(&sb->s_umount);
 
-   ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
+   if (f.file->f_op->syncfs)
+   ret = f.file->f_op->syncfs(f.file);
+   else
+   ret = generic_syncfs(f.file);
 
fdput(f);
-   return ret ? ret : ret2;
+   return ret;
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8667d0cdc71e..6710469b7e33 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1859,6 +1859,7 @@ struct file_operations {
   struct file *file_out, loff_t pos_out,
   loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+   int (*syncfs)(struct file *);
 } __randomize_layout;
 
 struct inode_operations {
-- 
2.25.4



[PATCH 3/3] overlayfs: Check writeback errors w.r.t upper in ->syncfs()

2020-12-16 Thread Vivek Goyal
Check for writeback error on overlay super block w.r.t "struct file"
passed in ->syncfs().

As of now real error happens on upper sb. So this patch first propagates
error from upper sb to overlay sb and then checks error w.r.t struct
file passed in.

Jeff, I know you prefer that I should rather use the upper file and check
the error directly on the upper sb w.r.t this real upper file.  While I was
implementing that I thought: what if the file is on the lower layer (and has
not been copied up yet)? In that case shall we not check writeback errors and
just return to user space? That does not sound right though, because
we are not checking for writeback errors on this file. Rather we
are checking for any error on superblock. Upper might have an error
and we should report it to user even if file in question is a lower
file. And that's why I fell back to this approach. But I am open to
change it if there are issues in this method.

Signed-off-by: Vivek Goyal 
---
 fs/overlayfs/ovl_entry.h |  2 ++
 fs/overlayfs/super.c | 15 ---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 1b5a2094df8e..a08fd719ee7b 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -79,6 +79,8 @@ struct ovl_fs {
atomic_long_t last_ino;
/* Whiteout dentry cache */
struct dentry *whiteout;
+   /* Protects multiple sb->s_wb_err update from upper_sb . */
+   spinlock_t errseq_lock;
 };
 
 static inline struct vfsmount *ovl_upper_mnt(struct ovl_fs *ofs)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index b4d92e6fa5ce..e7bc4492205e 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -291,7 +291,7 @@ int ovl_syncfs(struct file *file)
struct super_block *sb = file->f_path.dentry->d_sb;
struct ovl_fs *ofs = sb->s_fs_info;
struct super_block *upper_sb;
-   int ret;
+   int ret, ret2;
 
ret = 0;
down_read(&sb->s_umount);
@@ -310,10 +310,18 @@ int ovl_syncfs(struct file *file)
ret = sync_filesystem(upper_sb);
up_read(&upper_sb->s_umount);
 
+   /* Update overlay sb->s_wb_err */
+   if (errseq_check(&upper_sb->s_wb_err, sb->s_wb_err)) {
+   /* Upper sb has errors since last time */
+   spin_lock(&ofs->errseq_lock);
+   errseq_check_and_advance(&upper_sb->s_wb_err, &sb->s_wb_err);
+   spin_unlock(&ofs->errseq_lock);
+   }
 
+   ret2 = errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
 out:
up_read(&sb->s_umount);
-   return ret;
+   return ret ? ret : ret2;
 }
 
 /**
@@ -1903,6 +1911,7 @@ static int ovl_fill_super(struct super_block *sb, void 
*data, int silent)
if (!cred)
goto out_err;
 
+   spin_lock_init(&ofs->errseq_lock);
/* Is there a reason anyone would want not to share whiteouts? */
ofs->share_whiteout = true;
 
@@ -1975,7 +1984,7 @@ static int ovl_fill_super(struct super_block *sb, void 
*data, int silent)
 
sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth;
sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran;
-
+   sb->s_wb_err = errseq_sample(&ovl_upper_mnt(ofs)->mnt_sb->s_wb_err);
}
oe = ovl_get_lowerstack(sb, splitlower, numlower, ofs, layers);
err = PTR_ERR(oe);
-- 
2.25.4



[PATCH 2/3] overlayfs: Implement f_op->syncfs() call

2020-12-16 Thread Vivek Goyal
Provide an implementation for ->syncfs(). Now if there is an error
returned by sync_filesystem(upper_sb), it will be visible to user
space. Currently in ovl_sync_fs() path, this error is ignored by VFS.

A later patch also adds logic to detect writeback error.

Signed-off-by: Vivek Goyal 
---
 fs/overlayfs/file.c  |  1 +
 fs/overlayfs/overlayfs.h |  3 +++
 fs/overlayfs/readdir.c   |  1 +
 fs/overlayfs/super.c | 30 ++
 4 files changed, 35 insertions(+)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..affc1ba63202 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -806,6 +806,7 @@ const struct file_operations ovl_file_operations = {
 
.copy_file_range= ovl_copy_file_range,
.remap_file_range   = ovl_remap_file_range,
+   .syncfs = ovl_syncfs,
 };
 
 int __init ovl_aio_request_cache_init(void)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..1efb13800755 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -520,3 +520,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry 
*lower,
 
 /* export.c */
 extern const struct export_operations ovl_export_operations;
+
+/* super.c */
+int ovl_syncfs(struct file *file);
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..e89b450c8f8f 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -975,6 +975,7 @@ const struct file_operations ovl_dir_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ovl_compat_ioctl,
 #endif
+   .syncfs = ovl_syncfs,
 };
 
 int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..b4d92e6fa5ce 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -286,6 +286,36 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
return ret;
 }
 
+int ovl_syncfs(struct file *file)
+{
+   struct super_block *sb = file->f_path.dentry->d_sb;
+   struct ovl_fs *ofs = sb->s_fs_info;
+   struct super_block *upper_sb;
+   int ret;
+
+   ret = 0;
+   down_read(&sb->s_umount);
+   if (sb_rdonly(sb))
+   goto out;
+
+   if (!ovl_upper_mnt(ofs))
+   goto out;
+
+   if (!ovl_should_sync(ofs))
+   goto out;
+
+   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
+
+   down_read(&upper_sb->s_umount);
+   ret = sync_filesystem(upper_sb);
+   up_read(&upper_sb->s_umount);
+
+
+out:
+   up_read(&sb->s_umount);
+   return ret;
+}
+
 /**
  * ovl_statfs
  * @sb: The overlayfs super block
-- 
2.25.4



[RFC PATCH 0/3] vfs, overlayfs: Fix syncfs() to return error

2020-12-16 Thread Vivek Goyal
Hi,

This is V2 of the patches which try to fix syncfs() for overlayfs to
return an error code when sync_filesystem(upper_sb) returns an error or
there are writeback errors on upper_sb.

I posted V1 of patch here.

https://lore.kernel.org/linux-fsdevel/20201216143802.ga10...@redhat.com/

This is just a compile-tested patch series. I am trying to get early feedback
to figure out what direction to move in to fix this issue.

Thanks
Vivek

Vivek Goyal (3):
  vfs: add new f_op->syncfs vector
  overlayfs: Implement f_op->syncfs() call
  overlayfs: Check writeback errors w.r.t upper in ->syncfs()

 fs/overlayfs/file.c  |  1 +
 fs/overlayfs/overlayfs.h |  3 +++
 fs/overlayfs/ovl_entry.h |  2 ++
 fs/overlayfs/readdir.c   |  1 +
 fs/overlayfs/super.c | 41 +++-
 fs/sync.c| 29 +++-
 include/linux/fs.h   |  1 +
 7 files changed, 68 insertions(+), 10 deletions(-)

-- 
2.25.4



Re: [PATCH] vfs, syncfs: Do not ignore return code from ->sync_fs()

2020-12-16 Thread Vivek Goyal
On Wed, Dec 16, 2020 at 10:53:16AM -0500, Jeff Layton wrote:
> On Wed, 2020-12-16 at 10:44 -0500, Jeff Layton wrote:
> > On Wed, 2020-12-16 at 10:14 -0500, Vivek Goyal wrote:
> > > On Wed, Dec 16, 2020 at 09:57:49AM -0500, Jeff Layton wrote:
> > > > On Wed, 2020-12-16 at 09:38 -0500, Vivek Goyal wrote:
> > > > > I see that current implementation of __sync_filesystem() ignores the
> > > > > return code from ->sync_fs(). I am not sure why that's the case.
> > > > > 
> > > > > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > > > > it can return error if sync_filesystem() on upper super block failed.
> > > > > That error will simply be lost and sycnfs(overlay_fd), will get
> > > > > success (despite the fact it failed).
> > > > > 
> > > > > I am assuming that we want to continue to call __sync_blockdev()
> > > > > despite the fact that there have been errors reported from
> > > > > ->sync_fs(). So I wrote this simple patch which captures the
> > > > > error from ->sync_fs() but continues to call __sync_blockdev()
> > > > > and returns error from sync_fs() if there is one.
> > > > > 
> > > > > There might be some very good reasons to not capture ->sync_fs()
> > > > > return code, I don't know. Hence thought of proposing this patch.
> > > > > Atleast I will get to know the reason. I still need to figure
> > > > > a way out how to propagate overlay sync_fs() errors to user
> > > > > space.
> > > > > 
> > > > > Signed-off-by: Vivek Goyal 
> > > > > ---
> > > > >  fs/sync.c |8 ++--
> > > > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > > > 
> > > > > Index: redhat-linux/fs/sync.c
> > > > > ===
> > > > > --- redhat-linux.orig/fs/sync.c   2020-12-16 09:15:49.831565653 
> > > > > -0500
> > > > > +++ redhat-linux/fs/sync.c2020-12-16 09:23:42.499853207 -0500
> > > > > @@ -30,14 +30,18 @@
> > > > >   */
> > > > >  static int __sync_filesystem(struct super_block *sb, int wait)
> > > > >  {
> > > > > + int ret, ret2;
> > > > > +
> > > > >   if (wait)
> > > > >   sync_inodes_sb(sb);
> > > > >   else
> > > > >   writeback_inodes_sb(sb, WB_REASON_SYNC);
> > > > >  
> > > > > 
> > > > >   if (sb->s_op->sync_fs)
> > > > > - sb->s_op->sync_fs(sb, wait);
> > > > > - return __sync_blockdev(sb->s_bdev, wait);
> > > > > + ret = sb->s_op->sync_fs(sb, wait);
> > > > > + ret2 = __sync_blockdev(sb->s_bdev, wait);
> > > > > +
> > > > > + return ret ? ret : ret2;
> > > > >  }
> > > > >  
> > > > > 
> > > > >  /*
> > > > > 
> > > > 
> > > > I posted a patchset that took a similar approach a couple of years ago,
> > > > and we decided not to go with it [1].
> > > > 
> > > > While it's not ideal to ignore the error here, I think this is likely to
> > > > break stuff.
> > > 
> > > So one side affect I see is that syncfs() might start returning errors
> > > in some cases which were not reported at all. I am wondering will that
> > > count as breakage.
> > > 
> > > > What may be better is to just make sync_fs void return, so
> > > > people don't think that returned errors there mean anything.
> > > 
> > > May be. 
> > > 
> > > But then question remains that how do we return error to user space
> > > in syncfs(fd) for overlayfs. I will not be surprised if other
> > > filesystems want to return errors as well.
> > > 
> > > Shall I create new helpers and call these in case of syncfs(). But
> > > that too will start returning new errors on syncfs(). So it does
> > > not solve that problem (if it is a problem).
> > > 
> > > Or we can define a new super block op say ->sync_fs2() and call that
> > > first and in that case capture return code. That way it will not
> > > impact existing cases and overlayfs can possibly make use o

Re: [PATCH] vfs, syncfs: Do not ignore return code from ->sync_fs()

2020-12-16 Thread Vivek Goyal
On Wed, Dec 16, 2020 at 09:57:49AM -0500, Jeff Layton wrote:
> On Wed, 2020-12-16 at 09:38 -0500, Vivek Goyal wrote:
> > I see that current implementation of __sync_filesystem() ignores the
> > return code from ->sync_fs(). I am not sure why that's the case.
> > 
> > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > it can return error if sync_filesystem() on upper super block failed.
> > That error will simply be lost and sycnfs(overlay_fd), will get
> > success (despite the fact it failed).
> > 
> > I am assuming that we want to continue to call __sync_blockdev()
> > despite the fact that there have been errors reported from
> > ->sync_fs(). So I wrote this simple patch which captures the
> > error from ->sync_fs() but continues to call __sync_blockdev()
> > and returns error from sync_fs() if there is one.
> > 
> > There might be some very good reasons to not capture ->sync_fs()
> > return code, I don't know. Hence thought of proposing this patch.
> > Atleast I will get to know the reason. I still need to figure
> > a way out how to propagate overlay sync_fs() errors to user
> > space.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/sync.c |8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > Index: redhat-linux/fs/sync.c
> > ===
> > --- redhat-linux.orig/fs/sync.c 2020-12-16 09:15:49.831565653 -0500
> > +++ redhat-linux/fs/sync.c  2020-12-16 09:23:42.499853207 -0500
> > @@ -30,14 +30,18 @@
> >   */
> >  static int __sync_filesystem(struct super_block *sb, int wait)
> >  {
> > +   int ret, ret2;
> > +
> >     if (wait)
> >     sync_inodes_sb(sb);
> >     else
> >     writeback_inodes_sb(sb, WB_REASON_SYNC);
> >  
> > 
> >     if (sb->s_op->sync_fs)
> > -   sb->s_op->sync_fs(sb, wait);
> > -   return __sync_blockdev(sb->s_bdev, wait);
> > +   ret = sb->s_op->sync_fs(sb, wait);
> > +   ret2 = __sync_blockdev(sb->s_bdev, wait);
> > +
> > +   return ret ? ret : ret2;
> >  }
> >  
> > 
> >  /*
> > 
> 
> I posted a patchset that took a similar approach a couple of years ago,
> and we decided not to go with it [1].
> 
> While it's not ideal to ignore the error here, I think this is likely to
> break stuff.

So one side effect I see is that syncfs() might start returning errors
in some cases where they were not reported at all. I am wondering whether
that will count as breakage.

> What may be better is to just make sync_fs void return, so
> people don't think that returned errors there mean anything.

Maybe.

But then the question remains: how do we return an error to user space
from syncfs(fd) for overlayfs? I will not be surprised if other
filesystems want to return errors as well.

Shall I create new helpers and call these in the syncfs() case? But
that too will start returning new errors from syncfs(), so it does
not solve that problem (if it is a problem).

Or we can define a new super block op, say ->sync_fs2(), call that
first, and in that case capture the return code. That way it will not
impact existing cases and overlayfs can possibly make use of
->sync_fs2() and return an error. IOW, the impact will be limited to
only the filesystems which choose to implement ->sync_fs2().
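
Just to illustrate the shape of that idea, something along these lines
(purely hypothetical: neither ->sync_fs2() nor this trimmed-down ops
struct exists in the kernel; it only shows how a new op could carry a
return code without changing behaviour for existing filesystems):

struct super_block;

struct super_operations_sketch {
        void (*sync_fs)(struct super_block *sb, int wait);  /* existing op, rc ignored */
        int  (*sync_fs2)(struct super_block *sb, int wait); /* new op, rc propagated */
};

/* what the syncfs() path could do */
static int sync_fs_for_syncfs(struct super_block *sb,
                              const struct super_operations_sketch *op, int wait)
{
        if (op->sync_fs2)
                return op->sync_fs2(sb, wait);  /* error reaches user space */
        if (op->sync_fs)
                op->sync_fs(sb, wait);          /* old behaviour: error dropped */
        return 0;
}

Filesystems that never implement ->sync_fs2() would keep today's behaviour.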

Thanks
Vivek

> 
> [1]: 
> https://lore.kernel.org/linux-fsdevel/20180518123415.28181-1-jlay...@kernel.org/
> -- 
> Jeff Layton 
> 



[PATCH] vfs, syncfs: Do not ignore return code from ->sync_fs()

2020-12-16 Thread Vivek Goyal
I see that current implementation of __sync_filesystem() ignores the
return code from ->sync_fs(). I am not sure why that's the case.

Ignoring ->sync_fs() return code is problematic for overlayfs where
it can return error if sync_filesystem() on upper super block failed.
That error will simply be lost and syncfs(overlay_fd) will get
success (despite the fact it failed).

I am assuming that we want to continue to call __sync_blockdev()
despite the fact that there have been errors reported from
->sync_fs(). So I wrote this simple patch which captures the
error from ->sync_fs() but continues to call __sync_blockdev()
and returns error from sync_fs() if there is one.

There might be some very good reasons to not capture the ->sync_fs()
return code, I don't know. Hence I thought of proposing this patch.
At least I will get to know the reason. I still need to figure
out a way to propagate overlay sync_fs() errors to user
space.

Signed-off-by: Vivek Goyal 
---
 fs/sync.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Index: redhat-linux/fs/sync.c
===
--- redhat-linux.orig/fs/sync.c 2020-12-16 09:15:49.831565653 -0500
+++ redhat-linux/fs/sync.c  2020-12-16 09:23:42.499853207 -0500
@@ -30,14 +30,18 @@
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
+   int ret = 0, ret2;
+
if (wait)
sync_inodes_sb(sb);
else
writeback_inodes_sb(sb, WB_REASON_SYNC);
 
if (sb->s_op->sync_fs)
-   sb->s_op->sync_fs(sb, wait);
-   return __sync_blockdev(sb->s_bdev, wait);
+   ret = sb->s_op->sync_fs(sb, wait);
+   ret2 = __sync_blockdev(sb->s_bdev, wait);
+
+   return ret ? ret : ret2;
 }
 
 /*



Re: Unbreakable loop in fuse_fill_write_pages()

2020-10-13 Thread Vivek Goyal
On Tue, Oct 13, 2020 at 03:53:19PM -0400, Qian Cai wrote:
> On Tue, 2020-10-13 at 14:58 -0400, Vivek Goyal wrote:
> 
> > I am wondering if virtiofsd still alive and responding to requests? I
> > see another task which is blocked on getdents() for more than 120s.
> > 
> > [10580.142571][  T348] INFO: task trinity-c36:254165 blocked for more than 
> > 123
> > +seconds.
> > [10580.143924][  T348]   Tainted: G   O  5.9.0-next-20201013+ #2
> > [10580.145158][  T348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > +disables this message.
> > [10580.146636][  T348] task:trinity-c36 state:D stack:26704 pid:254165
> > ppid:
> > +87180 flags:0x0004
> > [10580.148260][  T348] Call Trace:
> > [10580.148789][  T348]  __schedule+0x71d/0x1b50
> > [10580.149532][  T348]  ? __sched_text_start+0x8/0x8
> > [10580.150343][  T348]  schedule+0xbf/0x270
> > [10580.151044][  T348]  schedule_preempt_disabled+0xc/0x20
> > [10580.152006][  T348]  __mutex_lock+0x9f1/0x1360
> > [10580.152777][  T348]  ? __fdget_pos+0x9c/0xb0
> > [10580.153484][  T348]  ? mutex_lock_io_nested+0x1240/0x1240
> > [10580.154432][  T348]  ? find_held_lock+0x33/0x1c0
> > [10580.155220][  T348]  ? __fdget_pos+0x9c/0xb0
> > [10580.155934][  T348]  __fdget_pos+0x9c/0xb0
> > [10580.156660][  T348]  __x64_sys_getdents+0xff/0x230
> > 
> > May be virtiofsd crashed and hence no requests are completing leading
> > to a hard lockup?
> Virtiofsd is still working. Once this happened, I manually create a file on 
> the
> guest (in virtiofs) and then I can see the content of it from the host.

Hmm... so how do I reproduce it? Just run trinity as root and it will
reproduce after some time?

Vivek



Re: Unbreakable loop in fuse_fill_write_pages()

2020-10-13 Thread Vivek Goyal
On Tue, Oct 13, 2020 at 02:40:26PM -0400, Vivek Goyal wrote:
> On Tue, Oct 13, 2020 at 01:11:05PM -0400, Qian Cai wrote:
> > Running some fuzzing on virtiofs with an unprivileged user on today's 
> > linux-next 
> > could trigger soft-lockups below.
> > 
> > # virtiofsd --socket-path=/tmp/vhostqemu -o source=$TESTDIR -o cache=always 
> > -o no_posix_lock
> > 
> > Basically, everything was blocking on inode_lock(inode) because one thread
> > (trinity-c33) was holding it but stuck in the loop in 
> > fuse_fill_write_pages()
> > and unable to exit for more than 10 minutes before I executed sysrq-t.
> > Afterwards, the systems was totally unresponsive:
> > 
> > kernel:NMI watchdog: Watchdog detected hard LOCKUP on cpu 8
> > 
> > To exit the loop, it needs,
> > 
> > iov_iter_advance(ii, tmp) to set "tmp" to non-zero for each iteration.
> > 
> > and
> > 
> > } while (iov_iter_count(ii) && count < fc->max_write &&
> >  ap->num_pages < max_pages && offset == 0);
> > 
> > == the thread is stuck in the loop ==
> > [10813.290694] task:trinity-c33 state:D stack:25888 pid:254219 ppid: 
> > 87180
> > flags:0x4004
> > [10813.292671] Call Trace:
> > [10813.293379]  __schedule+0x71d/0x1b50
> > [10813.294182]  ? __sched_text_start+0x8/0x8
> > [10813.295146]  ? mark_held_locks+0xb0/0x110
> > [10813.296117]  schedule+0xbf/0x270
> > [10813.296782]  ? __lock_page_killable+0x276/0x830
> > [10813.297867]  io_schedule+0x17/0x60
> > [10813.298772]  __lock_page_killable+0x33b/0x830
> 
> This seems to suggest that filemap_fault() is blocked on page lock and
> is sleeping. For some reason it never wakes up. Not sure why.
> 
> And this will be called from.
> 
> fuse_fill_write_pages()
>iov_iter_fault_in_readable()
> 
> So fuse code will take inode_lock() and then looks like same process
> is sleeping waiting on page lock. And rest of the processes get blocked
> behind inode lock.
> 
> If we are woken up (while waiting on page lock), we should make forward
> progress. Question is what page it is and why the entity which is
> holding lock is not releasing lock.

I am wondering if virtiofsd is still alive and responding to requests? I
see another task which is blocked on getdents() for more than 120s.

[10580.142571][  T348] INFO: task trinity-c36:254165 blocked for more than 123
+seconds.
[10580.143924][  T348]   Tainted: G   O  5.9.0-next-20201013+ #2
[10580.145158][  T348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
+disables this message.
[10580.146636][  T348] task:trinity-c36 state:D stack:26704 pid:254165 ppid:
+87180 flags:0x0004
[10580.148260][  T348] Call Trace:
[10580.148789][  T348]  __schedule+0x71d/0x1b50
[10580.149532][  T348]  ? __sched_text_start+0x8/0x8
[10580.150343][  T348]  schedule+0xbf/0x270
[10580.151044][  T348]  schedule_preempt_disabled+0xc/0x20
[10580.152006][  T348]  __mutex_lock+0x9f1/0x1360
[10580.152777][  T348]  ? __fdget_pos+0x9c/0xb0
[10580.153484][  T348]  ? mutex_lock_io_nested+0x1240/0x1240
[10580.154432][  T348]  ? find_held_lock+0x33/0x1c0
[10580.155220][  T348]  ? __fdget_pos+0x9c/0xb0
[10580.155934][  T348]  __fdget_pos+0x9c/0xb0
[10580.156660][  T348]  __x64_sys_getdents+0xff/0x230

Maybe virtiofsd crashed and hence no requests are completing, leading
to a hard lockup?

Vivek



Re: Unbreakable loop in fuse_fill_write_pages()

2020-10-13 Thread Vivek Goyal
On Tue, Oct 13, 2020 at 01:11:05PM -0400, Qian Cai wrote:
> Running some fuzzing on virtiofs with an unprivileged user on today's 
> linux-next 
> could trigger soft-lockups below.
> 
> # virtiofsd --socket-path=/tmp/vhostqemu -o source=$TESTDIR -o cache=always 
> -o no_posix_lock
> 
> Basically, everything was blocking on inode_lock(inode) because one thread
> (trinity-c33) was holding it but stuck in the loop in fuse_fill_write_pages()
> and unable to exit for more than 10 minutes before I executed sysrq-t.
> Afterwards, the systems was totally unresponsive:
> 
> kernel:NMI watchdog: Watchdog detected hard LOCKUP on cpu 8
> 
> To exit the loop, it needs,
> 
> iov_iter_advance(ii, tmp) to set "tmp" to non-zero for each iteration.
> 
> and
> 
>   } while (iov_iter_count(ii) && count < fc->max_write &&
>ap->num_pages < max_pages && offset == 0);
> 
> == the thread is stuck in the loop ==
> [10813.290694] task:trinity-c33 state:D stack:25888 pid:254219 ppid: 87180
> flags:0x4004
> [10813.292671] Call Trace:
> [10813.293379]  __schedule+0x71d/0x1b50
> [10813.294182]  ? __sched_text_start+0x8/0x8
> [10813.295146]  ? mark_held_locks+0xb0/0x110
> [10813.296117]  schedule+0xbf/0x270
> [10813.296782]  ? __lock_page_killable+0x276/0x830
> [10813.297867]  io_schedule+0x17/0x60
> [10813.298772]  __lock_page_killable+0x33b/0x830

This seems to suggest that filemap_fault() is blocked on page lock and
is sleeping. For some reason it never wakes up. Not sure why.

And this will be called from:

fuse_fill_write_pages()
   iov_iter_fault_in_readable()

So the fuse code will take inode_lock() and then it looks like the same
process is sleeping waiting on the page lock. And the rest of the processes
get blocked behind the inode lock.

If we are woken up (while waiting on the page lock), we should make forward
progress. The question is what page it is and why the entity which is
holding the lock is not releasing it.

Thanks
Vivek

> [10813.299695]  ? wait_on_page_bit+0x710/0x710
> [10813.300609]  ? __lock_page_or_retry+0x3c0/0x3c0
> [10813.301894]  ? up_read+0x1a3/0x730
> [10813.302791]  ? page_cache_free_page.isra.45+0x390/0x390
> [10813.304077]  filemap_fault+0x2bd/0x2040
> [10813.305019]  ? read_cache_page_gfp+0x10/0x10
> [10813.306041]  ? lock_downgrade+0x700/0x700
> [10813.306958]  ? replace_page_cache_page+0x1130/0x1130
> [10813.308124]  __do_fault+0xf5/0x530
> [10813.308968]  handle_mm_fault+0x1c0e/0x25b0
> [10813.309955]  ? copy_page_range+0xfe0/0xfe0
> [10813.310895]  do_user_addr_fault+0x383/0x820
> [10813.312084]  exc_page_fault+0x56/0xb0
> [10813.312979]  asm_exc_page_fault+0x1e/0x30
> [10813.313978] RIP: 0010:iov_iter_fault_in_readable+0x271/0x350
> fault_in_pages_readable at include/linux/pagemap.h:745
> (inlined by) iov_iter_fault_in_readable at lib/iov_iter.c:438
> [10813.315293] Code: 48 39 d7 0f 82 1a ff ff ff 0f 01 cb 0f ae e8 44 89 c0 8a 
> 0a
> 0f 01 ca 88 4c 24 70 85 c0 74 da e9 f8 fe ff ff 0f 01 cb 0f ae e8 <8a> 11 0f 
> 01
> ca 88 54 24 30 85 c0 0f 85 04 ff ff ff 48 29 ee e9
>  45
> [10813.319196] RSP: 0018:c90017ccf830 EFLAGS: 00050246
> [10813.320446] RAX:  RBX: 192002f99f08 RCX: 
> 7fe284f1004c
> [10813.322202] RDX: 0001 RSI: 1000 RDI: 
> 8887a7664000
> [10813.323729] RBP: 1000 R08:  R09: 
> 
> [10813.325282] R10: c90017ccfd48 R11: ed102789d5ff R12: 
> 8887a7664020
> [10813.326898] R13: c90017ccfd40 R14: dc00 R15: 
> 00e0df6a
> [10813.328456]  ? iov_iter_revert+0x8e0/0x8e0
> [10813.329404]  ? copyin+0x96/0xc0
> [10813.330230]  ? iov_iter_copy_from_user_atomic+0x1f0/0xa40
> [10813.331742]  fuse_perform_write+0x3eb/0xf20 [fuse]
> fuse_fill_write_pages at fs/fuse/file.c:1150
> (inlined by) fuse_perform_write at fs/fuse/file.c:1226
> [10813.332880]  ? fuse_file_fallocate+0x5f0/0x5f0 [fuse]
> [10813.334090]  fuse_file_write_iter+0x6b7/0x900 [fuse]
> [10813.335191]  do_iter_readv_writev+0x42b/0x6d0
> [10813.336161]  ? new_sync_write+0x610/0x610
> [10813.337194]  do_iter_write+0x11f/0x5b0
> [10813.338177]  ? __sb_start_write+0x229/0x2d0
> [10813.339169]  vfs_writev+0x16d/0x2d0
> [10813.339973]  ? vfs_iter_write+0xb0/0xb0
> [10813.340950]  ? __fdget_pos+0x9c/0xb0
> [10813.342039]  ? rcu_read_lock_sched_held+0x9c/0xd0
> [10813.343120]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [10813.344104]  ? find_held_lock+0x33/0x1c0
> [10813.345050]  do_writev+0xfb/0x1e0
> [10813.345920]  ? vfs_writev+0x2d0/0x2d0
> [10813.346802]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [10813.348026]  ? syscall_enter_from_user_mode+0x1c/0x50
> [10813.349197]  do_syscall_64+0x33/0x40
> [10813.350026]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote:

[..]
> > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > into the guest is not an option.  
> > 
> > Not even #MC? So sad :-)
> 
> Heh, #MC isn't allowed either, yet...

If #MC is not allowed, logic related to hwpoison memory will not work,
as that seems to inject #MC.

Vivek



Re: [Virtio-fs] [PATCH v4] kvm, x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 06:21:48PM +0100, Dr. David Alan Gilbert wrote:
> * Sean Christopherson (sean.j.christopher...@intel.com) wrote:
> > On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote:
> > > Sean Christopherson  writes:
> > > 
> > > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> > > >> Vivek Goyal  writes:
> > > >> > So you will have to report token (along with -EFAULT) to user space. 
> > > >> > So this
> > > >> > is basically the 3rd proposal which is extension of kvm API and will
> > > >> > report say HVA/GFN also to user space along with -EFAULT.
> > > >> 
> > > >> Right, I meant to say that guest kernel has full register state of the
> > > >> userspace process which caused APF to get queued and instead of trying
> > > >> to extract it in KVM and pass to userspace in case of a (later) failure
> > > >> we limit KVM api change to contain token or GFN only and somehow keep
> > > >> the rest in the guest. This should help with TDX/SEV-ES.
> > > >
> > > > Whatever gets reported to userspace should be identical with and without
> > > > async page faults, i.e. it definitely shouldn't have token information.
> > > >
> > > 
> > > Oh, right, when the error gets reported synchronously guest's kernel is
> > > not yet aware of the issue so it won't be possible to find anything in
> > > its kdump if userspace decides to crash it immediately. The register
> > > state (if available) will be actual though.
> > > 
> > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > > into the guest is not an option.  
> > > 
> > > Not even #MC? So sad :-)
> > 
> > Heh, #MC isn't allowed either, yet...
> > 
> > > > Nor do I think that's "correct" behavior (see everyone's objections to
> > > > using #PF for APF fixed).  I.e. the event should probably be an IRQ.
> > > 
> > > I recall Paolo objected against making APF 'page not present' into in
> > > interrupt as it will require some very special handling to make sure it
> > > gets injected (and handled) immediately but I'm not really sure how big
> > > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a
> > > try.
> > 
> > This shouldn't have anything to do with APF.  Again, the event injection is
> > needed even in the synchronous case as the file truncation in the host can
> > affect existing mappings in the guest.
> > 
> > I don't know that the mechanism needs to be virtiofs specific or if there 
> > can
> > be a more generic "these PFNs have disappeared", but it's most definitely
> > orthogonal to APF.
> 
> There are other cases we get 'these PFNs have disappeared' other than
> virtiofs;  the classic is when people back the guest using a tmpfs that
> then runs out of room.

I also played with nvdimm driver where device was backed a file on host.
If I truncate that file, we face similar issues.

https://lore.kernel.org/kvm/20200616214847.24482-1-vgo...@redhat.com/

I think any resource which can be backed by a file on host, can
potentially run into this issue if file is truncated.
(if guest can do load/store on these pages directly). 

Thanks
Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 09:12:00AM -0700, Sean Christopherson wrote:
> On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> > Vivek Goyal  writes:
> > > So you will have to report token (along with -EFAULT) to user space. So 
> > > this
> > > is basically the 3rd proposal which is extension of kvm API and will
> > > report say HVA/GFN also to user space along with -EFAULT.
> > 
> > Right, I meant to say that guest kernel has full register state of the
> > userspace process which caused APF to get queued and instead of trying
> > to extract it in KVM and pass to userspace in case of a (later) failure
> > we limit KVM api change to contain token or GFN only and somehow keep
> > the rest in the guest. This should help with TDX/SEV-ES.
> 
> Whatever gets reported to userspace should be identical with and without
> async page faults, i.e. it definitely shouldn't have token information.
> 
> Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> into the guest is not an option.  Nor do I think that's "correct"
> behavior (see everyone's objections to using #PF for APF fixed).  I.e. the
> event should probably be an IRQ.

I am not sure if an IRQ for "Page not Present" works. Will it have some
conflicts/issues with other high priority interrupts which can
get injected before "Page not present"?

Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal  writes:
> 
> > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal  writes:
> >> 
> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't
> >> >wait for the accessing task to run on vcpu again. 
> >> 
> >> What if we also save the required information (RIP, GFN, ...) in the
> >> guest along with the APF token
> >
> > Can you elaborate a bit more on this. You mean save GFN on stack before
> > it starts waiting for PAGE_READY event?
> 
> When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest
> kernel gets all the registers of the userspace process (except for CR2
> which is replaced with a token). In case it is not trivial to extract
> accessed GFN from this data we can extend the shared APF structure and
> add it there, KVM has it when it queues APF.
> 
> >
> >> so in case of -EFAULT we can just 'crash'
> >> the guest and the required information can easily be obtained from
> >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
> >> kdump is possible there).
> >
> > Just saving additional info in guest will not help because there might
> > be many tasks waiting and you don't know which GFN is problematic one.
> 
> But KVM knows which token caused the -EFAULT when we exit to userspace
> (and we can pass this information to it) so to debug the situation you
> take this token and then explore the kdump searching for what's
> associated with this exact token.

So you will have to report the token (along with -EFAULT) to user space. So this
is basically the 3rd proposal, which is an extension of the kvm API and will
report, say, the HVA/GFN also to user space along with -EFAULT.

Thanks
Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal  writes:
> 
> > A. Just exit to user space with -EFAULT (using kvm request) and don't
> >wait for the accessing task to run on vcpu again. 
> 
> What if we also save the required information (RIP, GFN, ...) in the
> guest along with the APF token

Can you elaborate a bit more on this? You mean save the GFN on the stack before
it starts waiting for the PAGE_READY event?

> so in case of -EFAULT we can just 'crash'
> the guest and the required information can easily be obtained from
> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
> kdump is possible there).

Just saving additional info in the guest will not help because there might
be many tasks waiting and you don't know which GFN is the problematic one.

Thanks
Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Mon, Oct 05, 2020 at 09:16:20AM -0700, Sean Christopherson wrote:
> On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> > Now I have few questions.
> > 
> > - If we exit to user space asynchronously (using kvm request), what debug
> >   information is in there which tells user which address is bad. I admit
> >   that even above trace does not seem to be telling me directly which
> >   address (HVA?) is bad.
> > 
> >   But if I take a crash dump of guest, using above information I should
> >   be able to get to GPA which is problematic. And looking at /proc/iomem
> >   it should also tell which device this memory region is in.
> > 
> >   Also using this crash dump one should be able to walk through virtiofs 
> > data
> >   structures and figure out which file and what offset with-in file does
> >   it belong to. Now one can look at filesystem on host and see file got
> >   truncated and it will become obvious it can't be faulted in. And then
> >   one can continue to debug that how did we arrive here.
> > 
> > But if we don't exit to user space synchronously, Only relevant
> > information we seem to have is -EFAULT. Apart from that, how does one
> > figure out what address is bad, or who tried to access it. Or which
> > file/offset does it belong to etc.
> >
> > I agree that problem is not necessarily in guest code. But by exiting
> > synchronously, it gives enough information that one can use crash
> > dump to get to bottom of the issue. If we exit to user space
> > asynchronously, all this information will be lost and it might make
> > it very hard to figure out (if not impossible), what's going on.
> 
> If we want userspace to be able to do something useful, KVM should explicitly
> inform userspace about the error, userspace shouldn't simply assume that
> -EFAULT means a HVA->PFN lookup failed.

I guess that's fine. But for this patch, user space is not doing anything.
It's just printing the -EFAULT error and dumping guest state (same as we do
in the case of a synchronous fault).

> Userspace also shouldn't have to
> query guest state to handle the error, as that won't work for protected guests
> guests like SEV-ES and TDX.

So qemu would not be able to dump vcpu register state when kvm returns
with -EFAULT for the case of SEV-ES and TDX?

> 
> I can think of two options:
> 
>   1. Send a signal, a la kvm_send_hwpoison_signal().

This works because -EHWPOISON is a special kind of error which is
different from -EFAULT. For truncation, kvm itself only gets -EFAULT.

if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
	return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
	return -EFAULT;

Anyway, if -EFAULT is too generic and we need something finer grained,
that can be looked into when we actually have a method for kvm/qemu
to inject the error into the guest.

> 
>   2. Add a userspace exit reason along with a new entry in the run struct,
>  e.g. that provides the bad GPA, HVA, possibly permissions, etc...

This sounds more reasonable to me. That is, kvm gives qemu additional
information about the failing HVA and GPA along with -EFAULT, and that
can be helpful in debugging a problem. This seems like an extension
of the KVM API.

Even with this, if we want to figure out which file got truncated, we
will need to take a dump of the guest and try to figure out which file
this GPA is currently mapping (by looking at virtiofs data structures).
And that becomes a little easier if the vcpu is running the task which
accessed that GPA. Anyway, if we have the failing GPA, I think it should
be possible to figure out the inode even without the accessing task being
current on the vcpu.

So we seem to have 3 options.

A. Just exit to user space with -EFAULT (using a kvm request) and don't
   wait for the accessing task to run on the vcpu again.

B. Store the error gfn in a hash and exit to user space when the task
   accessing that gfn runs again.

C. Extend the KVM API and report the failing HVA/GFN accessed by the guest.
   That should allow not having to exit to user space synchronously.
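
For option C, a rough sketch of what such an extension could look like (the
exit reason number and all names below are made up purely for illustration;
nothing like this exists today):

/* Hypothetical KVM API extension (illustration only). */
#define KVM_EXIT_MEMORY_FAULT	38	/* made-up exit reason number */

/* would become a new member of the union in struct kvm_run */
struct kvm_memory_fault_info {
	__u64 gpa;	/* guest physical address that could not be resolved */
	__u64 hva;	/* host virtual address it was mapped to */
	__u64 flags;	/* access type: read/write/exec, etc. */
};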

Thanks
Vivek



Re: virtiofs: WARN_ON(out_sgs + in_sgs != total_sgs)

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 10:04:27AM +0100, Stefan Hajnoczi wrote:
> On Sun, Oct 04, 2020 at 10:31:19AM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 10:44:37PM -0400, Qian Cai wrote:
> > > On Fri, 2020-10-02 at 12:28 -0400, Qian Cai wrote:
> > > > Running some fuzzing on virtiofs from a non-privileged user could 
> > > > trigger a
> > > > warning in virtio_fs_enqueue_req():
> > > > 
> > > > WARN_ON(out_sgs + in_sgs != total_sgs);
> > > 
> > > Okay, I can reproduce this after running for a few hours:
> > > 
> > > out_sgs = 3, in_sgs = 2, total_sgs = 6
> > 
> > Thanks. I can also reproduce it simply by calling.
> > 
> > ioctl(fd, 0x5a004000, buf);
> > 
> > I think following WARN_ON() is not correct.
> > 
> > WARN_ON(out_sgs + in_sgs != total_sgs)
> > 
> > total_sgs should actually be max sgs. It looks at ap->num_pages and
> > counts one sg for each page. And it assumes that same number of
> > pages will be used both for input and output.
> > 
> > But there are no such guarantees. With above ioctl() call, I noticed
> > we are using 2 pages for input (out_sgs) and one page for output (in_sgs).
> > 
> > So out_sgs=4, in_sgs=3 and total_sgs=8 and warning triggers.
> > 
> > I think total sgs is actually max number of sgs and warning
> > should probably be.
> > 
> > WARN_ON(out_sgs + in_sgs >  total_sgs)
> > 
> > Stefan, WDYT?
> 
> It should be possible to calculate total_sgs precisely (not a maximum).
> Treating it as a maximum could hide bugs.

I thought about calculating total_sgs precisely as well, then became a little
lazy. I will redo the patch and calculate total_sgs precisely.

> 
> Maybe sg_count_fuse_req() should count in_args/out_args[numargs -
> 1].size pages instead of adding ap->num_pages.

That should work, I guess. Will try.
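
Roughly something like this (sketch only, based on my reading of
struct fuse_page_desc; not the actual patch):

/*
 * Count how many page sg entries a payload of 'total_len' bytes actually
 * needs, instead of charging ap->num_pages to both directions.
 */
static unsigned int sg_count_fuse_pages(struct fuse_page_desc *descs,
					unsigned int num_pages,
					unsigned int total_len)
{
	unsigned int i, count = 0;

	for (i = 0; i < num_pages && total_len; i++) {
		count++;
		total_len -= min(total_len, descs[i].length);
	}
	return count;
}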

Vivek
> 
> Do you have the details of struct fuse_req and struct fuse_args_pages
> fields for the ioctl in question?

> 
> Thanks,
> Stefan




Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-05 Thread Vivek Goyal
On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> > > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > > I don't think it's necessary to provide userspace with the register 
> > > > > state of
> > > > > the guest task that hit the bad page.  Other than debugging, I don't 
> > > > > see how
> > > > > userspace can do anything useful which such information.
> > > > 
> > > > I think debugging is the whole point so that user can figure out which
> > > > access by guest task resulted in bad memory access. I would think this
> > > > will be important piece of information.
> > > 
> > > But isn't this failure due to a truncation in the host?  Why would we care
> > > about debugging the guest?  It hasn't done anything wrong, has it?  Or am 
> > > I
> > > misunderstanding the original problem statement.
> > 
> > I think you understood problem statement right. If guest has right
> > context, it just gives additional information who tried to access
> > the missing memory page. 
> 
> Yes, but it's not actionable, e.g. QEMU can't do anything differently given
> a guest RIP.  It's useful information for hands-on debug, but the information
> can be easily collected through other means when doing hands-on debug.

Hi Sean,

I tried my patch and truncated the file on the host before the guest did
memcpy(). After truncation the guest process tried memcpy() on the truncated
region and kvm exited to user space with -EFAULT. I see the following on the
serial console.

I am assuming qemu is printing the state of the vcpu.


error: kvm run failed Bad address
RAX=7fff6e7a9750 RBX= RCX=7f513927e000 RDX=a
RSI=7f513927e000 RDI=7fff6e7a9750 RBP=7fff6e7a97b0 RSP=7fff6e7a8
R8 = R9 =0031 R10=7fff6e7a957c R11=6
R12=00401140 R13= R14= R15=0
RIP=7f51391e0547 RFL=00010202 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =   00c0
CS =0033   00a0fb00 DPL=3 CS64 [-RA]
SS =002b   00c0f300 DPL=3 DS   [-WA]
DS =   00c0
FS = 7f5139246540  00c0
GS =   00c0
LDT=   
TR =0040 fe3a6000 4087 8b00 DPL=0 TSS64-busy
GDT= fe3a4000 007f
IDT= fe00 0fff
CR0=80050033 CR2=7f513927e004 CR3=00102b5eb805 CR4=00770ee0
DR0= DR1= DR2= DR3=
DR6=fffe0ff0 DR7=0400
EFER=0d01
Code=fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 <48> 8b 4c 16 3
*

I also changed my test program to print the source and destination addresses
for memcpy.

dst=0x0x7fff6e7a9750 src=0x0x7f513927e000

Here dst matches RDI and src matches RSI. This trace also tells me
CPL=3, so a user space access triggered this.
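
For reference, the test is essentially the following (the file path and the
sleep-based synchronization below are stand-ins, not the exact program):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char dst[10];
	int fd = open("/mnt/virtiofs/testfile", O_RDWR);
	char *src;

	if (fd < 0)
		return 1;
	src = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (src == MAP_FAILED)
		return 1;
	printf("dst=0x%p src=0x%p\n", (void *)dst, (void *)src);
	/* truncate the file on the host while this sleeps */
	sleep(30);
	/* this access hits the truncated region and triggers the -EFAULT exit */
	memcpy(dst, src, sizeof(dst));
	return 0;
}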

Now I have few questions.

- If we exit to user space asynchronously (using a kvm request), what debug
  information is in there which tells the user which address is bad? I admit
  that even the above trace does not seem to be telling me directly which
  address (HVA?) is bad.

  But if I take a crash dump of guest, using above information I should
  be able to get to GPA which is problematic. And looking at /proc/iomem
  it should also tell which device this memory region is in.

  Also using this crash dump one should be able to walk through virtiofs data
  structures and figure out which file and what offset with-in file does
  it belong to. Now one can look at filesystem on host and see file got
  truncated and it will become obvious it can't be faulted in. And then
  one can continue to debug that how did we arrive here.

But if we don't exit to user space synchronously, the only relevant
information we seem to have is -EFAULT. Apart from that, how does one
figure out what address is bad, who tried to access it, or which
file/offset it belongs to, etc.?

I agree that the problem is not necessarily in guest code. But by exiting
synchronously, we get enough information that one can use a crash
dump to get to the bottom of the issue. If we exit to user space
asynchronously, all this information will be lost and it might make
it very hard to figure out 

Re: virtiofs: WARN_ON(out_sgs + in_sgs != total_sgs)

2020-10-04 Thread Vivek Goyal
On Fri, Oct 02, 2020 at 10:44:37PM -0400, Qian Cai wrote:
> On Fri, 2020-10-02 at 12:28 -0400, Qian Cai wrote:
> > Running some fuzzing on virtiofs from a non-privileged user could trigger a
> > warning in virtio_fs_enqueue_req():
> > 
> > WARN_ON(out_sgs + in_sgs != total_sgs);
> 
> Okay, I can reproduce this after running for a few hours:
> 
> out_sgs = 3, in_sgs = 2, total_sgs = 6

Thanks. I can also reproduce it simply by calling.

ioctl(fd, 0x5a004000, buf);

I think following WARN_ON() is not correct.

WARN_ON(out_sgs + in_sgs != total_sgs)

total_sgs should actually be the max sgs. It looks at ap->num_pages and
counts one sg for each page. And it assumes that the same number of
pages will be used both for input and output.

But there are no such guarantees. With the above ioctl() call, I noticed
we are using 2 pages for input (out_sgs) and one page for output (in_sgs).

So out_sgs=4, in_sgs=3, total_sgs=8 and the warning triggers.

I think total_sgs is actually the max number of sgs and the warning
should probably be:

WARN_ON(out_sgs + in_sgs >  total_sgs)

Stefan, WDYT?

I will send a patch for this.

Thanks
Vivek



> 
> and this time from flush_bg_queue() instead of fuse_simple_request().
> 
> From the log, the last piece of code is:
> 
> ftruncate(fd=186, length=4)
> 
> which is a test file on virtiofs:
> 
> [main]  testfile fd:186 filename:trinity-testfile3 flags:2 fopened:1 
> fcntl_flags:2000 global:1
> [main]   start: 0x7f47c1199000 size:4KB  name: trinity-testfile3 global:1
> 
> 
> [ 9863.468502] WARNING: CPU: 16 PID: 286083 at fs/fuse/virtio_fs.c:1152 
> virtio_fs_enqueue_req+0xd36/0xde0 [virtiofs]
> [ 9863.474442] Modules linked in: dlci 8021q garp mrp bridge stp llc 
> ieee802154_socket ieee802154 vsock_loopback vmw_vsock_virtio_transport_common 
> vmw_vsock_vmci_transport vsock mpls_router vmw_vmci ip_tunnel as
> [ 9863.474555]  ata_piix fuse serio_raw libata e1000 sunrpc dm_mirror 
> dm_region_hash dm_log dm_mod
> [ 9863.535805] CPU: 16 PID: 286083 Comm: trinity-c5 Kdump: loaded Not tainted 
> 5.9.0-rc7-next-20201002+ #2
> [ 9863.544368] Hardware name: Red Hat KVM, BIOS 
> 1.14.0-1.module+el8.3.0+7638+07cf13d2 04/01/2014
> [ 9863.550129] RIP: 0010:virtio_fs_enqueue_req+0xd36/0xde0 [virtiofs]
> [ 9863.552998] Code: 60 09 23 d9 e9 44 fa ff ff e8 56 09 23 d9 e9 70 fa ff ff 
> 48 89 cf 48 89 4c 24 08 e8 44 09 23 d9 48 8b 4c 24 08 e9 7c fa ff ff <0f> 0b 
> 48 c7 c7 c0 85 60 c0 44 89 e1 44 89 fa 44 89 ee e8 e3 b7
> [ 9863.561720] RSP: 0018:888a696ef6f8 EFLAGS: 00010202
> [ 9863.565420] RAX:  RBX: 88892e030008 RCX: 
> 
> [ 9863.568735] RDX: 0005 RSI:  RDI: 
> 888a696ef8ac
> [ 9863.572037] RBP: 888a49d03d30 R08: ed114d2ddf18 R09: 
> 888a696ef8a0
> [ 9863.575383] R10: 888a696ef8bf R11: ed114d2ddf17 R12: 
> 0006
> [ 9863.578668] R13: 0003 R14: 0002 R15: 
> 0002
> [ 9863.581971] FS:  7f47c12f5740() GS:888a7f80() 
> knlGS:
> [ 9863.585752] CS:  0010 DS:  ES:  CR0: 80050033
> [ 9863.590232] CR2:  CR3: 000a63570005 CR4: 
> 00770ee0
> [ 9863.594698] DR0: 7f6642e43000 DR1:  DR2: 
> 
> [ 9863.598521] DR3:  DR6: 0ff0 DR7: 
> 0600
> [ 9863.601861] PKRU: 5540
> [ 9863.603173] Call Trace:
> [ 9863.604382]  ? virtio_fs_probe+0x13e0/0x13e0 [virtiofs]
> [ 9863.606838]  ? is_bpf_text_address+0x21/0x30
> [ 9863.608869]  ? kernel_text_address+0x125/0x140
> [ 9863.610962]  ? __kernel_text_address+0xe/0x30
> [ 9863.613117]  ? unwind_get_return_address+0x5f/0xa0
> [ 9863.615427]  ? create_prof_cpu_mask+0x20/0x20
> [ 9863.617435]  ? _raw_write_lock_irqsave+0xe0/0xe0
> [ 9863.619627]  virtio_fs_wake_pending_and_unlock+0x1ea/0x610 [virtiofs]
> [ 9863.622638]  ? queue_request_and_unlock+0x115/0x280 [fuse]
> [ 9863.625224]  flush_bg_queue+0x24c/0x3e0 [fuse]
> [ 9863.627325]  fuse_simple_background+0x3d7/0x6c0 [fuse]
> [ 9863.629735]  fuse_send_writepage+0x173/0x420 [fuse]
> [ 9863.632031]  fuse_flush_writepages+0x1fe/0x330 [fuse]
> [ 9863.634463]  ? make_kgid+0x13/0x20
> [ 9863.636064]  ? fuse_change_attributes_common+0x2de/0x940 [fuse]
> [ 9863.638850]  fuse_do_setattr+0xe84/0x13c0 [fuse]
> [ 9863.641024]  ? migrate_swap_stop+0x8d1/0x920
> [ 9863.643041]  ? fuse_flush_times+0x390/0x390 [fuse]
> [ 9863.645347]  ? avc_has_perm_noaudit+0x390/0x390
> [ 9863.647465]  fuse_setattr+0x197/0x400 [fuse]
> [ 9863.649466]  notify_change+0x744/0xda0
> [ 9863.651247]  ? __down_timeout+0x2a0/0x2a0
> [ 9863.653125]  ? do_truncate+0xe2/0x180
> [ 9863.654854]  do_truncate+0xe2/0x180
> [ 9863.656509]  ? __x64_sys_openat2+0x1c0/0x1c0
> [ 9863.658512]  ? alarm_setitimer+0xa0/0x110
> [ 9863.660418]  do_sys_ftruncate+0x1ee/0x2c0
> [ 9863.662311]  do_syscall_64+0x33/0x40
> [ 9863.663980]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9863.666384] RIP: 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Vivek Goyal
On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > > > Alternatively, what about adding a new KVM request type to handle 
> > > > > this?
> > > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > > > request.  The vCPU then gets kicked and exits to userspace.  Before 
> > > > > exiting
> > > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  
> > > > > Bad GFNs
> > > > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > > > 
> > > > Sorry, I did not understand the above proposal. Can you please elaborate
> > > > a bit more. Part of it is that I don't know much about KVM requests.
> > > > Looking at the code it looks like that main loop is parsing if some
> > > > kvm request is pending and executing that action.
> > > > 
> > > > Don't we want to make sure that we exit to user space when guest retries
> > > > error gfn access again.
> > > 
> > > > In this case once we get -EFAULT, we will still inject page_ready into
> > > > guest. And then either same process or a different process might run. 
> > > > 
> > > > So when exactly code raises a kvm request. If I raise it right when
> > > > I get -EFAULT, then kvm will exit to user space upon next entry
> > > > time. But there is no guarantee guest vcpu is running the process which
> > > > actually accessed the error gfn. And that probably means that register
> > > > state of cpu does not mean much and one can not easily figure out
> > > > which task tried to access the bad memory and when.
> > > > 
> > > > That's why we prepare a list of error gfn and only exit to user space
> > > > when error_gfn access is retried so that guest vcpu context is correct.
> > > > 
> > > > What am I missing?
> > > 
> > > I don't think it's necessary to provide userspace with the register state 
> > > of
> > > the guest task that hit the bad page.  Other than debugging, I don't see 
> > > how
> > > userspace can do anything useful which such information.
> > 
> > I think debugging is the whole point so that user can figure out which
> > access by guest task resulted in bad memory access. I would think this
> > will be important piece of information.
> 
> But isn't this failure due to a truncation in the host?  Why would we care
> about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
> misunderstanding the original problem statement.

I think you understood the problem statement right. If the guest has the right
context, it just gives additional information about who tried to access
the missing memory page. 

> 
> > > To fully handle the situation, the guest needs to remove the bad page from
> > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > handling will kick in when a task accesses the bad page (or nothing ever
> > > touches the bad page again and everyone is happy).
> > 
> > This is not really a case of bad page as such. It is more of a page
> > gone missing/trucated. And no new user can map it. We just need to
> > worry about existing users who already have it mapped.
> 
> What do you mean by "no new user can map it"?  Are you talking about guest
> tasks or host tasks?  If guest tasks, how would the guest know the page is
> missing and thus prevent mapping the non-existent page?

If a new task wants to mmap(), it will send a request to virtiofsd/qemu
on the host. If the file has been truncated, then mapping beyond the file size
will fail and the process will get an error. So it will not be able to
map a page which has been truncated.

> 
> > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> > > to handle this, but I suspect the resulting behavior will look quite 
> > > similar,
> > > e.g. notify the virtiofs driver in the guest, which does some magic to 
> > > take
> > > the offending region offline, and then guest tasks get SIGBUS or whatever.
> > > 
> > > I also don't think it's KVM's responsibility to _directly_ handle such a
> > > scenario.  As I said in an earlier version, KVM

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Vivek Goyal
On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > Alternatively, what about adding a new KVM request type to handle this?
> > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > request.  The vCPU then gets kicked and exits to userspace.  Before 
> > > exiting
> > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad 
> > > GFNs
> > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > 
> > Sorry, I did not understand the above proposal. Can you please elaborate
> > a bit more. Part of it is that I don't know much about KVM requests.
> > Looking at the code it looks like that main loop is parsing if some
> > kvm request is pending and executing that action.
> > 
> > Don't we want to make sure that we exit to user space when guest retries
> > error gfn access again.
> 
> > In this case once we get -EFAULT, we will still inject page_ready into
> > guest. And then either same process or a different process might run. 
> > 
> > So when exactly code raises a kvm request. If I raise it right when
> > I get -EFAULT, then kvm will exit to user space upon next entry
> > time. But there is no guarantee guest vcpu is running the process which
> > actually accessed the error gfn. And that probably means that register
> > state of cpu does not mean much and one can not easily figure out
> > which task tried to access the bad memory and when.
> > 
> > That's why we prepare a list of error gfn and only exit to user space
> > when error_gfn access is retried so that guest vcpu context is correct.
> > 
> > What am I missing?
> 
> I don't think it's necessary to provide userspace with the register state of
> the guest task that hit the bad page.  Other than debugging, I don't see how
> userspace can do anything useful which such information.

I think debugging is the whole point, so that the user can figure out which
access by a guest task resulted in the bad memory access. I would think this
will be an important piece of information.

> 
> Even if you want to inject an event of some form into the guest, having the
> correct context for the event itself is not required.  IMO it's perfectly
> reasonable for such an event to be asynchronous.
> 
> IIUC, your end goal is to be able to gracefully handle DAX file truncation.
> Simply killing the guest task that hit the bad page isn't sufficient, as
> nothing prevents a future task from accessing the same bad page.

The next task can't even mmap that page; mmap will fail. The file got
truncated, so that page does not exist.

So sending SIGBUS to the task should definitely solve the problem. We also
need to solve the issue of the guest kernel accessing a page which got
truncated on the host. In that case we need to use the right memcpy helpers
and exception table magic, and return an error code to user space.
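
To illustrate the kernel-access part, what I have in mind is roughly the
following; copy_mc_to_kernel() (memcpy_mcsafe() in older kernels) is used
here only as a stand-in for whatever exception-table-backed helper we end up
using, and the wrapper name is made up:

/*
 * Sketch only: kernel-mode access to the DAX window goes through a copy
 * helper that has an exception table entry, so a fault on a truncated
 * page becomes an error return instead of an oops.
 */
static int fuse_dax_copy_from_window(void *dst, const void *src, size_t len)
{
	unsigned long left;

	left = copy_mc_to_kernel(dst, src, len);	/* bytes not copied */
	if (left)
		return -EIO;
	return 0;
}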

> To fully
> handle the situation, the guest needs to remove the bad page from its memory
> pool.  Once the page is offlined, the guest kernel's error handling will
> kick in when a task accesses the bad page (or nothing ever touches the bad
> page again and everyone is happy).

This is not really a case of a bad page as such. It is more of a page
gone missing/truncated. And no new user can map it. We just need to
worry about existing users who already have it mapped.

> 
> Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> to handle this, but I suspect the resulting behavior will look quite similar,
> e.g. notify the virtiofs driver in the guest, which does some magic to take
> the offending region offline, and then guest tasks get SIGBUS or whatever.
> 
> I also don't think it's KVM's responsibility to _directly_ handle such a
> scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
> page fault came back with -EFAULT, only userspace can connect the dots of
> GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
> exit to userspace on the -EFAULT instead of hanging the guest, but that can
> be done via a new request, as suggested.

KVM at least should have the mechanism to report this back to the guest. And
we don't have any. There only seems to be the #MC stuff for poisoned pages.
I am not sure how much we can build on top of the #MC stuff to take care
of this case. One problem with #MC I found was that it generates a
synchronous #MC only on load and not on store. So all the code is
written in such a way that a synchronous #MC can happen only on load,
and hence the error handling. 

Stores generate different kind of #MC th

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Vivek Goyal
On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote:
> > On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, 
> > > > unsigned long rflags)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> > > >  
> > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > > > +{
> > > > +   BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > > > +
> > > > +   return hash_32(gfn & 0x, 
> > > > order_base_2(ERROR_GFN_PER_VCPU));
> > > > +}
> > > > +
> > > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > > +{
> > > > +   u32 key = kvm_error_gfn_hash_fn(gfn);
> > > > +
> > > > +   /*
> > > > +* Overwrite the previous gfn. This is just a hint to do
> > > > +* sync page fault.
> > > > +*/
> > > > +   vcpu->arch.apf.error_gfns[key] = gfn;
> > > > +}
> > > > +
> > > > +/* Returns true if gfn was found in hash table, false otherwise */
> > > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t 
> > > > gfn)
> > > > +{
> > > > +   u32 key = kvm_error_gfn_hash_fn(gfn);
> > > 
> > > Mostly out of curiosity, do we really need a hash?  E.g. could we get away
> > > with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
> > > bytes per CPU.
> > 
> > We are relying on returning error when guest task retries fault. Fault
> > will be retried on same address if same task is run by vcpu after
> > "page ready" event. There is no guarantee that same task will be
> > run. In theory, this cpu could have a large number of tasks queued
> > and run these tasks before the faulting task is run again. Now say
> > there are 128 tasks being run and 32 of them have page fault
> > errors. Then if we keep 4 values, newer failures will simply
> > overwrite older failures and we will keep spinning instead of
> > exiting to user space.
> > 
> > That's why this array of 64 gfns and add gfns based on hash. This
> > does not completely eliminate the above possibility but chances
> > of one hitting this are very very slim.
> 
> But have you actually tried such a scenario?  In other words, is there good
> justification for burning the extra memory?

It's not easy to try and reproduce, so it is all theory at this point in time.
If you are worried about memory usage, we can probably reduce the size
of the hash table, say from 64 down to 8. I am fine with that. I think
initially I had a single error_gfn, but Vitaly had concerns about the
above scenario, so I implemented a hash table.

I think reducing the hash table size to 8 or 16 is probably a good middle
ground.

> 
> Alternatively, what about adding a new KVM request type to handle this?
> E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> simply get if error_gfn is "valid", i.e. there's a pending request.

Sorry, I did not understand the above proposal. Can you please elaborate
a bit more? Part of it is that I don't know much about KVM requests.
Looking at the code, it looks like the main loop checks whether some
kvm request is pending and executes that action.
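
My rough reading of the flow, with a made-up request name, is something like:

/* where the async PF completes with -EFAULT: */
vcpu->arch.apf.error_gfn = gfn;
kvm_make_request(KVM_REQ_APF_ERROR, vcpu);	/* KVM_REQ_APF_ERROR is made up */

/* and in vcpu_enter_guest(), before re-entering the guest: */
if (kvm_check_request(KVM_REQ_APF_ERROR, vcpu)) {
	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;	/* or a new exit reason */
	return 0;	/* exit to user space */
}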

Don't we want to make sure that we exit to user space when the guest retries
the error gfn access again? In this case, once we get -EFAULT, we will still
inject page_ready into the guest. And then either the same process or a
different process might run.

So when exactly does the code raise a kvm request? If I raise it right when
I get -EFAULT, then kvm will exit to user space upon the next entry.
But there is no guarantee the guest vcpu is running the process which
actually accessed the error gfn. And that probably means that the register
state of the cpu does not mean much and one cannot easily figure out
which task tried to access the bad memory and when.

That's why we prepare a list of error gfns and only exit to user space
when the error_gfn access is retried, so that the guest vcpu context is correct.

What am I missing?

Thanks
Vivek

> 
> That would guarantee the error is propagated to userspace, and doesn't lose
> any guest information as dropping error GFNs just means the guest will take
>

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-01 Thread Vivek Goyal
On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu.h  |  2 +-
> >  arch/x86/kvm/mmu/mmu.c  |  2 +-
> >  arch/x86/kvm/x86.c  | 54 +++--
> >  include/linux/kvm_types.h   |  1 +
> >  5 files changed, 56 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index be5363b21540..e6f8d3f1a377 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> > base_gfn, int level)
> >  #define KVM_NR_VAR_MTRR 8
> >  
> >  #define ASYNC_PF_PER_VCPU 64
> > +#define ERROR_GFN_PER_VCPU 64
> 
> Aren't these two related?  I.e. wouldn't it make sense to do:

Hi,

They are somewhat related but they don't have to be the same. I think we
can accumulate many more error GFNs if this vcpu does not schedule
the same task again to retry.

>   #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU
> 
> Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU?

Given these two don't have to be the same, I kept them separate. And I kept
the hash size the same for now. If one wants, the hash can be made bigger or
smaller down the line.

> 
> >  
> >  enum kvm_reg {
> > VCPU_REGS_RAX = __VCPU_REGS_RAX,
> > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
> > unsigned long nested_apf_token;
> > bool delivery_as_pf_vmexit;
> > bool pageready_pending;
> > +   gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> > } apf;
> >  
> > /* OSVW MSRs (AMD only) */
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 444bb9c54548..d0a2a12c7bb6 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool 
> > reset_roots);
> >  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 
> > efer);
> >  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> >  bool accessed_dirty, gpa_t new_eptp);
> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > u64 fault_address, char *insn, int insn_len);
> >  
> 
> ...
> 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 88c593f83b28..c1f5094d6e53 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct 
> > kvm_vcpu *vcpu)
> > vcpu->arch.apf.gfns[i] = ~0;
> >  }
> >  
> > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> > +{
> > +   int i;
> 
> Need a newline.   

Will do.

> 
> > +   for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> > +   vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> > +}
> > +
> >  static void kvm_on_user_return(struct user_return_notifier *urn)
> >  {
> > unsigned slot;
> > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
> >  
> > kvm_async_pf_hash_reset(vcpu);
> > +   kvm_error_gfn_hash_reset(vcpu);
> > kvm_pmu_init(vcpu);
> >  
> > vcpu->arch.pending_external_vector = -1;
> > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
> > init_event)
> >  
> > kvm_clear_async_pf_completion_queue(vcpu);
> > kvm_async_pf_hash_reset(vcpu);
> > +   kvm_error_gfn_hash_reset(vcpu);
> > vcpu->arch.apf.halted = false;
> >  
> > if (kvm_mpx_supported()) {
> > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, 
> > unsigned long rflags)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> >  
> > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > +{
> > +   BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > +
> > +   return hash_32(gfn & 0x, order_base_2(ERROR_GFN_PER_VCPU));
> > +}
> > +
> > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > +   u32 key = kvm_error_gfn_hash_fn(gfn);
> > +
> > +   /*
> > +* Overwrite the previous gfn. This is just a hint to do
>

Re: [PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature

2020-09-24 Thread Vivek Goyal
On Thu, Sep 24, 2020 at 05:44:22AM +0300, Amir Goldstein wrote:
> On Wed, Sep 23, 2020 at 10:47 PM Vivek Goyal  wrote:
> >
> > On Wed, Sep 23, 2020 at 06:23:08PM +0300, Pavel Tikhomirov wrote:
> > > This relaxes uuid checks for overlay index feature. It is only possible
> > > in case there is only one filesystem for all the work/upper/lower
> > > directories and bare file handles from this backing filesystem are uniq.
> >
> > Hi Pavel,
> >
> > Wondering why upper/work has to be on same filesystem as lower for this to
> > work?
> >
> 
> I reckon that's because I asked for this constraint, so I will answer.
> 
> You are right that the important thing is that all lower layers are
> on the same fs, but because of
>   a888db310195 ovl: fix regression with re-formatted lower squashfs

Hi Amir,

So with "upper on same as lower fs" contstraint we are just making it
little harder so that people don't use recreated lower with existing
upper? Is that the intention behind this constraint.

On a side note, I have a question about above commit. 

So this is basically the issue of an upper-stored file handle resolving to
a different file (in a recreated lower). And we are considering this to
be a corner case. But given that a user was actually running into it, it
probably is not that hard to reproduce. So with the fix a888db310195,
we avoided the problem for simple configurations (no index, no metacopy,
and no xino). But if the same user runs with index=on, with a recreated lower,
they can still run into similar issues?

> 
> I preferred to keep the rules simpler.
> 
> Pavel's use case is clone of disk and change of its UUID.
> This is a real use case and I don't think it is unique to Virtuozzo,
> so I wanted index=nouuid to address that use case only and
> I prefer that it is documented that way too.

Sure. I understand that. I am only harping on this to make sure
we tell people not to use this "recreated lower with existing upper" setup.
In Pavel's use case, it is more of a cloned use case and not a
re-created one.

Otherwise people will re-create lower layers with regular filesystems and
use index=nouuid and then run into a squashfs-like issue one day.

Or we could document what Miklos had said: using an existing upper
with a recreated lower will likely run into issues with advanced
overlay features (index, metacopy, xino, etc.).

> 
> Ironically, one of the justifications for index=nouuid is virtiofs -
> because fuse is now allowed as upper (or as same fs),
> one can already use fuse passthough (or one could use fuse
> passthrough when nfs export works correctly) as a "uuid anonymizer"
> for any fs, so in practice, index=nouuid cannot do any more harm
> then one can already do when enabling index over virtiofs.

Interesting. Using virtiofs or a fuse passthrough filesystem on top
just to avoid the uuid check would be a lot of work.

But the constraint of keeping upper/ on the same fs as the lower fs does
not help with this.

> 
> That is why I prefer the interpretation that index=nouuid means
> "use null uuid instead of s_uuid for ovl_fh" over the interpretation
> "relax comparison of uuid in ovl_fh".

So the bottom line is that there are many ways users can recreate
lower layers and run into issues:

- squashfs with index
- use a fuse passthrough filesystem
- use index=nouuid

So to me, documenting "don't use an existing upper with a recreated lower"
should help with all of them.

And putting a constraint of "lower and upper being on the same fs" seems fine
for now, but I am not sure it helps a lot. Anyway, I am fine with this
constraint. Just wanted to understand the rationale behind it.

Thanks
Vivek



Re: [PATCH v2] ovl: introduce new "index=nouuid" option for inodes index feature

2020-09-23 Thread Vivek Goyal
On Wed, Sep 23, 2020 at 06:23:08PM +0300, Pavel Tikhomirov wrote:
> This relaxes uuid checks for overlay index feature. It is only possible
> in case there is only one filesystem for all the work/upper/lower
> directories and bare file handles from this backing filesystem are uniq.

Hi Pavel,

Wondering why upper/work has to be on the same filesystem as lower for this to
work?

> In case we have multiple filesystems here just fall back to normal
> "index=on".
> 
> This is needed when overlayfs is/was mounted in a container with
> index enabled (e.g.: to be able to resolve inotify watch file handles on
> it to paths in CRIU), and this container is copied and started alongside
> with the original one. This way the "copy" container can't have the same
> uuid on the superblock and mounting the overlayfs from it later would
> fail.
> 
> That is an example of the problem on top of loop+ext4:
> 
> dd if=/dev/zero of=loopbackfile.img bs=100M count=10
> losetup -fP loopbackfile.img
> losetup -a
>   #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
> mkfs.ext4 loopbackfile.img
> mkdir loop-mp
> mount -o loop /dev/loop0 loop-mp
> mkdir loop-mp/{lower,upper,work,merged}
> mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
> upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
> umount loop-mp/merged
> umount loop-mp
> e2fsck -f /dev/loop0
> tune2fs -U random /dev/loop0
> 
> mount -o loop /dev/loop0 loop-mp
> mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
> upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
>   #mount: /loop-test/loop-mp/merged:
>   #mount(2) system call failed: Stale file handle.
> 
> mount -t overlay overlay -oindex=nouuid,lowerdir=loop-mp/lower,\
> upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
> 
> If you just change the uuid of the backing filesystem, overlay is not
> mounting any more. In Virtuozzo we copy container disks (ploops) when
> crate the copy of container and we require fs uuid to be uniq for a new
> container.
> 
> v2: in v1 I missed actual uuid check skip - add it
> 
> CC: Amir Goldstein 
> CC: Vivek Goyal 
> CC: Miklos Szeredi 
> CC: linux-unio...@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> Signed-off-by: Pavel Tikhomirov 
> ---
>  fs/overlayfs/Kconfig | 16 +++
>  fs/overlayfs/export.c|  6 ++--
>  fs/overlayfs/namei.c | 35 +++
>  fs/overlayfs/overlayfs.h | 23 +++
>  fs/overlayfs/ovl_entry.h |  2 +-
>  fs/overlayfs/super.c | 61 +---
>  6 files changed, 106 insertions(+), 37 deletions(-)

Please put something in the Documentation file to explain this option. Also
explain where it is safe to use and where it is not. I am concerned
about the case where recreating the lower can result in a file handle
resolving to a different file (as discussed in the other thread). I think it
would be nice if we mention that explicitly.

Thanks
Vivek

> 
> diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
> index dd188c7996b3..b00fd44006f9 100644
> --- a/fs/overlayfs/Kconfig
> +++ b/fs/overlayfs/Kconfig
> @@ -61,6 +61,22 @@ config OVERLAY_FS_INDEX
>  
> If unsure, say N.
>  
> +config OVERLAY_FS_INDEX_NOUUID
> + bool "Overlayfs: relax uuid checks of inodes index feature"
> + depends on OVERLAY_FS
> + depends on OVERLAY_FS_INDEX
> + help
> +   If this config option is enabled then overlay will skip uuid checks
> +   for index lower to upper inode map, this only can be done if all
> +   upper and lower directories are on the same filesystem where basic
> +   fhandles are uniq.
> +
> +   It is needed to overcome possible change of uuid on superblock of the
> +   backing filesystem, e.g. when you copied the virtual disk and mount
> +   both the copy of the disk and the original one at the same time.
> +
> +   If unsure, say N.
> +
>  config OVERLAY_FS_NFS_EXPORT
>   bool "Overlayfs: turn on NFS export feature by default"
>   depends on OVERLAY_FS
> diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
> index 0e696f72cf65..d53feb7547d9 100644
> --- a/fs/overlayfs/export.c
> +++ b/fs/overlayfs/export.c
> @@ -676,11 +676,12 @@ static struct dentry *ovl_upper_fh_to_d(struct 
> super_block *sb,
>   struct ovl_fs *ofs = sb->s_fs_info;
>   struct dentry *dentry;
>   struct dentry *upper;
> + bool nouuid = ofs->config.index == OVL_INDEX_NOUUID;
>  
>   if (!ovl_upper_mnt(ofs))
>   return ERR_PTR(-EACCES);
>  
> - upper = ovl_decode_real_fh(fh, ovl_upper_mnt(ofs), true);
> + upper = ovl_decode_real_fh(fh, ovl_up

Re: linux-next: build failure after merge of the akpm-current tree

2020-09-08 Thread Vivek Goyal
On Tue, Sep 08, 2020 at 08:09:50PM +1000, Stephen Rothwell wrote:

[..]
> fs/fuse/virtio_fs.c: In function 'virtio_fs_setup_dax':
> fs/fuse/virtio_fs.c:838:9: error: 'struct dev_pagemap' has no member named 
> 'res'; did you mean 'ref'?
>   838 |  pgmap->res = (struct resource){
>   | ^~~
>   | ref
> 
> Caused by commit
> 
>   b3e022c5a68c ("mm/memremap_pages: convert to 'struct range'")
> 
> interacting with commit
> 
>   9e2369c06c8a ("xen: add helpers to allocate unpopulated memory")
> 
> from Linus' tree (in v5.9-rc4) and commit
> 
>   7e833303db20 ("virtiofs: set up virtio_fs dax_device")
> 
> from the fuse tree.
> 
> I have added the following patch which may require more work but at
> least makes it all build.
> 
> From: Stephen Rothwell 
> Date: Tue, 8 Sep 2020 20:00:20 +1000
> Subject: [PATCH] merge fix up for "mm/memremap_pages: convert to 'struct
>  range'"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  drivers/xen/unpopulated-alloc.c | 15 +--
>  fs/fuse/virtio_fs.c |  3 +--
>  2 files changed, 10 insertions(+), 8 deletions(-)
> 

[..]
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index da3ede268604..8f27478497fa 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -835,8 +835,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> *vdev, struct virtio_fs *fs)
>* initialize a struct resource from scratch (only the start
>* and end fields will be used).
>*/
> - pgmap->res = (struct resource){
> - .name = "virtio-fs dax window",
> + pgmap->range = (struct range){
>   .start = (phys_addr_t) cache_reg.addr,
>   .end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
>   };

Thanks Stephen. This change looks good to me for virtiofs.

Thanks
Vivek



Re: [PATCH v3 00/18] virtiofs: Add DAX support

2020-08-28 Thread Vivek Goyal
On Fri, Aug 28, 2020 at 04:26:55PM +0200, Miklos Szeredi wrote:
> On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal  wrote:
> >
> > Hi All,
> >
> > This is V3 of patches. I had posted version v2 version here.
> 
> Pushed to:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#dax
> 
> Fixed a couple of minor issues, and added two patches:
> 
> 1. move dax specific code from fuse core to a separate source file
> 
> 2. move dax specific data, as well as allowing dax to be configured out
> 
> I think it would be cleaner to fold these back into the original
> series, but for now I'm just asking for comments and testing.

Thanks Miklos. I will have a look and test.

Vivek



Re: [PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-26 Thread Vivek Goyal
On Wed, Aug 26, 2020 at 09:26:29PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 26, 2020 at 9:17 PM Dr. David Alan Gilbert
>  wrote:
> 
> > Agreed, because there's not much that the server can do about it if the
> > client would like a smaller granularity - the servers granularity might
> > be dictated by it's mmap/pagesize/filesystem.  If the client wants a
> > larger granularity that's it's choice when it sends the setupmapping
> > calls.
> 
> What bothers me is that the server now comes with the built in 2MiB
> granularity (obviously much larger than actually needed).
> 
> What if at some point we'd want to reduce that somewhat in the client?
>   Yeah, we can't.   Maybe this is not a kernel problem after all, the
> proper thing would be to fix the server to actually send something
> meaningful.

Hi Miklos,

The current implementation of virtiofsd reports this map alignment as
PAGE_SIZE:

/* This constraint comes from mmap(2) and munmap(2) */
outarg.map_alignment = ffsl(sysconf(_SC_PAGE_SIZE)) - 1;

Which should be 4K on x86. 
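
Worked out (assuming the usual 4K page size):

/* sysconf(_SC_PAGE_SIZE) == 4096 == 1 << 12, ffsl(4096) == 13,
 * so map_alignment == 12, i.e. log2(4K). */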

And that means, if the client wants, it can drop to a dax mapping size as
small as 4K and still meet the alignment constraints. It is just that by
default we have chosen 2MB for now, fearing there might be too
many small mmap() calls on the host and we will hit various limits.

Thanks
Vivek



Re: [PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-26 Thread Vivek Goyal
On Wed, Aug 26, 2020 at 04:06:35PM +0200, Miklos Szeredi wrote:
> On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal  wrote:
> >
> > The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
> > constraints via the FUSE_INIT map_alignment field.  Parse this field and
> > ensure our DAX mappings meet the alignment constraints.
> >
> > We don't actually align anything differently since our mappings are
> > already 2MB aligned.  Just check the value when the connection is
> > established.  If it becomes necessary to honor arbitrary alignments in
> > the future we'll have to adjust how mappings are sized.
> >
> > The upshot of this commit is that we can be confident that mappings will
> > work even when emulating x86 on Power and similar combinations where the
> > host page sizes are different.
> >
> > Signed-off-by: Stefan Hajnoczi 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/fuse/fuse_i.h  |  5 -
> >  fs/fuse/inode.c   | 18 --
> >  include/uapi/linux/fuse.h |  4 +++-
> >  3 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 478c940b05b4..4a46e35222c7 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -47,7 +47,10 @@
> >  /** Number of dentries for each connection in the control filesystem */
> >  #define FUSE_CTL_NUM_DENTRIES 5
> >
> > -/* Default memory range size, 2MB */
> > +/*
> > + * Default memory range size.  A power of 2 so it agrees with common 
> > FUSE_INIT
> > + * map_alignment values 4KB and 64KB.
> > + */
> >  #define FUSE_DAX_SZ(2*1024*1024)
> >  #define FUSE_DAX_SHIFT (21)
> >  #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index b82eb61d63cc..947abdd776ca 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, 
> > struct fuse_args *args,
> >  {
> > struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> > struct fuse_init_out *arg = &ia->out;
> > +   bool ok = true;
> >
> > if (error || arg->major != FUSE_KERNEL_VERSION)
> > -   fc->conn_error = 1;
> > +   ok = false;
> > else {
> > unsigned long ra_pages;
> >
> > @@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, 
> > struct fuse_args *args,
> > min_t(unsigned int, 
> > FUSE_MAX_MAX_PAGES,
> > max_t(unsigned int, arg->max_pages, 
> > 1));
> > }
> > +   if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
> > +   (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
> 
> This just obfuscates "arg->map_alignment != FUSE_DAX_SHIFT".
> 
> So the intention was that userspace can ask the kernel for a
> particular alignment, right?

My understanding is that the device will specify the alignment for
the foffset/moffset fields in fuse_setupmapping_in/fuse_removemapping_one,
and the DAX mapping can be any size meeting that alignment constraint.

> 
> In that case kernel can definitely succeed if the requested alignment
> is smaller than the kernel provided one, no? 

Yes. So if map_alignment is 64K and the DAX mapping size is 2MB, that's just
fine because it meets the 64K alignment constraint. Just that we can't use
a 4K DAX mapping size in that case.
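
To make the client-side check concrete (numbers just for illustration):

/*
 * FUSE_DAX_SZ = 2MB = 1 << 21, and the FUSE_INIT check is
 * FUSE_DAX_SZ % (1ul << map_alignment):
 *
 *   map_alignment = 12 (4K)  -> 2MB % 4K  == 0 -> accepted
 *   map_alignment = 16 (64K) -> 2MB % 64K == 0 -> accepted
 *   map_alignment = 22 (4MB) -> 2MB % 4MB != 0 -> connection error
 */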

> It would also make
> sense to make this a two way negotiation.  I.e. send the largest
> alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
> provide in fuse_init_in.   In that case the only error would be if
> userspace ignored the given constraints.

We could make it a two-way negotiation if it helps. Say we support
multiple mapping sizes in the future: 4K, 64K, 2MB, 1GB. So the idea is
to send the alignment of the largest mapping size (1GB in this case) to the
device/user space? And that will allow the device to choose an alignment
which best fits its needs?

But the problem here is that sending log2(1GB) does not mean we support
all the alignments in that range. For example, if the device selects, say,
256MB as the minimum alignment, the kernel might not support it.

So there seem to be two ways to handle this.

A. Let the device be conservative and always specify the minimum alignment
   it can work with, and let the guest kernel automatically choose a mapping
   size which meets that min_alignment constraint.

B. Send all the mapping sizes supported by kernel 

Re: [PATCH v3 02/18] dax: Create a range version of dax_layout_busy_page()

2020-08-20 Thread Vivek Goyal
On Thu, Aug 20, 2020 at 02:58:55PM +0200, Jan Kara wrote:
[..]
> >  /**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > + * dax_layout_busy_page_range - find first pinned page in @mapping
> >   * @mapping: address space to scan for a page with ref count > 1
> 
> Please document additional function arguments in the kernel-doc comment.
> 
> Otherwise the patch looks good so feel free to add:
> 
> Reviewed-by: Jan Kara 
> 
> after fixing this nit.
> 

Hi Jan

Thanks for the review. Here is the updated patch. I also captured your
Reviewed-by.
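
For context, the intended user is the virtiofs dax range-reclaim path,
roughly like this (the wrapper below is illustrative, not from the actual
reclaim patches):

/*
 * Check whether a dax range of an inode still has pinned pages before
 * trying to reclaim it.
 */
static bool fuse_dax_range_busy(struct inode *inode, loff_t start, loff_t len)
{
	struct page *page;

	page = dax_layout_busy_page_range(inode->i_mapping, start,
					  start + len - 1);
	return page != NULL;
}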


From 3f81f769be9419ffc5a788833339ed439dbcd48e Mon Sep 17 00:00:00 2001
From: Vivek Goyal 
Date: Tue, 3 Mar 2020 14:58:21 -0500
Subject: [PATCH 02/20] dax: Create a range version of dax_layout_busy_page()

virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.

As part of reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in current form this is per inode
and scans through all the pages of the inode.

We want to reclaim only a portion of memory (say 2MB page). So we want
to make sure that only that 2MB range of pages do not have any
references  (and don't want to unmap all the pages of inode).

Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.

Cc: Dan Williams 
Cc: linux-nvd...@lists.01.org
Cc: Jan Kara 
Cc: Vishal L Verma 
Cc: "Weiny, Ira" 
Signed-off-by: Vivek Goyal 
Reviewed-by: Jan Kara 
---
 fs/dax.c|   29 +++--
 include/linux/dax.h |6 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

Index: redhat-linux/fs/dax.c
===
--- redhat-linux.orig/fs/dax.c  2020-08-20 14:04:41.995676669 +
+++ redhat-linux/fs/dax.c   2020-08-20 14:15:20.072676669 +
@@ -559,8 +559,11 @@ fallback:
 }
 
 /**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page_range - find first pinned page in @mapping
  * @mapping: address space to scan for a page with ref count > 1
+ * @start: Starting offset. Page containing 'start' is included.
+ * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
+ *   pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
@@ -573,12 +576,15 @@ fallback:
  * to be able to run unmap_mapping_range() and subsequently not race
  * mapping_mapped() becoming true.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+   loff_t start, loff_t end)
 {
-   XA_STATE(xas, &mapping->i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+   pgoff_t start_idx = start >> PAGE_SHIFT;
+   pgoff_t end_idx;
+   XA_STATE(xas, &mapping->i_pages, start_idx);
 
/*
 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
 
+   /* If end == LLONG_MAX, all pages from start to till end of file */
+   if (end == LLONG_MAX)
+   end_idx = ULONG_MAX;
+   else
+   end_idx = end >> PAGE_SHIFT;
/*
 * If we race get_user_pages_fast() here either we'll see the
 * elevated page count in the iteration and wait, or
@@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct
 * against is no longer mapped in the page tables and bail to the
 * get_user_pages() slow path.  The slow path is protected by
 * pte_lock() and pmd_lock(). New references are not taken without
-* holding those locks, and unmap_mapping_range() will not zero the
+* holding those locks, and unmap_mapping_pages() will not zero the
 * pte or pmd without holding the respective lock, so we are
 * guaranteed to either see new references or prevent new
 * references from being established.
 */
-   unmap_mapping_range(mapping, 0, 0, 0);
+   unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
 
xas_lock_irq(&xas);
-   xas_for_each(&xas, entry, ULONG_MAX) {
+   

[PATCH v3 02/18] dax: Create a range version of dax_layout_busy_page()

2020-08-19 Thread Vivek Goyal
virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.

As part of reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in current form this is per inode
and scans through all the pages of the inode.

We want to reclaim only a portion of memory (say 2MB page). So we want
to make sure that only that 2MB range of pages do not have any
references  (and don't want to unmap all the pages of inode).

Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.

Cc: Dan Williams 
Cc: linux-nvd...@lists.01.org
Cc: Jan Kara 
Cc: Vishal L Verma 
Cc: "Weiny, Ira" 
Signed-off-by: Vivek Goyal 
---
 fs/dax.c| 29 +++--
 include/linux/dax.h |  6 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 95341af1a966..ddd705251d9f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -559,7 +559,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 }
 
 /**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page_range - find first pinned page in @mapping
  * @mapping: address space to scan for a page with ref count > 1
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
@@ -572,13 +572,19 @@ static void *grab_mapping_entry(struct xa_state *xas,
  * establishment of new mappings in this address_space. I.e. it expects
  * to be able to run unmap_mapping_range() and subsequently not race
  * mapping_mapped() becoming true.
+ *
+ * Partial pages are included. If 'end' is LLONG_MAX, pages in the range
+ * from 'start' to end of the file are inluded.
  */
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+   loff_t start, loff_t end)
 {
-   XA_STATE(xas, &mapping->i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+   pgoff_t start_idx = start >> PAGE_SHIFT;
+   pgoff_t end_idx;
+   XA_STATE(xas, &mapping->i_pages, start_idx);
 
/*
 * In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
 
+   /* If end == LLONG_MAX, all pages from start to till end of file */
+   if (end == LLONG_MAX)
+   end_idx = ULONG_MAX;
+   else
+   end_idx = end >> PAGE_SHIFT;
/*
 * If we race get_user_pages_fast() here either we'll see the
 * elevated page count in the iteration and wait, or
@@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
 * against is no longer mapped in the page tables and bail to the
 * get_user_pages() slow path.  The slow path is protected by
 * pte_lock() and pmd_lock(). New references are not taken without
-* holding those locks, and unmap_mapping_range() will not zero the
+* holding those locks, and unmap_mapping_pages() will not zero the
 * pte or pmd without holding the respective lock, so we are
 * guaranteed to either see new references or prevent new
 * references from being established.
 */
-   unmap_mapping_range(mapping, 0, 0, 0);
+   unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
 
xas_lock_irq(&xas);
-   xas_for_each(&xas, entry, ULONG_MAX) {
+   xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
if (unlikely(dax_is_locked(entry)))
@@ -625,6 +636,12 @@ struct page *dax_layout_busy_page(struct address_space 
*mapping)
xas_unlock_irq(&xas);
return page;
 }
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+   return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
+}
 EXPORT_SYMBOL_GPL(dax_layout_busy_page);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..9016929db4c6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -141,6 +141,7 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
struct dax_device *dax_dev, struct writeback_control *wbc);
 
 struct pa

[PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-19 Thread Vivek Goyal
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
constraints via the FUSE_INIT map_alignment field.  Parse this field and
ensure our DAX mappings meet the alignment constraints.

We don't actually align anything differently since our mappings are
already 2MB aligned.  Just check the value when the connection is
established.  If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.

The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
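
For illustration only (not part of the patch), the compatibility test added
to process_init_reply() boils down to the check below: map_alignment carries
log2 of the byte alignment (12 for 4KB, 16 for 64KB), and both of those
divide the 2MB FUSE_DAX_SZ, so both are accepted.

/* Sketch of the check; FUSE_DAX_SZ is defined in fuse_i.h as 2MB. */
static bool map_alignment_compatible(unsigned int map_alignment)
{
	return (FUSE_DAX_SZ % (1UL << map_alignment)) == 0;
}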

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h  |  5 -
 fs/fuse/inode.c   | 18 --
 include/uapi/linux/fuse.h |  4 +++-
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 478c940b05b4..4a46e35222c7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,7 +47,10 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
-/* Default memory range size, 2MB */
+/*
+ * Default memory range size.  A power of 2 so it agrees with common FUSE_INIT
+ * map_alignment values 4KB and 64KB.
+ */
 #define FUSE_DAX_SZ (2*1024*1024)
 #define FUSE_DAX_SHIFT (21)
 #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b82eb61d63cc..947abdd776ca 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, 
struct fuse_args *args,
 {
struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
struct fuse_init_out *arg = &ia->out;
+   bool ok = true;
 
if (error || arg->major != FUSE_KERNEL_VERSION)
-   fc->conn_error = 1;
+   ok = false;
else {
unsigned long ra_pages;
 
@@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, 
struct fuse_args *args,
min_t(unsigned int, FUSE_MAX_MAX_PAGES,
max_t(unsigned int, arg->max_pages, 1));
}
+   if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
+   (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
+   pr_err("FUSE: map_alignment %u incompatible"
+  " with dax mem range size %u\n",
+  arg->map_alignment, FUSE_DAX_SZ);
+   ok = false;
+   }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1060,6 +1068,11 @@ static void process_init_reply(struct fuse_conn *fc, 
struct fuse_args *args,
}
kfree(ia);
 
+   if (!ok) {
+   fc->conn_init = 0;
+   fc->conn_error = 1;
+   }
+
fuse_set_initialized(fc);
wake_up_all(&fc->blocked_waitq);
 }
@@ -1082,7 +1095,8 @@ void fuse_send_init(struct fuse_conn *fc)
FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+   FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+   FUSE_MAP_ALIGNMENT;
ia->args.opcode = FUSE_INIT;
ia->args.in_numargs = 1;
ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..5b85819e045f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -313,7 +313,9 @@ struct fuse_file_lock {
  * FUSE_CACHE_SYMLINKS: cache READLINK responses
  * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
  * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
- * FUSE_MAP_ALIGNMENT: map_alignment field is valid
+ * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
+ *foffset and moffset fields in struct
+ *fuse_setupmapping_out and fuse_removemapping_one.
  */
 #define FUSE_ASYNC_READ (1 << 0)
 #define FUSE_POSIX_LOCKS   (1 << 1)
-- 
2.25.4



[PATCH v3 09/18] fuse,virtiofs: Add a mount option to enable dax

2020-08-19 Thread Vivek Goyal
Add a mount option to allow using dax with virtio_fs.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h|  7 
 fs/fuse/inode.c |  3 ++
 fs/fuse/virtio_fs.c | 82 +
 3 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index cf5e675100ec..04fdd7c41bd1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -486,10 +486,14 @@ struct fuse_fs_context {
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
+   bool dax:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
 
+   /* DAX device, may be NULL */
+   struct dax_device *dax_dev;
+
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
 };
@@ -761,6 +765,9 @@ struct fuse_conn {
 
/** List of device instances belonging to this connection */
struct list_head devices;
+
+   /** DAX device, non-NULL if DAX is supported */
+   struct dax_device *dax_dev;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2ac5713c4c32..beac337ccc10 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -589,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+   if (fc->dax_dev)
+   seq_printf(m, ",dax");
return 0;
 }
 
@@ -1207,6 +1209,7 @@ int fuse_fill_super_common(struct super_block *sb, struct 
fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+   fc->dax_dev = ctx->dax_dev;
 
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 0fd3b5cecc5f..741cad4abad8 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "fuse_i.h"
@@ -81,6 +82,45 @@ struct virtio_fs_req_work {
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 struct fuse_req *req, bool in_flight);
 
+enum {
+   OPT_DAX,
+};
+
+static const struct fs_parameter_spec virtio_fs_parameters[] = {
+   fsparam_flag("dax", OPT_DAX),
+   {}
+};
+
+static int virtio_fs_parse_param(struct fs_context *fc,
+struct fs_parameter *param)
+{
+   struct fs_parse_result result;
+   struct fuse_fs_context *ctx = fc->fs_private;
+   int opt;
+
+   opt = fs_parse(fc, virtio_fs_parameters, param, &result);
+   if (opt < 0)
+   return opt;
+
+   switch (opt) {
+   case OPT_DAX:
+   ctx->dax = 1;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static void virtio_fs_free_fc(struct fs_context *fc)
+{
+   struct fuse_fs_context *ctx = fc->fs_private;
+
+   if (ctx)
+   kfree(ctx);
+}
+
 static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
 {
struct virtio_fs *fs = vq->vdev->priv;
@@ -1220,23 +1260,27 @@ static const struct fuse_iqueue_ops virtio_fs_fiq_ops = 
{
.release= virtio_fs_fiq_release,
 };
 
-static int virtio_fs_fill_super(struct super_block *sb)
+static inline void virtio_fs_ctx_set_defaults(struct fuse_fs_context *ctx)
+{
+   ctx->rootmode = S_IFDIR;
+   ctx->default_permissions = 1;
+   ctx->allow_other = 1;
+   ctx->max_read = UINT_MAX;
+   ctx->blksize = 512;
+   ctx->destroy = true;
+   ctx->no_control = true;
+   ctx->no_force_umount = true;
+}
+
+static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)
 {
struct fuse_conn *fc = get_fuse_conn_super(sb);
struct virtio_fs *fs = fc->iq.priv;
+   struct fuse_fs_context *ctx = fsc->fs_private;
unsigned int i;
int err;
-   struct fuse_fs_context ctx = {
-   .rootmode = S_IFDIR,
-   .default_permissions = 1,
-   .allow_other = 1,
-   .max_read = UINT_MAX,
-   .blksize = 512,
-   .destroy = true,
-   .no_control = true,
-   .no_force_umount = true,
-   };
 
+   virtio_fs_ctx_set_defaults(ctx);
mutex_lock(&virtio_fs_mutex);
 
/* After holding mutex, make sure virtiofs device is still there.
@@ -1260,8 +1304,10 @@ static int virtio_fs_fill_super(struct super_block *sb)
}
 
/* virti

[PATCH v3 17/18] fuse,virtiofs: Maintain a list of busy elements

2020-08-19 Thread Vivek Goyal
This list will be used for selecting a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.
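
For illustration only (the actual reclaim logic arrives later in this
series): a consumer of this list would walk fc->busy_ranges oldest-first,
since fuse_setup_one_mapping() adds new entries at the tail. The helper
name and the take-the-first-entry policy below are hypothetical sketches.

/* Sketch only, assumes fc->lock is held. */
static struct fuse_dax_mapping *pick_reclaim_candidate(struct fuse_conn *fc)
{
	if (list_empty(&fc->busy_ranges))
		return NULL;

	return list_first_entry(&fc->busy_ranges,
				struct fuse_dax_mapping, busy_list);
}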

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c   | 22 ++
 fs/fuse/fuse_i.h |  7 +++
 fs/fuse/inode.c  |  4 
 3 files changed, 33 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index aaa57c625af7..723602813ad6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -213,6 +213,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
return dmap;
 }
 
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   list_del_init(&dmap->busy_list);
+   WARN_ON(fc->nr_busy_ranges == 0);
+   fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+   spin_lock(&fc->lock);
+   __dmap_remove_busy_list(fc, dmap);
+   spin_unlock(&fc->lock);
+}
+
 /* This assumes fc->lock is held */
 static void __dmap_add_to_free_pool(struct fuse_conn *fc,
struct fuse_dax_mapping *dmap)
@@ -266,6 +283,10 @@ static int fuse_setup_one_mapping(struct inode *inode, 
unsigned long start_idx,
/* Protected by fi->i_dmap_sem */
interval_tree_insert(&dmap->itn, &fi->dmap_tree);
fi->nr_dmaps++;
+   spin_lock(&fc->lock);
+   list_add_tail(&dmap->busy_list, &fc->busy_ranges);
+   fc->nr_busy_ranges++;
+   spin_unlock(&fc->lock);
}
return 0;
 }
@@ -335,6 +356,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn 
*fc,
pr_debug("fuse: freeing memory range start_idx=0x%lx end_idx=0x%lx "
 "window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
 dmap->itn.last, dmap->window_offset, dmap->length);
+   __dmap_remove_busy_list(fc, dmap);
dmap->itn.start = dmap->itn.last = 0;
__dmap_add_to_free_pool(fc, dmap);
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e555c9a33359..400a19a464ca 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -80,6 +80,9 @@ struct fuse_dax_mapping {
/* For interval tree in file/inode */
struct interval_tree_node itn;
 
+   /* Will connect in fc->busy_ranges to keep track of busy memory */
+   struct list_head busy_list;
+
/** Position in DAX window */
u64 window_offset;
 
@@ -812,6 +815,10 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
 
+   /* List of memory ranges which are busy */
+   unsigned long nr_busy_ranges;
+   struct list_head busy_ranges;
+
/*
 * DAX Window Free Ranges
 */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3735bc5fdfa2..671e84e3dd99 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -636,6 +636,8 @@ static void fuse_free_dax_mem_ranges(struct list_head 
*mem_list)
/* Free All allocated elements */
list_for_each_entry_safe(range, temp, mem_list, list) {
list_del(&range->list);
+   if (!list_empty(&range->busy_list))
+   list_del(&range->busy_list);
kfree(range);
}
 }
@@ -680,6 +682,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 */
range->window_offset = i * FUSE_DAX_SZ;
range->length = FUSE_DAX_SZ;
+   INIT_LIST_HEAD(&range->busy_list);
list_add_tail(&range->list, &mem_ranges);
}
 
@@ -727,6 +730,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
INIT_LIST_HEAD(&fc->free_ranges);
+   INIT_LIST_HEAD(&fc->busy_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
-- 
2.25.4



[PATCH v3 06/18] virtiofs: Provide a helper function for virtqueue initialization

2020-08-19 Thread Vivek Goyal
This reduces code duplication and makes it a little easier to read the code.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/virtio_fs.c | 50 +++--
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 104f35de5270..ed8da4825b70 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -24,6 +24,8 @@ enum {
VQ_REQUEST
 };
 
+#define VQ_NAME_LEN 24
+
 /* Per-virtqueue state */
 struct virtio_fs_vq {
spinlock_t lock;
@@ -36,7 +38,7 @@ struct virtio_fs_vq {
bool connected;
long in_flight;
struct completion in_flight_zero; /* No inflight requests */
-   char name[24];
+   char name[VQ_NAME_LEN];
 } cacheline_aligned_in_smp;
 
 /* A virtio-fs device instance */
@@ -596,6 +598,26 @@ static void virtio_fs_vq_done(struct virtqueue *vq)
schedule_work(&fsvq->done_work);
 }
 
+static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
+ int vq_type)
+{
+   strncpy(fsvq->name, name, VQ_NAME_LEN);
+   spin_lock_init(&fsvq->lock);
+   INIT_LIST_HEAD(&fsvq->queued_reqs);
+   INIT_LIST_HEAD(&fsvq->end_reqs);
+   init_completion(&fsvq->in_flight_zero);
+
+   if (vq_type == VQ_REQUEST) {
+   INIT_WORK(&fsvq->done_work, virtio_fs_requests_done_work);
+   INIT_DELAYED_WORK(&fsvq->dispatch_work,
+ virtio_fs_request_dispatch_work);
+   } else {
+   INIT_WORK(&fsvq->done_work, virtio_fs_hiprio_done_work);
+   INIT_DELAYED_WORK(&fsvq->dispatch_work,
+ virtio_fs_hiprio_dispatch_work);
+   }
+}
+
 /* Initialize virtqueues */
 static int virtio_fs_setup_vqs(struct virtio_device *vdev,
   struct virtio_fs *fs)
@@ -611,7 +633,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
if (fs->num_request_queues == 0)
return -EINVAL;
 
-   fs->nvqs = 1 + fs->num_request_queues;
+   fs->nvqs = VQ_REQUEST + fs->num_request_queues;
fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
if (!fs->vqs)
return -ENOMEM;
@@ -625,29 +647,17 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
goto out;
}
 
+   /* Initialize the hiprio/forget request virtqueue */
callbacks[VQ_HIPRIO] = virtio_fs_vq_done;
-   snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name),
-   "hiprio");
+   virtio_fs_init_vq(&fs->vqs[VQ_HIPRIO], "hiprio", VQ_HIPRIO);
names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name;
-   INIT_WORK(&fs->vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work);
-   INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs);
-   INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].end_reqs);
-   INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work,
-   virtio_fs_hiprio_dispatch_work);
-   init_completion(&fs->vqs[VQ_HIPRIO].in_flight_zero);
-   spin_lock_init(&fs->vqs[VQ_HIPRIO].lock);
 
/* Initialize the requests virtqueues */
for (i = VQ_REQUEST; i < fs->nvqs; i++) {
-   spin_lock_init(&fs->vqs[i].lock);
-   INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
-   INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
- virtio_fs_request_dispatch_work);
-   INIT_LIST_HEAD(&fs->vqs[i].queued_reqs);
-   INIT_LIST_HEAD(&fs->vqs[i].end_reqs);
-   init_completion(&fs->vqs[i].in_flight_zero);
-   snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
-"requests.%u", i - VQ_REQUEST);
+   char vq_name[VQ_NAME_LEN];
+
+   snprintf(vq_name, VQ_NAME_LEN, "requests.%u", i - VQ_REQUEST);
+   virtio_fs_init_vq(&fs->vqs[i], vq_name, VQ_REQUEST);
callbacks[i] = virtio_fs_vq_done;
names[i] = fs->vqs[i].name;
}
-- 
2.25.4



[PATCH v3 07/18] fuse: Get rid of no_mount_options

2020-08-19 Thread Vivek Goyal
This option was introduced so that virtio_fs does not show any mount
options in fuse_show_options(), because none of these options can be
controlled by the mounter.

Very soon we are planning to introduce the option "dax", which the mounter
should be able to specify, and no_mount_options no longer works for that.
What we need is a per-option flag so that the filesystem can specify
which options to show.

Add a few such flags to control the behavior in a more fine-grained manner
and get rid of no_mount_options.
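
For illustration only (not part of the patch): with the new *_show flags, a
filesystem filling in struct fuse_fs_context shows an option only when the
mounter actually chose it. The helper below is hypothetical; the field
names come from this patch.

/* Sketch: option chosen by the mounter vs. option forced internally. */
static void example_apply_show_policy(struct fuse_fs_context *ctx)
{
	/* Selected by the user: make it visible in fuse_show_options() */
	ctx->allow_other = true;
	ctx->allow_other_show = true;

	/* Forced by the filesystem itself: set the value, keep it hidden */
	ctx->default_permissions = true;
	/* ctx->default_permissions_show intentionally left false */
}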

Signed-off-by: Vivek Goyal 
---
 fs/fuse/fuse_i.h| 14 ++
 fs/fuse/inode.c | 22 ++
 fs/fuse/virtio_fs.c |  1 -
 3 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..cf5e675100ec 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -471,18 +471,21 @@ struct fuse_fs_context {
int fd;
unsigned int rootmode;
kuid_t user_id;
+   bool user_id_show;
kgid_t group_id;
+   bool group_id_show;
bool is_bdev:1;
bool fd_present:1;
bool rootmode_present:1;
bool user_id_present:1;
bool group_id_present:1;
bool default_permissions:1;
+   bool default_permissions_show:1;
bool allow_other:1;
+   bool allow_other_show:1;
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
-   bool no_mount_options:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
@@ -512,9 +515,11 @@ struct fuse_conn {
 
/** The user id for this mount */
kuid_t user_id;
+   bool user_id_show:1;
 
/** The group id for this mount */
kgid_t group_id;
+   bool group_id_show:1;
 
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;
@@ -698,10 +703,14 @@ struct fuse_conn {
 
/** Check permissions based on the file mode or not? */
unsigned default_permissions:1;
+   bool default_permissions_show:1;
 
/** Allow other than the mounter user to access the filesystem ? */
unsigned allow_other:1;
 
+   /** Show allow_other in mount options */
+   bool allow_other_show:1;
+
/** Does the filesystem support copy_file_range? */
unsigned no_copy_file_range:1;
 
@@ -717,9 +726,6 @@ struct fuse_conn {
/** Do not allow MNT_FORCE umount */
unsigned int no_force_umount:1;
 
-   /* Do not show mount options */
-   unsigned int no_mount_options:1;
-
/** The number of requests waiting for completion */
atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..2ac5713c4c32 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -535,10 +535,12 @@ static int fuse_parse_param(struct fs_context *fc, struct 
fs_parameter *param)
 
case OPT_DEFAULT_PERMISSIONS:
ctx->default_permissions = true;
+   ctx->default_permissions_show = true;
break;
 
case OPT_ALLOW_OTHER:
ctx->allow_other = true;
+   ctx->allow_other_show = true;
break;
 
case OPT_MAX_READ:
@@ -573,14 +575,15 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);
 
-   if (fc->no_mount_options)
-   return 0;
-
-   seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, 
fc->user_id));
-   seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, 
fc->group_id));
-   if (fc->default_permissions)
+   if (fc->user_id_show)
+   seq_printf(m, ",user_id=%u",
+  from_kuid_munged(fc->user_ns, fc->user_id));
+   if (fc->group_id_show)
+   seq_printf(m, ",group_id=%u",
+  from_kgid_munged(fc->user_ns, fc->group_id));
+   if (fc->default_permissions && fc->default_permissions_show)
seq_puts(m, ",default_permissions");
-   if (fc->allow_other)
+   if (fc->allow_other && fc->allow_other_show)
seq_puts(m, ",allow_other");
if (fc->max_read != ~0)
seq_printf(m, ",max_read=%u", fc->max_read);
@@ -1193,14 +1196,17 @@ int fuse_fill_super_common(struct super_block *sb, 
struct fuse_fs_context *ctx)
sb->s_flags |= SB_POSIXACL;
 
fc->default_permissions = ctx->default_permissions;
+   fc->default_permissions_show = ctx->default_permissions_show;
fc->allow_other = ctx->allow_other;
+   fc->allow_other_show = ctx->allow_other_show;
fc->user_id = ctx->user_id;
+   fc->user_id_show = ctx->user_id_show
