Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-05 Thread Amir Goldstein
On Tue, Jan 5, 2021 at 6:26 PM Vivek Goyal  wrote:
>
> On Tue, Jan 05, 2021 at 09:11:23AM +0200, Amir Goldstein wrote:
> > > >
> > > > What I would rather see is:
> > > > - Non-volatile: first syncfs in every container gets an error (nice to 
> > > > have)
> > >
> > > I am not sure why are we making this behavior per container. This should
> > > be no different from current semantics we have for syncfs() on regular
> > > filesystem. And that will provide what you are looking for. If you
> > > want single error to be reported in all ovleray mounts, then make
> > > sure you have one fd open in each mount after mount, then call syncfs()
> > > on that fd.
> > >
> >
> > Ok.
> >
> > > Not sure why overlayfs behavior/semantics should be any differnt
> > > than what regular filessytems like ext4/xfs are offering. Once we
> > > get page cache sharing sorted out with xfs reflink, then people
> > > will not even need overlayfs and be able to launch containers
> > > just using xfs reflink and share base image. In that case also
> > > they will need to keep an fd open per container they want to
> > > see an error in.
> > >
> > > So my patches exactly provide that. syncfs() behavior is same with
> > > overlayfs as application gets it on other filesystems. And to me
> > > its important to keep behavior same.
> > >
> > > > - Volatile: every syncfs and every fsync in every container gets an 
> > > > error
> > > >   (important IMO)
> > >
> > > For volatile mounts, I agree that we need to fail overlayfs instance
> > > as soon as first error is detected since mount. And this applies to
> > > not only syncfs()/fsync() but to read/write and other operations too.
> > >
> > > For that we will need additional patches which are floating around
> > > to keep errseq sample in overlay and check for errors in all
> > > paths syncfs/fsync/read/write/ and fail fs.
> >
> > > But these patches build on top of my patches.
> >
> > Here we disagree.
> >
> > I don't see how Jeff's patch is "building on top of your patches"
> > seeing that it is perfectly well contained and does not in fact depend
> > on your patches.
>
> Jeff's patches are solving problem only for volatile mounts and they
> are propagating error to overlayfs sb.
>
> My patches are solving the issue both for volatile mount as well as
> non-volatile mounts and solve it using same method so there is no
> confusion.
>
> So there are multiple pieces to this puzzle and IMHO, it probably
> should be fixed in this order.
>
> A. First fix the syncfs() path to return error both for volatile as
>as well non-volatile mounts.
>
> B. And then add patches to fail filesystem for volatile mount as soon
>as first error is detected (either in syncfs path or in other paths
>like read/write/...). This probably will require to save errseq
>in ovl_fs, and then compare with upper_sb in critical paths and fail
>filesystem as soon as error is detected.
>
> C. Finally fix the issues related to mount/remount error detection which
>Sargun is wanting to fix. This will be largerly solved by B except
>saving errseq on disk.
>
> My patches should fix the first problem. And more patches can be
> applied on top to fix issue B and issue C.
>
> Now if we agree with this, in this context I see that fixing problem
> B and C is building on top of my patches which fixes problem A.
>

That order is fine by me.

> >
> > And I do insist that the fix for volatile mounts syncfs/fsync error
> > reporting should be applied before your patches or at the very least
> > not heavily depend on them.
>
> I still don't understand that why volatile syncfs() error reporting
> is more important than non-volatile syncfs(). But I will stop harping
> on this point now.
>
> My issue with Jeff's patches is that syncfs() error reporting should
> be dealt in same way both for volatile and non-volatile mount. That
> is compare file->f_sb_err and upper_sb->s_wb_err to figure out if
> there is an error to report to user space. Currently this patches
> only solve the problem for volatile mounts and use propagation to
> overlay sb which is conflicting for non-volatile mounts.
>
> IIUC, your primary concern with volatile mount is that you want to
> detect as soon as writeback error happens, and flag it to container
> manager so that container manager can stop container, throw away
> upper layer and restart from scratch. If yes, what you want can
> be solved by solving problem B and backporting it to LTS kernel.
> I think patches for that will be well contained within overlayfs
> (And no VFS) changes and should be relatively easy to backport.
>
> IOW, backportability to LTS kernel should not be a concern/blocker
> for my patch series which fixes syncfs() issue for overlayfs.
>

That's all I wanted to know.

Thanks,
Amir.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-05 Thread Vivek Goyal
On Tue, Jan 05, 2021 at 09:11:23AM +0200, Amir Goldstein wrote:
> > >
> > > What I would rather see is:
> > > - Non-volatile: first syncfs in every container gets an error (nice to 
> > > have)
> >
> > I am not sure why are we making this behavior per container. This should
> > be no different from current semantics we have for syncfs() on regular
> > filesystem. And that will provide what you are looking for. If you
> > want single error to be reported in all ovleray mounts, then make
> > sure you have one fd open in each mount after mount, then call syncfs()
> > on that fd.
> >
> 
> Ok.
> 
> > Not sure why overlayfs behavior/semantics should be any differnt
> > than what regular filessytems like ext4/xfs are offering. Once we
> > get page cache sharing sorted out with xfs reflink, then people
> > will not even need overlayfs and be able to launch containers
> > just using xfs reflink and share base image. In that case also
> > they will need to keep an fd open per container they want to
> > see an error in.
> >
> > So my patches exactly provide that. syncfs() behavior is same with
> > overlayfs as application gets it on other filesystems. And to me
> > its important to keep behavior same.
> >
> > > - Volatile: every syncfs and every fsync in every container gets an error
> > >   (important IMO)
> >
> > For volatile mounts, I agree that we need to fail overlayfs instance
> > as soon as first error is detected since mount. And this applies to
> > not only syncfs()/fsync() but to read/write and other operations too.
> >
> > For that we will need additional patches which are floating around
> > to keep errseq sample in overlay and check for errors in all
> > paths syncfs/fsync/read/write/ and fail fs.
> 
> > But these patches build on top of my patches.
> 
> Here we disagree.
> 
> I don't see how Jeff's patch is "building on top of your patches"
> seeing that it is perfectly well contained and does not in fact depend
> on your patches.

Jeff's patches are solving problem only for volatile mounts and they
are propagating error to overlayfs sb.

My patches are solving the issue both for volatile mount as well as
non-volatile mounts and solve it using same method so there is no
confusion.

So there are multiple pieces to this puzzle and IMHO, it probably
should be fixed in this order.

A. First fix the syncfs() path to return error both for volatile as
   as well non-volatile mounts.

B. And then add patches to fail filesystem for volatile mount as soon
   as first error is detected (either in syncfs path or in other paths
   like read/write/...). This probably will require to save errseq
   in ovl_fs, and then compare with upper_sb in critical paths and fail
   filesystem as soon as error is detected.

C. Finally fix the issues related to mount/remount error detection which
   Sargun is wanting to fix. This will be largerly solved by B except
   saving errseq on disk.

My patches should fix the first problem. And more patches can be
applied on top to fix issue B and issue C.

Now if we agree with this, in this context I see that fixing problem
B and C is building on top of my patches which fixes problem A.

> 
> And I do insist that the fix for volatile mounts syncfs/fsync error
> reporting should be applied before your patches or at the very least
> not heavily depend on them.

I still don't understand that why volatile syncfs() error reporting
is more important than non-volatile syncfs(). But I will stop harping
on this point now.

My issue with Jeff's patches is that syncfs() error reporting should
be dealt in same way both for volatile and non-volatile mount. That
is compare file->f_sb_err and upper_sb->s_wb_err to figure out if
there is an error to report to user space. Currently this patches
only solve the problem for volatile mounts and use propagation to
overlay sb which is conflicting for non-volatile mounts.

IIUC, your primary concern with volatile mount is that you want to
detect as soon as writeback error happens, and flag it to container
manager so that container manager can stop container, throw away
upper layer and restart from scratch. If yes, what you want can
be solved by solving problem B and backporting it to LTS kernel.
I think patches for that will be well contained within overlayfs
(And no VFS) changes and should be relatively easy to backport.

IOW, backportability to LTS kernel should not be a concern/blocker
for my patch series which fixes syncfs() issue for overlayfs.

Thanks
Vivek

> 
> volatile mount was introduced in fresh new v5.10, which is also an
> LTS kernel. It would be inconsiderate of volatile mount users and developers
> to make backporting that fix to v5.10.y any harder than it should be.

> 
> > My patches don't solve this problem of failing overlay mount for
> > the volatile mount case.
> >
> 
> Here we agree.
> 
> > >
> > > This is why I prefer to sample upper sb error on mount and propagate
> > > new errors to overlayfs sb (Jeff's patch).
> >
> > Ok, I think 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Amir Goldstein
> >
> > What I would rather see is:
> > - Non-volatile: first syncfs in every container gets an error (nice to have)
>
> I am not sure why are we making this behavior per container. This should
> be no different from current semantics we have for syncfs() on regular
> filesystem. And that will provide what you are looking for. If you
> want single error to be reported in all ovleray mounts, then make
> sure you have one fd open in each mount after mount, then call syncfs()
> on that fd.
>

Ok.

> Not sure why overlayfs behavior/semantics should be any differnt
> than what regular filessytems like ext4/xfs are offering. Once we
> get page cache sharing sorted out with xfs reflink, then people
> will not even need overlayfs and be able to launch containers
> just using xfs reflink and share base image. In that case also
> they will need to keep an fd open per container they want to
> see an error in.
>
> So my patches exactly provide that. syncfs() behavior is same with
> overlayfs as application gets it on other filesystems. And to me
> its important to keep behavior same.
>
> > - Volatile: every syncfs and every fsync in every container gets an error
> >   (important IMO)
>
> For volatile mounts, I agree that we need to fail overlayfs instance
> as soon as first error is detected since mount. And this applies to
> not only syncfs()/fsync() but to read/write and other operations too.
>
> For that we will need additional patches which are floating around
> to keep errseq sample in overlay and check for errors in all
> paths syncfs/fsync/read/write/ and fail fs.

> But these patches build on top of my patches.

Here we disagree.

I don't see how Jeff's patch is "building on top of your patches"
seeing that it is perfectly well contained and does not in fact depend
on your patches.

And I do insist that the fix for volatile mounts syncfs/fsync error
reporting should be applied before your patches or at the very least
not heavily depend on them.

volatile mount was introduced in fresh new v5.10, which is also an
LTS kernel. It would be inconsiderate of volatile mount users and developers
to make backporting that fix to v5.10.y any harder than it should be.

> My patches don't solve this problem of failing overlay mount for
> the volatile mount case.
>

Here we agree.

> >
> > This is why I prefer to sample upper sb error on mount and propagate
> > new errors to overlayfs sb (Jeff's patch).
>
> Ok, I think this is one of the key points of the whole discussion. What
> mechanism should be used to propagate writeback errors through overlayfs.
>
> A. Propagate errors from upper sb to overlay sb.
> B. Leave overlay sb alone and use upper sb for error checks.
>
> We don't have good model to propagate errors between super blocks,
> so Jeff preferred not to do error propagation between super blocks
> for regular mounts.
>
> https://lore.kernel.org/linux-fsdevel/bff90dfee3a3392d67a4f3516ab28989e87fa25f.ca...@kernel.org/
>
> If we are not defining new semantics for syncfs() for overlayfs, then
> I can't see what's the advantage of coming up with new mechanism to
> propagate errors to overlay sb. Approach B should work just fine and
> provide the syncfs() semantics we want for overlayfs (Same semantics
> as other filesystems).
>

Ok. I am on board with B.

Philosophically. overlayfs model is somewhere between "passthrough"
and "proxy" when handling pure upper files and as overlayfs evolves,
it steadily moves towards the "proxy" model, with page cache and
writeback being the largest remaining piece to convert.

So I concede that as long as overlayfs writeback is mostly passthrough,
syncfs might as well be passthrough to upper fs as well.

Thanks,
Amir.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Amir Goldstein
On Mon, Jan 4, 2021 at 5:40 PM Vivek Goyal  wrote:
>
> On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > > Since Jeff's patch is minimal, I think that it should be the fix applied
> > > > first and proposed for stable (with adaptations for non-volatile 
> > > > overlay).
> > >
> > > Does stable fix has to be same as mainline fix. IOW, I think atleast in
> > > mainline we should first fix it the right way and then think how to fix
> > > it for stable. If fixes taken in mainline are not realistic for stable,
> > > can we push a different small fix just for stable?
> >
> > We can do a lot of things.
> > But if we are able to create a series with minimal (and most critical) fixes
> > followed by other fixes, it would be easier for everyone involved.
>
> I am not sure this is really critical. writeback error reporting for
> overlayfs are broken since the beginning for regular mounts. There is no
> notion of these errors being reported to user space. If that did not
> create a major issue, then why suddenly volatile mounts make it
> a critical issue.
>

Volatile mounts didn't make this a critical issue.
But this discussion made us notice a mildly serious issue.
It is not surprising to me that users did not report this issue.
Do you know what it takes for a user to notice that writeback had failed,
but an application did fsync and error did not get reported?
Filesystem durability guaranties are hard to prove especially with so
many subsystem layers and with fsync that does return an error correctly.
I once found a durability bug in fsync of xfs that existed for 12 years.
That fact does not at all make it any less critical.

> To me we should fix the issue properly which is easy to maintain
> down the line and then worry about doing a stable fix if need be.
>
> >
> > >
> > > IOW, because we have to push a fix in stable, should not determine
> > > what should be problem solution for mainline, IMHO.
> > >
> >
> > I find in this case there is a correlation between the simplest fix and the
> > most relevant fix for stable.
> >
> > > The porblem I have with Jeff's fix is that its only works for volatile
> > > mounts. While I prefer a solution where syncfs() is fixed both for
> > > volatile as well as non-volatile mount and then there is less confusion.
> > >
> >
> > I proposed a variation on Jeff's patch that covers both cases.
> > Sargun is going to work on it.
>
> What's the problem with my patches which fixes syncfs() error reporting
> for overlayfs both for volatile and non-volatile mount?
>

- mount 1000 overlays
- 1 writeback error recorded in upper sb
- syncfs (new fd) inside each of the 1000 containers

With your patch 3/3 only one syncfs will report an error for
both volatile and non-volatile cases. Right?

What I would rather see is:
- Non-volatile: first syncfs in every container gets an error (nice to have)
- Volatile: every syncfs and every fsync in every container gets an error
  (important IMO)

This is why I prefer to sample upper sb error on mount and propagate
new errors to overlayfs sb (Jeff's patch).

I am very much in favor of your patch 1/3 and I am not against the concept
of patches 2-3/3. Just think that ovl_errseq_check_advance() is not the
implementation that gives the most desirable result.

If people do accept my point of view that proxying the stacked error check
is preferred over "passthrough" to upper sb error check, then as a by-product,
the new ->check_error() method is not going to make much of a difference for
overlayfs. Maybe it can be used to fine tune some corner cases.
I am not sure.
If we do agree on the propagate error concept then IMO all other use
cases for not consuming the unseen error from upper fs are nice-to-have.

Before we continue to debate on the implementation, let's first try
to agree on the desired behavior, what is a must vs. what is nice to have.
Without consensus on this, it will be quite hard to converge.

Another thing, to help everyone, I think it is best that any patch on ovl_syncfs
"solutions" will include detailed description of the use cases it solves and
the use cases that it leaves unsolved.

Thanks,
Amir.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Jan 04, 2021 at 11:42:51PM +0200, Amir Goldstein wrote:
> On Mon, Jan 4, 2021 at 5:40 PM Vivek Goyal  wrote:
> >
> > On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > > > Since Jeff's patch is minimal, I think that it should be the fix 
> > > > > applied
> > > > > first and proposed for stable (with adaptations for non-volatile 
> > > > > overlay).
> > > >
> > > > Does stable fix has to be same as mainline fix. IOW, I think atleast in
> > > > mainline we should first fix it the right way and then think how to fix
> > > > it for stable. If fixes taken in mainline are not realistic for stable,
> > > > can we push a different small fix just for stable?
> > >
> > > We can do a lot of things.
> > > But if we are able to create a series with minimal (and most critical) 
> > > fixes
> > > followed by other fixes, it would be easier for everyone involved.
> >
> > I am not sure this is really critical. writeback error reporting for
> > overlayfs are broken since the beginning for regular mounts. There is no
> > notion of these errors being reported to user space. If that did not
> > create a major issue, then why suddenly volatile mounts make it
> > a critical issue.
> >
> 
> Volatile mounts didn't make this a critical issue.
> But this discussion made us notice a mildly serious issue.
> It is not surprising to me that users did not report this issue.
> Do you know what it takes for a user to notice that writeback had failed,
> but an application did fsync and error did not get reported?
> Filesystem durability guaranties are hard to prove especially with so
> many subsystem layers and with fsync that does return an error correctly.
> I once found a durability bug in fsync of xfs that existed for 12 years.
> That fact does not at all make it any less critical.
> 
> > To me we should fix the issue properly which is easy to maintain
> > down the line and then worry about doing a stable fix if need be.
> >
> > >
> > > >
> > > > IOW, because we have to push a fix in stable, should not determine
> > > > what should be problem solution for mainline, IMHO.
> > > >
> > >
> > > I find in this case there is a correlation between the simplest fix and 
> > > the
> > > most relevant fix for stable.
> > >
> > > > The porblem I have with Jeff's fix is that its only works for volatile
> > > > mounts. While I prefer a solution where syncfs() is fixed both for
> > > > volatile as well as non-volatile mount and then there is less confusion.
> > > >
> > >
> > > I proposed a variation on Jeff's patch that covers both cases.
> > > Sargun is going to work on it.
> >
> > What's the problem with my patches which fixes syncfs() error reporting
> > for overlayfs both for volatile and non-volatile mount?
> >
> 
> - mount 1000 overlays
> - 1 writeback error recorded in upper sb
> - syncfs (new fd) inside each of the 1000 containers
> 
> With your patch 3/3 only one syncfs will report an error for
> both volatile and non-volatile cases. Right?

Right. If you don't have an old fd open in each container, then only
one container will see the error. If you want to see error in each
container, then one fd needs to be kept opened in each container
before error hapens and call syncfs() on that fd, and then each
container should see the error.

> 
> What I would rather see is:
> - Non-volatile: first syncfs in every container gets an error (nice to have)

I am not sure why are we making this behavior per container. This should
be no different from current semantics we have for syncfs() on regular
filesystem. And that will provide what you are looking for. If you
want single error to be reported in all ovleray mounts, then make
sure you have one fd open in each mount after mount, then call syncfs()
on that fd.

Not sure why overlayfs behavior/semantics should be any differnt
than what regular filessytems like ext4/xfs are offering. Once we
get page cache sharing sorted out with xfs reflink, then people
will not even need overlayfs and be able to launch containers
just using xfs reflink and share base image. In that case also
they will need to keep an fd open per container they want to
see an error in.

So my patches exactly provide that. syncfs() behavior is same with
overlayfs as application gets it on other filesystems. And to me
its important to keep behavior same.

> - Volatile: every syncfs and every fsync in every container gets an error
>   (important IMO)

For volatile mounts, I agree that we need to fail overlayfs instance
as soon as first error is detected since mount. And this applies to
not only syncfs()/fsync() but to read/write and other operations too.

For that we will need additional patches which are floating around
to keep errseq sample in overlay and check for errors in all
paths syncfs/fsync/read/write/ and fail fs. But these patches
build on top of my patches. My patches don't solve this problem of
failing overlay mount for the volatile mount case.

> 
> This is why I prefer to sample upper sb error 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > Currently syncfs() and fsync() seem to be two interfaces which check and
> > return writeback errors on superblock to user space. fsync() should
> > work fine with overlayfs as it relies on underlying filesystem to
> > do the check and return error. For example, if ext4 is on upper filesystem,
> > then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
> > upper file and returns error. So overlayfs does not have to do anything
> > special.
> > 
> > But with syncfs(), error check happens in vfs in syncfs() w.r.t
> > overlay_sb->s_wb_err. Given overlayfs is stacked filesystem, it
> > does not do actual writeback and all writeback errors are recorded
> > on underlying filesystem. So sb->s_wb_err is never updated hence
> > syncfs() does not work with overlay.
> > 
> > Jeff suggested that instead of trying to propagate errors to overlay
> > super block, why not simply check for errors against upper filesystem
> > super block. I implemented this idea.
> > 
> > Overlay file has "since" value which needs to be initialized at open
> > time. Overlay overrides VFS initialization and re-initializes
> > f->f_sb_err w.r.t upper super block. Later when
> > ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as
> > since value to figure out if any error on upper sb has happened since
> > then.
> > 
> > Note, Right now this patch only deals with regular file and directories.
> > Yet to deal with special files like device inodes, socket, fifo etc.
> > 
> > Suggested-by: Jeff Layton 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/overlayfs/file.c  |  1 +
> >  fs/overlayfs/overlayfs.h |  1 +
> >  fs/overlayfs/readdir.c   |  1 +
> >  fs/overlayfs/super.c | 23 +++
> >  fs/overlayfs/util.c  | 13 +
> >  5 files changed, 39 insertions(+)
> > 
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index efccb7c1f9bc..7b58a44dcb71 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file 
> > *file)
> > return PTR_ERR(realfile);
> >  
> > file->private_data = realfile;
> > +   ovl_init_file_errseq(file);
> >  
> > return 0;
> >  }
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index f8880aa2ba0e..47838abbfb3d 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct 
> > dentry *dentry);
> >  bool ovl_is_metacopy_dentry(struct dentry *dentry);
> >  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
> >  int padding);
> > +void ovl_init_file_errseq(struct file *file);
> >  
> >  static inline bool ovl_is_impuredir(struct super_block *sb,
> > struct dentry *dentry)
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 01620ebae1bd..0c48f1545483 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct 
> > file *file)
> > od->is_real = ovl_dir_is_real(file->f_path.dentry);
> > od->is_upper = OVL_TYPE_UPPER(type);
> > file->private_data = od;
> > +   ovl_init_file_errseq(file);
> >  
> > return 0;
> >  }
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..d99867983722 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int 
> > *flags, char *data)
> > return ret;
> >  }
> >  
> > +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> > *file)
> > +{
> > +   struct ovl_fs *ofs = sb->s_fs_info;
> > +   struct super_block *upper_sb;
> > +   int ret;
> > +
> > +   if (!ovl_upper_mnt(ofs))
> > +   return 0;
> > +
> > +   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +
> > +   if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> > +   return 0;
> > +
> > +   /* Something changed, must use slow path */
> > +   spin_lock(>f_lock);
> > +   ret = errseq_check_and_advance(_sb->s_wb_err, >f_sb_err);
> > +   spin_unlock(>f_lock);
> > +
> > +   return ret;
> > +}
> > +
> >  static const struct super_operations ovl_super_operations = {
> > .alloc_inode= ovl_alloc_inode,
> > .free_inode = ovl_free_inode,
> > @@ -400,6 +422,7 @@ static const struct super_operations 
> > ovl_super_operations = {
> > .statfs = ovl_statfs,
> > .show_options   = ovl_show_options,
> > .remount_fs = ovl_remount,
> > +   .errseq_check_advance   = ovl_errseq_check_advance,
> >  };
> >  
> >  enum {
> > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> > index 23f475627d07..a1742847f3a8 100644
> > --- a/fs/overlayfs/util.c
> > +++ 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Dec 28, 2020 at 03:56:18PM +, Matthew Wilcox wrote:
> On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > overlayfs mount using the same upper sb before your open occurred.
> > 
> > IOW, consider two overlayfs mounts using the same upper layer sb:
> > 
> > ovlfs1  ovlfs2
> > --
> > mount
> > open fd1
> > write to fd1
> > 
> > mount (upper errseq_t SEEN flag marked)
> > open fd2
> > syncfs(fd2)
> > syncfs(fd1)
> > 
> > 
> > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > calls. The first one has a sample from before the error occurred, and
> > the second one has a sample of 0, due to the fact that the error was
> > unseen at open time.
> > 
> > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > two, then we can ensure that they both still get an error in this
> > situation.
> 
> But do we need to?  If the inode has been evicted we also lose the errno.

That's for the case of fsync(), right? For the case of syncfs() we will
not lose error as its stored in super_block.

Even for the case of fsync(), inode can be evicted only if no other
fd is opened for the file. So in above example, fd1 is opened so
inode can't be evicted, that means we will see error on syncfs(fd2)
and not lose it.

So if we start consuming upper fs on overlay mount(), it will be a
change of behavior for applications using same upper fs. So far
overlay mount() does not consume unseen error and even if an fd
is opened after the error, application will see error on super
block. If we consume error on mount(), we change behavior.

I am not saying that's necessarily bad, I am just trying to point
out that its a user space visible behavior change and worried
if somebody starts calling it a regression.

Anyway, I looks like two problems got mixed into same thread. One
problem we need to solve is that syncfs() on overlayfs should
report back writeback errors (as well as other errors) to applications.
And that's what this patch series is solving.

And then second issue is detecting writeback errors over remount
for volatile mounts. And that's where this question comes whether
we should split seen flag or we should simply consume error on
mount. So this can be further discussed when patches for this
changes are posted again.

For now, I will focus on trying to fix first issue and post patches
for that again after more testing.

Vivek


> The guarantee we provide is that a fd that was open before the error
> occurred will see the error.  An fd that's opened after the error occurred
> may or may not see the error.



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Dec 28, 2020 at 05:51:06PM +0200, Amir Goldstein wrote:
> On Mon, Dec 28, 2020 at 3:25 PM Jeff Layton  wrote:
> >
> > On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:
> > > On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > > > > In current master, syncfs() on any file by any container user will
> > > > > result in full syncfs() of the upperfs, which is very bad for 
> > > > > container
> > > > > isolation. This has been partly fixed by Chengguang Xu [1] and I 
> > > > > expect
> > > > > his work will be merged soon. Overlayfs still does not do the 
> > > > > writeback
> > > > > and syncfs() in overlay still waits for all upper fs writeback to 
> > > > > complete,
> > > > > but at least syncfs() in overlay only kicks writeback for upper fs 
> > > > > files
> > > > > dirtied by this overlay.
> > > > >
> > > > > [1] 
> > > > > https://lore.kernel.org/linux-unionfs/cajfpegsbb4itxw8zyurfvnc63zg7ku7vzpsnuzhasyzh-d5...@mail.gmail.com/
> > > > >
> > > > > Sharing the same SEEN flag among thousands of containers is also
> > > > > far from ideal, because effectively this means that any given workload
> > > > > in any single container has very little chance of observing the SEEN 
> > > > > flag.
> > > >
> > > > Perhaps you misunderstand how errseq works.  If each container samples
> > > > the errseq at startup, then they will all see any error which occurs
> > > > during their lifespan
> > >
> > > Meant to say "...very little chance of NOT observing the SEEN flag",
> > > but We are not in disagreement.
> > > My argument against sharing the SEEN flag refers to Vivek's patch of
> > > stacked errseq_sample()/errseq_check_and_advance() which does NOT
> > > sample errseq at overlayfs mount time. That is why my next sentence is:
> > > "I do agree with Matthew that overlayfs should sample errseq...".
> > >
> > > > (and possibly an error which occurred before they started up).
> > > >
> > >
> > > Right. And this is where the discussion of splitting the SEEN flag 
> > > started.
> > > Some of us want to treat overlayfs mount time as a true epoc for errseq.
> > > The new container didn't write any files yet, so it should not care about
> > > writeback errors from the past.
> > >
> > > I agree that it may not be very critical, but as I wrote before, I think 
> > > we
> > > should do our best to try and isolate container workloads.
> > >
> > > > > To this end, I do agree with Matthew that overlayfs should sample 
> > > > > errseq
> > > > > and the best patchset to implement it so far IMO is Jeff's patchset 
> > > > > [2].
> > > > > This patch set was written to cater only "volatile" overlayfs mount, 
> > > > > but
> > > > > there is no reason not to use the same mechanism for regular overlay
> > > > > mount. The only difference being that "volatile" overlay only checks 
> > > > > for
> > > > > error since mount on syncfs() (because "volatile" overlay does NOT
> > > > > syncfs upper fs) and regular overlay checks and advances the overlay's
> > > > > errseq sample on syncfs (and does syncfs upper fs).
> > > > >
> > > > > Matthew, I hope that my explanation of the use case and Jeff's answer
> > > > > is sufficient to understand why the split of the SEEN flag is needed.
> > > > >
> > > > > [2] 
> > > > > https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlay...@kernel.org/
> > > >
> > > > No, it still feels weird and wrong.
> > > >
> > >
> > > All right. Considering your reservations, I think perhaps the split of the
> > > SEEN flag can wait for a later time after more discussions and maybe
> > > not as suitable for stable as we thought.
> > >
> > > I think that for stable, it would be sufficient to adapt Surgun's original
> > > syncfs for volatile mount patch [1] to cover the non-volatile case:
> > > on mout:
> > > - errseq_sample() upper fs
> > > - on volatile mount, errseq_check() upper fs and fail mount on un-SEEN 
> > > error
> > > on syncfs:
> > > - errseq_check() for volatile mount
> > > - errseq_check_and_advance() for non-volatile mount
> > > - errseq_set() overlay sb on upper fs error
> > >
> > > Now errseq_set() is not only a hack around __sync_filesystem ignoring
> > > return value of ->sync_fs(). It is really needed for per-overlay SEEN
> > > error isolation in the non-volatile case.
> > >
> > > Unless I am missing something, I think we do not strictly need Vivek's
> > > 1/3 patch [2] for stable, but not sure.
> > >
> > > Sargun,
> > >
> > > Do you agree with the above proposal?
> > > Will you make it into a patch?
> > >
> > > Vivek, Jefff,
> > >
> > > Do you agree that overlay syncfs observing writeback errors that predate
> > > overlay mount time is an issue that can be deferred (maybe forever)?
> > >
> >
> > That's very application dependent.
> >
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Mon, Jan 04, 2021 at 05:22:07PM +0200, Amir Goldstein wrote:
> > > Since Jeff's patch is minimal, I think that it should be the fix applied
> > > first and proposed for stable (with adaptations for non-volatile overlay).
> >
> > Does stable fix has to be same as mainline fix. IOW, I think atleast in
> > mainline we should first fix it the right way and then think how to fix
> > it for stable. If fixes taken in mainline are not realistic for stable,
> > can we push a different small fix just for stable?
> 
> We can do a lot of things.
> But if we are able to create a series with minimal (and most critical) fixes
> followed by other fixes, it would be easier for everyone involved.

I am not sure this is really critical. writeback error reporting for
overlayfs are broken since the beginning for regular mounts. There is no
notion of these errors being reported to user space. If that did not
create a major issue, then why suddenly volatile mounts make it
a critical issue.

To me we should fix the issue properly which is easy to maintain
down the line and then worry about doing a stable fix if need be.

> 
> >
> > IOW, because we have to push a fix in stable, should not determine
> > what should be problem solution for mainline, IMHO.
> >
> 
> I find in this case there is a correlation between the simplest fix and the
> most relevant fix for stable.
> 
> > The porblem I have with Jeff's fix is that its only works for volatile
> > mounts. While I prefer a solution where syncfs() is fixed both for
> > volatile as well as non-volatile mount and then there is less confusion.
> >
> 
> I proposed a variation on Jeff's patch that covers both cases.
> Sargun is going to work on it.

What's the problem with my patches which fixes syncfs() error reporting
for overlayfs both for volatile and non-volatile mount?

Thanks
Vivek



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Amir Goldstein
> > Since Jeff's patch is minimal, I think that it should be the fix applied
> > first and proposed for stable (with adaptations for non-volatile overlay).
>
> Does stable fix has to be same as mainline fix. IOW, I think atleast in
> mainline we should first fix it the right way and then think how to fix
> it for stable. If fixes taken in mainline are not realistic for stable,
> can we push a different small fix just for stable?

We can do a lot of things.
But if we are able to create a series with minimal (and most critical) fixes
followed by other fixes, it would be easier for everyone involved.

>
> IOW, because we have to push a fix in stable, should not determine
> what should be problem solution for mainline, IMHO.
>

I find in this case there is a correlation between the simplest fix and the
most relevant fix for stable.

> The porblem I have with Jeff's fix is that its only works for volatile
> mounts. While I prefer a solution where syncfs() is fixed both for
> volatile as well as non-volatile mount and then there is less confusion.
>

I proposed a variation on Jeff's patch that covers both cases.
Sargun is going to work on it.

Thanks,
Amir.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-04 Thread Vivek Goyal
On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox  wrote:
> >
> > On Wed, Dec 23, 2020 at 08:21:41PM +, Sargun Dhillon wrote:
> > > On Wed, Dec 23, 2020 at 08:07:46PM +, Matthew Wilcox wrote:
> > > > On Wed, Dec 23, 2020 at 07:29:41PM +, Sargun Dhillon wrote:
> > > > > On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> > > > > > On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > > > > > > I fail to see why this is neccessary if you incorporate error 
> > > > > > > reporting into the
> > > > > > > sync_fs callback. Why is this separate from that callback? If you 
> > > > > > > pickup Jeff's
> > > > > > > patch that adds the 2nd flag to errseq for "observed", you should 
> > > > > > > be able to
> > > > > > > stash the first errseq seen in the ovl_fs struct, and do the 
> > > > > > > check-and-return
> > > > > > > in there instead instead of adding this new infrastructure.
> > > > > >
> > > > > > You still haven't explained why you want to add the "observed" flag.
> > > > >
> > > > >
> > > > > In the overlayfs model, many users may be using the same filesystem 
> > > > > (super block)
> > > > > for their upperdir. Let's say you have something like this:
> > > > >
> > > > > /workdir [Mounted FS]
> > > > > /workdir/upperdir1 [overlayfs upperdir]
> > > > > /workdir/upperdir2 [overlayfs upperdir]
> > > > > /workdir/userscratchspace
> > > > >
> > > > > The user needs to be able to do something like:
> > > > > sync -f ${overlayfs1}/file
> > > > >
> > > > > which in turn will call sync on the the underlying filesystem (the 
> > > > > one mounted
> > > > > on /workdir), and can check if the errseq has changed since the 
> > > > > overlayfs was
> > > > > mounted, and use that to return an error to the user.
> > > >
> > > > OK, but I don't see why the current scheme doesn't work for this.  If
> > > > (each instance of) overlayfs samples the errseq at mount time and then
> > > > check_and_advances it at sync time, it will see any error that has 
> > > > occurred
> > > > since the mount happened (and possibly also an error which occurred 
> > > > before
> > > > the mount happened, but hadn't been reported to anybody before).
> > > >
> > >
> > > If there is an outstanding error at mount time, and the SEEN flag is 
> > > unset,
> > > subsequent errors will not increment the counter, until the user calls 
> > > sync on
> > > the upperdir's filesystem. If overlayfs calls check_and_advance on the 
> > > upperdir's
> > > super block at any point, it will then set the seen block, and if the 
> > > user calls
> > > syncfs on the upperdir, it will not return that there is an outstanding 
> > > error,
> > > since overlayfs just cleared it.
> >
> > Your concern is this case:
> >
> > fs is mounted on /workdir
> > /workdir/A is written to and then closed.
> > writeback happens and -EIO happens, but there's nobody around to care.
> > /workdir/upperdir1 becomes part of an overlayfs mount
> > overlayfs samples the error
> > a user writes to /workdir/B, another -EIO occurs, but nothing happens
> > someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> > a user opens /workdir/B and calls syncfs, but sees no error
> >
> > do i have that right?  or is it something else?
> 
> IMO it is something else. Others may disagree.
> IMO the level of interference between users accessing overlay and users
> accessing upper fs directly is not well defined and it can stay this way.
> 
> Concurrent access to  /workdir/upperdir/A via overlay and underlying fs
> is explicitly warranted against in Documentation/filesystems/overlayfs.rst#
> Changes to underlying filesystems:
> "Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."

I think people use same underlying filesystem both as upper for multiple
overlayfs mounts as well as root filesystem. For example, when you
run podman (or docker), they all share same filesystem for all containers
as well as other non-containered apps use same filesystem.

IIUC, what we meant to say is that lowerdir/workdir/upperdir being
used for overlayfs mount should be left untouched. Right?

What I am trying to say is that while discussing this problem and
solution, we should assume that both a regular application might
be using same upper fs as being used by overlayfs. It seems to
be a very common operating model.

> 
> The question is whether syncfs(open(/workdir/B)) is considered
> "Changes to the underlying filesystems". Regardless of the answer,
> this is not an interesting case IMO.
> 
> The real issue is with interference between overlays that share the
> same upper fs, because this is by far and large the common use case
> that is creating real problems for a lot of container users.
> 
> Workloads running inside containers 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2021-01-02 Thread Jeff Layton
On Mon, 2020-12-28 at 20:48 +, Matthew Wilcox wrote:
> On Mon, Dec 28, 2020 at 09:37:37PM +0200, Amir Goldstein wrote:
> > Having said that, I never objected to the SEEN flag split.
> 
> I STRONGLY object to the SEEN flag split.  I think it is completely
> unnecessary and nobody's shown me a use-case that changes my mind.

I think the flag split makes better sense conceptually, though the
existing callers don't really have a need for it. I have a use-case in
mind that doesn't really involve overlayfs:

We still have a lot of internal callers that ultimately call
filemap_check_errors() to check and clear the mapping's AS_EIO/AS_ENOSPC
flags.

Splitting the SEEN flag in two could allow those callers to instead
sample the errseq_t using errseq_peek for their own purposes, without
clearing the REPORTED flag. That means that the existing semantics for
seeing errors on newly opened files could be preserved while allowing
internal callers to use errseq_t-based error handling.

That said, I don't have any patches to do this right now. It's a fairly
significant project to convert all of the existing callers of
filemap_check_errors() to such a scheme wholesale. It could be done
piecemeal though, and we could start discouraging new callers of
filemap_check_errors and the like.

-- 
Jeff Layton 



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Amir Goldstein
On Mon, Dec 28, 2020 at 7:26 PM Jeff Layton  wrote:
>
> On Mon, 2020-12-28 at 15:56 +, Matthew Wilcox wrote:
> > On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > > To be clear, the main thing you'll lose with the method above is the
> > > ability to see an unseen error on a newly opened fd, if there was an
> > > overlayfs mount using the same upper sb before your open occurred.
> > >
> > > IOW, consider two overlayfs mounts using the same upper layer sb:
> > >
> > > ovlfs1  ovlfs2
> > > --
> > > mount
> > > open fd1
> > > write to fd1
> > > 
> > > mount (upper errseq_t SEEN flag marked)
> > > open fd2
> > > syncfs(fd2)
> > > syncfs(fd1)
> > >
> > >
> > > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > > calls. The first one has a sample from before the error occurred, and
> > > the second one has a sample of 0, due to the fact that the error was
> > > unseen at open time.
> > >
> > > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > > two, then we can ensure that they both still get an error in this
> > > situation.
> >
> > But do we need to?  If the inode has been evicted we also lose the errno.
> > The guarantee we provide is that a fd that was open before the error
> > occurred will see the error.  An fd that's opened after the error occurred
> > may or may not see the error.
> >
>
> In principle, you can lose errors this way (which was the justification
> for making errseq_sample return 0 when there are unseen errors). E.g.,
> if you close fd1 instead of doing a syncfs on it, that error will be
> lost forever.
>
> As to whether that's OK, it's hard to say. It is a deviation from how
> this works in a non-containerized situation, and I'd argue that it's
> less than ideal. You may or may not see the error on fd2, but it's
> dependent on events that take place outside the container and that
> aren't observable from within it. That effectively makes the results
> non-deterministic, which is usually a bad thing in computing...
>

I understand that user experience inside containers will deviate from
non containerized use cases. I can't say that I fully understand the
situations that deviate.

Having said that, I never objected to the SEEN flag split.
To me, the split looks architecturally correct. If not for anything else,
then for not observing past errors inside the overlay mount.
I think you still need to convince Matthew though.

The question remains what, if anything, should be nominated for
stable. I was trying to propose the minimal patch that fixes the
most basic syncfs overlayfs issues. In that context, it seemed
that the issues that SEEN flag split solves are not on the
MUST HAVE list, but maybe I am wrong.

Sargun,

How about sending another version of your patch, with or without the
SEEN flag split (up to you) but not only for both the volatile and non-
volatile cases, following my proposal.

At least we can continue debating on a concrete patch instead of
an envisioned combination of pieces posted to the list.

If you can give some examples of use cases that the patch fixes
with and without the SEEN flag split that could be useful for the
discussion.

Thanks,
Amir.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Matthew Wilcox
On Mon, Dec 28, 2020 at 09:37:37PM +0200, Amir Goldstein wrote:
> Having said that, I never objected to the SEEN flag split.

I STRONGLY object to the SEEN flag split.  I think it is completely
unnecessary and nobody's shown me a use-case that changes my mind.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Sargun Dhillon
On Mon, Dec 28, 2020 at 9:26 AM Jeff Layton  wrote:
>
> On Mon, 2020-12-28 at 15:56 +, Matthew Wilcox wrote:
> > On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > > To be clear, the main thing you'll lose with the method above is the
> > > ability to see an unseen error on a newly opened fd, if there was an
> > > overlayfs mount using the same upper sb before your open occurred.
> > >
> > > IOW, consider two overlayfs mounts using the same upper layer sb:
> > >
> > > ovlfs1  ovlfs2
> > > --
> > > mount
> > > open fd1
> > > write to fd1
> > > 
> > > mount (upper errseq_t SEEN flag marked)
> > > open fd2
> > > syncfs(fd2)
> > > syncfs(fd1)
> > >
> > >
> > > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > > calls. The first one has a sample from before the error occurred, and
> > > the second one has a sample of 0, due to the fact that the error was
> > > unseen at open time.
> > >
> > > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > > two, then we can ensure that they both still get an error in this
> > > situation.
> >
> > But do we need to?  If the inode has been evicted we also lose the errno.
> > The guarantee we provide is that a fd that was open before the error
> > occurred will see the error.  An fd that's opened after the error occurred
> > may or may not see the error.
> >
>
> In principle, you can lose errors this way (which was the justification
> for making errseq_sample return 0 when there are unseen errors). E.g.,
> if you close fd1 instead of doing a syncfs on it, that error will be
> lost forever.
>
> As to whether that's OK, it's hard to say. It is a deviation from how
> this works in a non-containerized situation, and I'd argue that it's
> less than ideal. You may or may not see the error on fd2, but it's
> dependent on events that take place outside the container and that
> aren't observable from within it. That effectively makes the results
> non-deterministic, which is usually a bad thing in computing...
>
> --
> Jeff Layton 
>

I agree that predictable behaviour outweighs any benefit of complexity
cutting we might do here.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Jeff Layton
On Mon, 2020-12-28 at 15:56 +, Matthew Wilcox wrote:
> On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> > To be clear, the main thing you'll lose with the method above is the
> > ability to see an unseen error on a newly opened fd, if there was an
> > overlayfs mount using the same upper sb before your open occurred.
> > 
> > IOW, consider two overlayfs mounts using the same upper layer sb:
> > 
> > ovlfs1  ovlfs2
> > --
> > mount
> > open fd1
> > write to fd1
> > 
> > mount (upper errseq_t SEEN flag marked)
> > open fd2
> > syncfs(fd2)
> > syncfs(fd1)
> > 
> > 
> > On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> > calls. The first one has a sample from before the error occurred, and
> > the second one has a sample of 0, due to the fact that the error was
> > unseen at open time.
> > 
> > On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> > return an error and syncfs(fd2) will not. If we split the SEEN flag into
> > two, then we can ensure that they both still get an error in this
> > situation.
> 
> But do we need to?  If the inode has been evicted we also lose the errno.
> The guarantee we provide is that a fd that was open before the error
> occurred will see the error.  An fd that's opened after the error occurred
> may or may not see the error.
> 

In principle, you can lose errors this way (which was the justification
for making errseq_sample return 0 when there are unseen errors). E.g.,
if you close fd1 instead of doing a syncfs on it, that error will be
lost forever.

As to whether that's OK, it's hard to say. It is a deviation from how
this works in a non-containerized situation, and I'd argue that it's
less than ideal. You may or may not see the error on fd2, but it's
dependent on events that take place outside the container and that
aren't observable from within it. That effectively makes the results
non-deterministic, which is usually a bad thing in computing...

--
Jeff Layton 



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Matthew Wilcox
On Mon, Dec 28, 2020 at 08:25:50AM -0500, Jeff Layton wrote:
> To be clear, the main thing you'll lose with the method above is the
> ability to see an unseen error on a newly opened fd, if there was an
> overlayfs mount using the same upper sb before your open occurred.
> 
> IOW, consider two overlayfs mounts using the same upper layer sb:
> 
> ovlfs1ovlfs2
> --
> mount
> open fd1
> write to fd1
> 
>   mount (upper errseq_t SEEN flag marked)
> open fd2
> syncfs(fd2)
> syncfs(fd1)
> 
> 
> On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
> calls. The first one has a sample from before the error occurred, and
> the second one has a sample of 0, due to the fact that the error was
> unseen at open time.
> 
> On overlayfs, with the intervening mount of ovlfs2, syncfs(fd1) will
> return an error and syncfs(fd2) will not. If we split the SEEN flag into
> two, then we can ensure that they both still get an error in this
> situation.

But do we need to?  If the inode has been evicted we also lose the errno.
The guarantee we provide is that a fd that was open before the error
occurred will see the error.  An fd that's opened after the error occurred
may or may not see the error.



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Amir Goldstein
On Mon, Dec 28, 2020 at 3:25 PM Jeff Layton  wrote:
>
> On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:
> > On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox  wrote:
> > >
> > > On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > > > In current master, syncfs() on any file by any container user will
> > > > result in full syncfs() of the upperfs, which is very bad for container
> > > > isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> > > > his work will be merged soon. Overlayfs still does not do the writeback
> > > > and syncfs() in overlay still waits for all upper fs writeback to 
> > > > complete,
> > > > but at least syncfs() in overlay only kicks writeback for upper fs files
> > > > dirtied by this overlay.
> > > >
> > > > [1] 
> > > > https://lore.kernel.org/linux-unionfs/cajfpegsbb4itxw8zyurfvnc63zg7ku7vzpsnuzhasyzh-d5...@mail.gmail.com/
> > > >
> > > > Sharing the same SEEN flag among thousands of containers is also
> > > > far from ideal, because effectively this means that any given workload
> > > > in any single container has very little chance of observing the SEEN 
> > > > flag.
> > >
> > > Perhaps you misunderstand how errseq works.  If each container samples
> > > the errseq at startup, then they will all see any error which occurs
> > > during their lifespan
> >
> > Meant to say "...very little chance of NOT observing the SEEN flag",
> > but We are not in disagreement.
> > My argument against sharing the SEEN flag refers to Vivek's patch of
> > stacked errseq_sample()/errseq_check_and_advance() which does NOT
> > sample errseq at overlayfs mount time. That is why my next sentence is:
> > "I do agree with Matthew that overlayfs should sample errseq...".
> >
> > > (and possibly an error which occurred before they started up).
> > >
> >
> > Right. And this is where the discussion of splitting the SEEN flag started.
> > Some of us want to treat overlayfs mount time as a true epoc for errseq.
> > The new container didn't write any files yet, so it should not care about
> > writeback errors from the past.
> >
> > I agree that it may not be very critical, but as I wrote before, I think we
> > should do our best to try and isolate container workloads.
> >
> > > > To this end, I do agree with Matthew that overlayfs should sample errseq
> > > > and the best patchset to implement it so far IMO is Jeff's patchset [2].
> > > > This patch set was written to cater only "volatile" overlayfs mount, but
> > > > there is no reason not to use the same mechanism for regular overlay
> > > > mount. The only difference being that "volatile" overlay only checks for
> > > > error since mount on syncfs() (because "volatile" overlay does NOT
> > > > syncfs upper fs) and regular overlay checks and advances the overlay's
> > > > errseq sample on syncfs (and does syncfs upper fs).
> > > >
> > > > Matthew, I hope that my explanation of the use case and Jeff's answer
> > > > is sufficient to understand why the split of the SEEN flag is needed.
> > > >
> > > > [2] 
> > > > https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlay...@kernel.org/
> > >
> > > No, it still feels weird and wrong.
> > >
> >
> > All right. Considering your reservations, I think perhaps the split of the
> > SEEN flag can wait for a later time after more discussions and maybe
> > not as suitable for stable as we thought.
> >
> > I think that for stable, it would be sufficient to adapt Surgun's original
> > syncfs for volatile mount patch [1] to cover the non-volatile case:
> > on mout:
> > - errseq_sample() upper fs
> > - on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
> > on syncfs:
> > - errseq_check() for volatile mount
> > - errseq_check_and_advance() for non-volatile mount
> > - errseq_set() overlay sb on upper fs error
> >
> > Now errseq_set() is not only a hack around __sync_filesystem ignoring
> > return value of ->sync_fs(). It is really needed for per-overlay SEEN
> > error isolation in the non-volatile case.
> >
> > Unless I am missing something, I think we do not strictly need Vivek's
> > 1/3 patch [2] for stable, but not sure.
> >
> > Sargun,
> >
> > Do you agree with the above proposal?
> > Will you make it into a patch?
> >
> > Vivek, Jefff,
> >
> > Do you agree that overlay syncfs observing writeback errors that predate
> > overlay mount time is an issue that can be deferred (maybe forever)?
> >
>
> That's very application dependent.
>
> To be clear, the main thing you'll lose with the method above is the
> ability to see an unseen error on a newly opened fd, if there was an
> overlayfs mount using the same upper sb before your open occurred.
>
> IOW, consider two overlayfs mounts using the same upper layer sb:
>
> ovlfs1  ovlfs2
> --
> mount
> open fd1
> write to fd1
> 
> mount (upper errseq_t SEEN flag marked)
> open fd2
> 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-28 Thread Jeff Layton
On Fri, 2020-12-25 at 08:50 +0200, Amir Goldstein wrote:
> On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox  wrote:
> > 
> > On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > > In current master, syncfs() on any file by any container user will
> > > result in full syncfs() of the upperfs, which is very bad for container
> > > isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> > > his work will be merged soon. Overlayfs still does not do the writeback
> > > and syncfs() in overlay still waits for all upper fs writeback to 
> > > complete,
> > > but at least syncfs() in overlay only kicks writeback for upper fs files
> > > dirtied by this overlay.
> > > 
> > > [1] 
> > > https://lore.kernel.org/linux-unionfs/cajfpegsbb4itxw8zyurfvnc63zg7ku7vzpsnuzhasyzh-d5...@mail.gmail.com/
> > > 
> > > Sharing the same SEEN flag among thousands of containers is also
> > > far from ideal, because effectively this means that any given workload
> > > in any single container has very little chance of observing the SEEN flag.
> > 
> > Perhaps you misunderstand how errseq works.  If each container samples
> > the errseq at startup, then they will all see any error which occurs
> > during their lifespan
> 
> Meant to say "...very little chance of NOT observing the SEEN flag",
> but We are not in disagreement.
> My argument against sharing the SEEN flag refers to Vivek's patch of
> stacked errseq_sample()/errseq_check_and_advance() which does NOT
> sample errseq at overlayfs mount time. That is why my next sentence is:
> "I do agree with Matthew that overlayfs should sample errseq...".
> 
> > (and possibly an error which occurred before they started up).
> > 
> 
> Right. And this is where the discussion of splitting the SEEN flag started.
> Some of us want to treat overlayfs mount time as a true epoc for errseq.
> The new container didn't write any files yet, so it should not care about
> writeback errors from the past.
> 
> I agree that it may not be very critical, but as I wrote before, I think we
> should do our best to try and isolate container workloads.
> 
> > > To this end, I do agree with Matthew that overlayfs should sample errseq
> > > and the best patchset to implement it so far IMO is Jeff's patchset [2].
> > > This patch set was written to cater only "volatile" overlayfs mount, but
> > > there is no reason not to use the same mechanism for regular overlay
> > > mount. The only difference being that "volatile" overlay only checks for
> > > error since mount on syncfs() (because "volatile" overlay does NOT
> > > syncfs upper fs) and regular overlay checks and advances the overlay's
> > > errseq sample on syncfs (and does syncfs upper fs).
> > > 
> > > Matthew, I hope that my explanation of the use case and Jeff's answer
> > > is sufficient to understand why the split of the SEEN flag is needed.
> > > 
> > > [2] 
> > > https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlay...@kernel.org/
> > 
> > No, it still feels weird and wrong.
> > 
> 
> All right. Considering your reservations, I think perhaps the split of the
> SEEN flag can wait for a later time after more discussions and maybe
> not as suitable for stable as we thought.
> 
> I think that for stable, it would be sufficient to adapt Surgun's original
> syncfs for volatile mount patch [1] to cover the non-volatile case:
> on mout:
> - errseq_sample() upper fs
> - on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
> on syncfs:
> - errseq_check() for volatile mount
> - errseq_check_and_advance() for non-volatile mount
> - errseq_set() overlay sb on upper fs error
> 
> Now errseq_set() is not only a hack around __sync_filesystem ignoring
> return value of ->sync_fs(). It is really needed for per-overlay SEEN
> error isolation in the non-volatile case.
> 
> Unless I am missing something, I think we do not strictly need Vivek's
> 1/3 patch [2] for stable, but not sure.
> 
> Sargun,
> 
> Do you agree with the above proposal?
> Will you make it into a patch?
> 
> Vivek, Jefff,
> 
> Do you agree that overlay syncfs observing writeback errors that predate
> overlay mount time is an issue that can be deferred (maybe forever)?
> 

That's very application dependent.

To be clear, the main thing you'll lose with the method above is the
ability to see an unseen error on a newly opened fd, if there was an
overlayfs mount using the same upper sb before your open occurred.

IOW, consider two overlayfs mounts using the same upper layer sb:

ovlfs1  ovlfs2
--
mount
open fd1
write to fd1

mount (upper errseq_t SEEN flag marked)
open fd2
syncfs(fd2)
syncfs(fd1)


On a "normal" (non-overlay) fs, you'd get an error back on both syncfs
calls. The first one has a sample from before the error occurred, and
the second one has a sample of 0, due to the fact that the error was
unseen at open time.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-24 Thread Amir Goldstein
On Thu, Dec 24, 2020 at 2:13 PM Matthew Wilcox  wrote:
>
> On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> > In current master, syncfs() on any file by any container user will
> > result in full syncfs() of the upperfs, which is very bad for container
> > isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> > his work will be merged soon. Overlayfs still does not do the writeback
> > and syncfs() in overlay still waits for all upper fs writeback to complete,
> > but at least syncfs() in overlay only kicks writeback for upper fs files
> > dirtied by this overlay.
> >
> > [1] 
> > https://lore.kernel.org/linux-unionfs/cajfpegsbb4itxw8zyurfvnc63zg7ku7vzpsnuzhasyzh-d5...@mail.gmail.com/
> >
> > Sharing the same SEEN flag among thousands of containers is also
> > far from ideal, because effectively this means that any given workload
> > in any single container has very little chance of observing the SEEN flag.
>
> Perhaps you misunderstand how errseq works.  If each container samples
> the errseq at startup, then they will all see any error which occurs
> during their lifespan

Meant to say "...very little chance of NOT observing the SEEN flag",
but We are not in disagreement.
My argument against sharing the SEEN flag refers to Vivek's patch of
stacked errseq_sample()/errseq_check_and_advance() which does NOT
sample errseq at overlayfs mount time. That is why my next sentence is:
"I do agree with Matthew that overlayfs should sample errseq...".

> (and possibly an error which occurred before they started up).
>

Right. And this is where the discussion of splitting the SEEN flag started.
Some of us want to treat overlayfs mount time as a true epoc for errseq.
The new container didn't write any files yet, so it should not care about
writeback errors from the past.

I agree that it may not be very critical, but as I wrote before, I think we
should do our best to try and isolate container workloads.

> > To this end, I do agree with Matthew that overlayfs should sample errseq
> > and the best patchset to implement it so far IMO is Jeff's patchset [2].
> > This patch set was written to cater only "volatile" overlayfs mount, but
> > there is no reason not to use the same mechanism for regular overlay
> > mount. The only difference being that "volatile" overlay only checks for
> > error since mount on syncfs() (because "volatile" overlay does NOT
> > syncfs upper fs) and regular overlay checks and advances the overlay's
> > errseq sample on syncfs (and does syncfs upper fs).
> >
> > Matthew, I hope that my explanation of the use case and Jeff's answer
> > is sufficient to understand why the split of the SEEN flag is needed.
> >
> > [2] 
> > https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlay...@kernel.org/
>
> No, it still feels weird and wrong.
>

All right. Considering your reservations, I think perhaps the split of the
SEEN flag can wait for a later time after more discussions and maybe
not as suitable for stable as we thought.

I think that for stable, it would be sufficient to adapt Surgun's original
syncfs for volatile mount patch [1] to cover the non-volatile case:
on mout:
- errseq_sample() upper fs
- on volatile mount, errseq_check() upper fs and fail mount on un-SEEN error
on syncfs:
- errseq_check() for volatile mount
- errseq_check_and_advance() for non-volatile mount
- errseq_set() overlay sb on upper fs error

Now errseq_set() is not only a hack around __sync_filesystem ignoring
return value of ->sync_fs(). It is really needed for per-overlay SEEN
error isolation in the non-volatile case.

Unless I am missing something, I think we do not strictly need Vivek's
1/3 patch [2] for stable, but not sure.

Sargun,

Do you agree with the above proposal?
Will you make it into a patch?

Vivek, Jefff,

Do you agree that overlay syncfs observing writeback errors that predate
overlay mount time is an issue that can be deferred (maybe forever)?

BTW, in all the discussions we always assumed that stacked fsync() is correct
WRT errseq, but in fact, fsync() can also observe an unseen error that predates
overlay mount.
In order to fix that, we will probably need to split the SEEN flag and some
more errseq acrobatics, but again, not sure it is worth the effort.

Thanks,
Amir.

[1] 
https://lore.kernel.org/linux-unionfs/20201202092720.41522-1-sar...@sargun.me/
[2] https://lore.kernel.org/linux-unionfs/20201222151752.ga3...@redhat.com/


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-24 Thread Matthew Wilcox
On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> In current master, syncfs() on any file by any container user will
> result in full syncfs() of the upperfs, which is very bad for container
> isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> his work will be merged soon. Overlayfs still does not do the writeback
> and syncfs() in overlay still waits for all upper fs writeback to complete,
> but at least syncfs() in overlay only kicks writeback for upper fs files
> dirtied by this overlay.
> 
> [1] 
> https://lore.kernel.org/linux-unionfs/cajfpegsbb4itxw8zyurfvnc63zg7ku7vzpsnuzhasyzh-d5...@mail.gmail.com/
> 
> Sharing the same SEEN flag among thousands of containers is also
> far from ideal, because effectively this means that any given workload
> in any single container has very little chance of observing the SEEN flag.

Perhaps you misunderstand how errseq works.  If each container samples
the errseq at startup, then they will all see any error which occurs
during their lifespan (and possibly an error which occurred before they
started up).

> To this end, I do agree with Matthew that overlayfs should sample errseq
> and the best patchset to implement it so far IMO is Jeff's patchset [2].
> This patch set was written to cater only "volatile" overlayfs mount, but
> there is no reason not to use the same mechanism for regular overlay
> mount. The only difference being that "volatile" overlay only checks for
> error since mount on syncfs() (because "volatile" overlay does NOT
> syncfs upper fs) and regular overlay checks and advances the overlay's
> errseq sample on syncfs (and does syncfs upper fs).
> 
> Matthew, I hope that my explanation of the use case and Jeff's answer
> is sufficient to understand why the split of the SEEN flag is needed.
> 
> [2] 
> https://lore.kernel.org/linux-unionfs/20201213132713.66864-1-jlay...@kernel.org/

No, it still feels weird and wrong.



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-24 Thread Sargun Dhillon
On Thu, Dec 24, 2020 at 11:32:55AM +0200, Amir Goldstein wrote:
> On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox  wrote:
> >
> > On Wed, Dec 23, 2020 at 08:21:41PM +, Sargun Dhillon wrote:
> > > On Wed, Dec 23, 2020 at 08:07:46PM +, Matthew Wilcox wrote:
> > > > On Wed, Dec 23, 2020 at 07:29:41PM +, Sargun Dhillon wrote:
> > > > > On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> > > > > > On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > > > > > > I fail to see why this is neccessary if you incorporate error 
> > > > > > > reporting into the
> > > > > > > sync_fs callback. Why is this separate from that callback? If you 
> > > > > > > pickup Jeff's
> > > > > > > patch that adds the 2nd flag to errseq for "observed", you should 
> > > > > > > be able to
> > > > > > > stash the first errseq seen in the ovl_fs struct, and do the 
> > > > > > > check-and-return
> > > > > > > in there instead instead of adding this new infrastructure.
> > > > > >
> > > > > > You still haven't explained why you want to add the "observed" flag.
> > > > >
> > > > >
> > > > > In the overlayfs model, many users may be using the same filesystem 
> > > > > (super block)
> > > > > for their upperdir. Let's say you have something like this:
> > > > >
> > > > > /workdir [Mounted FS]
> > > > > /workdir/upperdir1 [overlayfs upperdir]
> > > > > /workdir/upperdir2 [overlayfs upperdir]
> > > > > /workdir/userscratchspace
> > > > >
> > > > > The user needs to be able to do something like:
> > > > > sync -f ${overlayfs1}/file
> > > > >
> > > > > which in turn will call sync on the the underlying filesystem (the 
> > > > > one mounted
> > > > > on /workdir), and can check if the errseq has changed since the 
> > > > > overlayfs was
> > > > > mounted, and use that to return an error to the user.
> > > >
> > > > OK, but I don't see why the current scheme doesn't work for this.  If
> > > > (each instance of) overlayfs samples the errseq at mount time and then
> > > > check_and_advances it at sync time, it will see any error that has 
> > > > occurred
> > > > since the mount happened (and possibly also an error which occurred 
> > > > before
> > > > the mount happened, but hadn't been reported to anybody before).
> > > >
> > >
> > > If there is an outstanding error at mount time, and the SEEN flag is 
> > > unset,
> > > subsequent errors will not increment the counter, until the user calls 
> > > sync on
> > > the upperdir's filesystem. If overlayfs calls check_and_advance on the 
> > > upperdir's
> > > super block at any point, it will then set the seen block, and if the 
> > > user calls
> > > syncfs on the upperdir, it will not return that there is an outstanding 
> > > error,
> > > since overlayfs just cleared it.
> >
> > Your concern is this case:
> >
> > fs is mounted on /workdir
> > /workdir/A is written to and then closed.
> > writeback happens and -EIO happens, but there's nobody around to care.
> > /workdir/upperdir1 becomes part of an overlayfs mount
> > overlayfs samples the error
> > a user writes to /workdir/B, another -EIO occurs, but nothing happens
> > someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> > a user opens /workdir/B and calls syncfs, but sees no error
> >
> > do i have that right?  or is it something else?
> 
> IMO it is something else. Others may disagree.
> IMO the level of interference between users accessing overlay and users
> accessing upper fs directly is not well defined and it can stay this way.
> 
> Concurrent access to  /workdir/upperdir/A via overlay and underlying fs
> is explicitly warranted against in Documentation/filesystems/overlayfs.rst#
> Changes to underlying filesystems:
> "Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."
> 
> The question is whether syncfs(open(/workdir/B)) is considered
> "Changes to the underlying filesystems". Regardless of the answer,
> this is not an interesting case IMO.
> 
> The real issue is with interference between overlays that share the
> same upper fs, because this is by far and large the common use case
> that is creating real problems for a lot of container users.
> 
> Workloads running inside containers (with overlayfs storage driver)
> will never be as isolated as workloads running inside VMs, but it
> doesn't mean we cannot try to improve.
> 
> In current master, syncfs() on any file by any container user will
> result in full syncfs() of the upperfs, which is very bad for container
> isolation. This has been partly fixed by Chengguang Xu [1] and I expect
> his work will be merged soon. Overlayfs still does not do the writeback
> and syncfs() in overlay still waits for all upper fs writeback to complete,
> but at least syncfs() in overlay only kicks writeback for upper fs files
> dirtied by this overlay.
> 
> [1] 
> 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-24 Thread Amir Goldstein
On Wed, Dec 23, 2020 at 10:44 PM Matthew Wilcox  wrote:
>
> On Wed, Dec 23, 2020 at 08:21:41PM +, Sargun Dhillon wrote:
> > On Wed, Dec 23, 2020 at 08:07:46PM +, Matthew Wilcox wrote:
> > > On Wed, Dec 23, 2020 at 07:29:41PM +, Sargun Dhillon wrote:
> > > > On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> > > > > On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > > > > > I fail to see why this is neccessary if you incorporate error 
> > > > > > reporting into the
> > > > > > sync_fs callback. Why is this separate from that callback? If you 
> > > > > > pickup Jeff's
> > > > > > patch that adds the 2nd flag to errseq for "observed", you should 
> > > > > > be able to
> > > > > > stash the first errseq seen in the ovl_fs struct, and do the 
> > > > > > check-and-return
> > > > > > in there instead instead of adding this new infrastructure.
> > > > >
> > > > > You still haven't explained why you want to add the "observed" flag.
> > > >
> > > >
> > > > In the overlayfs model, many users may be using the same filesystem 
> > > > (super block)
> > > > for their upperdir. Let's say you have something like this:
> > > >
> > > > /workdir [Mounted FS]
> > > > /workdir/upperdir1 [overlayfs upperdir]
> > > > /workdir/upperdir2 [overlayfs upperdir]
> > > > /workdir/userscratchspace
> > > >
> > > > The user needs to be able to do something like:
> > > > sync -f ${overlayfs1}/file
> > > >
> > > > which in turn will call sync on the the underlying filesystem (the one 
> > > > mounted
> > > > on /workdir), and can check if the errseq has changed since the 
> > > > overlayfs was
> > > > mounted, and use that to return an error to the user.
> > >
> > > OK, but I don't see why the current scheme doesn't work for this.  If
> > > (each instance of) overlayfs samples the errseq at mount time and then
> > > check_and_advances it at sync time, it will see any error that has 
> > > occurred
> > > since the mount happened (and possibly also an error which occurred before
> > > the mount happened, but hadn't been reported to anybody before).
> > >
> >
> > If there is an outstanding error at mount time, and the SEEN flag is unset,
> > subsequent errors will not increment the counter, until the user calls sync 
> > on
> > the upperdir's filesystem. If overlayfs calls check_and_advance on the 
> > upperdir's
> > super block at any point, it will then set the seen block, and if the user 
> > calls
> > syncfs on the upperdir, it will not return that there is an outstanding 
> > error,
> > since overlayfs just cleared it.
>
> Your concern is this case:
>
> fs is mounted on /workdir
> /workdir/A is written to and then closed.
> writeback happens and -EIO happens, but there's nobody around to care.
> /workdir/upperdir1 becomes part of an overlayfs mount
> overlayfs samples the error
> a user writes to /workdir/B, another -EIO occurs, but nothing happens
> someone calls syncfs on /workdir/upperdir/A, gets the EIO.
> a user opens /workdir/B and calls syncfs, but sees no error
>
> do i have that right?  or is it something else?

IMO it is something else. Others may disagree.
IMO the level of interference between users accessing overlay and users
accessing upper fs directly is not well defined and it can stay this way.

Concurrent access to  /workdir/upperdir/A via overlay and underlying fs
is explicitly warranted against in Documentation/filesystems/overlayfs.rst#
Changes to underlying filesystems:
"Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed.  If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock."

The question is whether syncfs(open(/workdir/B)) is considered
"Changes to the underlying filesystems". Regardless of the answer,
this is not an interesting case IMO.

The real issue is with interference between overlays that share the
same upper fs, because this is by far and large the common use case
that is creating real problems for a lot of container users.

Workloads running inside containers (with overlayfs storage driver)
will never be as isolated as workloads running inside VMs, but it
doesn't mean we cannot try to improve.

In current master, syncfs() on any file by any container user will
result in full syncfs() of the upperfs, which is very bad for container
isolation. This has been partly fixed by Chengguang Xu [1] and I expect
his work will be merged soon. Overlayfs still does not do the writeback
and syncfs() in overlay still waits for all upper fs writeback to complete,
but at least syncfs() in overlay only kicks writeback for upper fs files
dirtied by this overlay.

[1] 
https://lore.kernel.org/linux-unionfs/cajfpegsbb4itxw8zyurfvnc63zg7ku7vzpsnuzhasyzh-d5...@mail.gmail.com/

Sharing the same SEEN flag among thousands of containers is also
far from ideal, because effectively this means that any given workload
in any single container has very little chance 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Matthew Wilcox
On Wed, Dec 23, 2020 at 08:21:41PM +, Sargun Dhillon wrote:
> On Wed, Dec 23, 2020 at 08:07:46PM +, Matthew Wilcox wrote:
> > On Wed, Dec 23, 2020 at 07:29:41PM +, Sargun Dhillon wrote:
> > > On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> > > > On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > > > > I fail to see why this is neccessary if you incorporate error 
> > > > > reporting into the 
> > > > > sync_fs callback. Why is this separate from that callback? If you 
> > > > > pickup Jeff's
> > > > > patch that adds the 2nd flag to errseq for "observed", you should be 
> > > > > able to
> > > > > stash the first errseq seen in the ovl_fs struct, and do the 
> > > > > check-and-return
> > > > > in there instead instead of adding this new infrastructure.
> > > > 
> > > > You still haven't explained why you want to add the "observed" flag.
> > > 
> > > 
> > > In the overlayfs model, many users may be using the same filesystem 
> > > (super block)
> > > for their upperdir. Let's say you have something like this:
> > > 
> > > /workdir [Mounted FS]
> > > /workdir/upperdir1 [overlayfs upperdir]
> > > /workdir/upperdir2 [overlayfs upperdir]
> > > /workdir/userscratchspace
> > > 
> > > The user needs to be able to do something like:
> > > sync -f ${overlayfs1}/file
> > > 
> > > which in turn will call sync on the the underlying filesystem (the one 
> > > mounted 
> > > on /workdir), and can check if the errseq has changed since the overlayfs 
> > > was
> > > mounted, and use that to return an error to the user.
> > 
> > OK, but I don't see why the current scheme doesn't work for this.  If
> > (each instance of) overlayfs samples the errseq at mount time and then
> > check_and_advances it at sync time, it will see any error that has occurred
> > since the mount happened (and possibly also an error which occurred before
> > the mount happened, but hadn't been reported to anybody before).
> > 
> 
> If there is an outstanding error at mount time, and the SEEN flag is unset, 
> subsequent errors will not increment the counter, until the user calls sync on
> the upperdir's filesystem. If overlayfs calls check_and_advance on the 
> upperdir's
> super block at any point, it will then set the seen block, and if the user 
> calls
> syncfs on the upperdir, it will not return that there is an outstanding error,
> since overlayfs just cleared it.

Your concern is this case:

fs is mounted on /workdir
/workdir/A is written to and then closed.
writeback happens and -EIO happens, but there's nobody around to care.
/workdir/upperdir1 becomes part of an overlayfs mount
overlayfs samples the error
a user writes to /workdir/B, another -EIO occurs, but nothing happens
someone calls syncfs on /workdir/upperdir/A, gets the EIO.
a user opens /workdir/B and calls syncfs, but sees no error

do i have that right?  or is it something else?

> > > If we do not advance the errseq on the upperdir to "mark it as seen", 
> > > that means 
> > > future errors will not be reported if the user calls sync -f 
> > > ${overlayfs1}/file,
> > > because errseq will not increment the value if the seen bit is unset.
> > > 
> > > On the other hand, if we mark it as seen, then if the user calls sync on 
> > > /workdir/userscratchspace/file, they wont see the error since we just set 
> > > the 
> > > SEEN flag.
> > 
> > While we set the SEEN flag, if the file were opened before the error
> > occurred, we would still report the error because the sequence is higher
> > than it was when we sampled the error.
> > 
> 
> Right, this isn't a problem for people calling f(data)sync on a particular 
> file, 
> because it takes its own snapshot of errseq. This is only problematic for 
> folks 
> calling syncfs. In Jeff's other messages, it sounded like this behaviour is
> pretty important, and the likes of postgresql depend on it.

i would suggest that in the example above, the error _didn't_ occur
while calling syncfs(), it occurred before we synced the filesystem,
and we don't have to report it in that case.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Sargun Dhillon
On Wed, Dec 23, 2020 at 08:07:46PM +, Matthew Wilcox wrote:
> On Wed, Dec 23, 2020 at 07:29:41PM +, Sargun Dhillon wrote:
> > On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> > > On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > > > I fail to see why this is neccessary if you incorporate error reporting 
> > > > into the 
> > > > sync_fs callback. Why is this separate from that callback? If you 
> > > > pickup Jeff's
> > > > patch that adds the 2nd flag to errseq for "observed", you should be 
> > > > able to
> > > > stash the first errseq seen in the ovl_fs struct, and do the 
> > > > check-and-return
> > > > in there instead instead of adding this new infrastructure.
> > > 
> > > You still haven't explained why you want to add the "observed" flag.
> > 
> > 
> > In the overlayfs model, many users may be using the same filesystem (super 
> > block)
> > for their upperdir. Let's say you have something like this:
> > 
> > /workdir [Mounted FS]
> > /workdir/upperdir1 [overlayfs upperdir]
> > /workdir/upperdir2 [overlayfs upperdir]
> > /workdir/userscratchspace
> > 
> > The user needs to be able to do something like:
> > sync -f ${overlayfs1}/file
> > 
> > which in turn will call sync on the the underlying filesystem (the one 
> > mounted 
> > on /workdir), and can check if the errseq has changed since the overlayfs 
> > was
> > mounted, and use that to return an error to the user.
> 
> OK, but I don't see why the current scheme doesn't work for this.  If
> (each instance of) overlayfs samples the errseq at mount time and then
> check_and_advances it at sync time, it will see any error that has occurred
> since the mount happened (and possibly also an error which occurred before
> the mount happened, but hadn't been reported to anybody before).
> 

If there is an outstanding error at mount time, and the SEEN flag is unset, 
subsequent errors will not increment the counter, until the user calls sync on
the upperdir's filesystem. If overlayfs calls check_and_advance on the 
upperdir's
super block at any point, it will then set the seen block, and if the user calls
syncfs on the upperdir, it will not return that there is an outstanding error,
since overlayfs just cleared it.


> > If we do not advance the errseq on the upperdir to "mark it as seen", that 
> > means 
> > future errors will not be reported if the user calls sync -f 
> > ${overlayfs1}/file,
> > because errseq will not increment the value if the seen bit is unset.
> > 
> > On the other hand, if we mark it as seen, then if the user calls sync on 
> > /workdir/userscratchspace/file, they wont see the error since we just set 
> > the 
> > SEEN flag.
> 
> While we set the SEEN flag, if the file were opened before the error
> occurred, we would still report the error because the sequence is higher
> than it was when we sampled the error.
> 

Right, this isn't a problem for people calling f(data)sync on a particular 
file, 
because it takes its own snapshot of errseq. This is only problematic for folks 
calling syncfs. In Jeff's other messages, it sounded like this behaviour is
pretty important, and the likes of postgresql depend on it.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Matthew Wilcox
On Wed, Dec 23, 2020 at 07:29:41PM +, Sargun Dhillon wrote:
> On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> > On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > > I fail to see why this is neccessary if you incorporate error reporting 
> > > into the 
> > > sync_fs callback. Why is this separate from that callback? If you pickup 
> > > Jeff's
> > > patch that adds the 2nd flag to errseq for "observed", you should be able 
> > > to
> > > stash the first errseq seen in the ovl_fs struct, and do the 
> > > check-and-return
> > > in there instead instead of adding this new infrastructure.
> > 
> > You still haven't explained why you want to add the "observed" flag.
> 
> 
> In the overlayfs model, many users may be using the same filesystem (super 
> block)
> for their upperdir. Let's say you have something like this:
> 
> /workdir [Mounted FS]
> /workdir/upperdir1 [overlayfs upperdir]
> /workdir/upperdir2 [overlayfs upperdir]
> /workdir/userscratchspace
> 
> The user needs to be able to do something like:
> sync -f ${overlayfs1}/file
> 
> which in turn will call sync on the the underlying filesystem (the one 
> mounted 
> on /workdir), and can check if the errseq has changed since the overlayfs was
> mounted, and use that to return an error to the user.

OK, but I don't see why the current scheme doesn't work for this.  If
(each instance of) overlayfs samples the errseq at mount time and then
check_and_advances it at sync time, it will see any error that has occurred
since the mount happened (and possibly also an error which occurred before
the mount happened, but hadn't been reported to anybody before).

> If we do not advance the errseq on the upperdir to "mark it as seen", that 
> means 
> future errors will not be reported if the user calls sync -f 
> ${overlayfs1}/file,
> because errseq will not increment the value if the seen bit is unset.
> 
> On the other hand, if we mark it as seen, then if the user calls sync on 
> /workdir/userscratchspace/file, they wont see the error since we just set the 
> SEEN flag.

While we set the SEEN flag, if the file were opened before the error
occurred, we would still report the error because the sequence is higher
than it was when we sampled the error.



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Sargun Dhillon
On Wed, Dec 23, 2020 at 06:50:44PM +, Matthew Wilcox wrote:
> On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> > I fail to see why this is neccessary if you incorporate error reporting 
> > into the 
> > sync_fs callback. Why is this separate from that callback? If you pickup 
> > Jeff's
> > patch that adds the 2nd flag to errseq for "observed", you should be able to
> > stash the first errseq seen in the ovl_fs struct, and do the 
> > check-and-return
> > in there instead instead of adding this new infrastructure.
> 
> You still haven't explained why you want to add the "observed" flag.


In the overlayfs model, many users may be using the same filesystem (super 
block)
for their upperdir. Let's say you have something like this:

/workdir [Mounted FS]
/workdir/upperdir1 [overlayfs upperdir]
/workdir/upperdir2 [overlayfs upperdir]
/workdir/userscratchspace

The user needs to be able to do something like:
sync -f ${overlayfs1}/file

which in turn will call sync on the the underlying filesystem (the one mounted 
on /workdir), and can check if the errseq has changed since the overlayfs was
mounted, and use that to return an error to the user.

If we do not advance the errseq on the upperdir to "mark it as seen", that 
means 
future errors will not be reported if the user calls sync -f ${overlayfs1}/file,
because errseq will not increment the value if the seen bit is unset.

On the other hand, if we mark it as seen, then if the user calls sync on 
/workdir/userscratchspace/file, they wont see the error since we just set the 
SEEN flag.

You need a new flag (observed) to differentiate between "Seen and reported to 
user" versus "seen by a second-order system, so should now increment".

One alternative is to always increment the errseq error counter, but I've
gotta imagine there's a reason that wasn't done in the first place.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Jeff Layton
On Wed, 2020-12-23 at 18:20 +, Sargun Dhillon wrote:
> On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > Currently syncfs() and fsync() seem to be two interfaces which check and
> > return writeback errors on superblock to user space. fsync() should
> > work fine with overlayfs as it relies on underlying filesystem to
> > do the check and return error. For example, if ext4 is on upper filesystem,
> > then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
> > upper file and returns error. So overlayfs does not have to do anything
> > special.
> > 
> > But with syncfs(), error check happens in vfs in syncfs() w.r.t
> > overlay_sb->s_wb_err. Given overlayfs is stacked filesystem, it
> > does not do actual writeback and all writeback errors are recorded
> > on underlying filesystem. So sb->s_wb_err is never updated hence
> > syncfs() does not work with overlay.
> > 
> > Jeff suggested that instead of trying to propagate errors to overlay
> > super block, why not simply check for errors against upper filesystem
> > super block. I implemented this idea.
> > 
> > Overlay file has "since" value which needs to be initialized at open
> > time. Overlay overrides VFS initialization and re-initializes
> > f->f_sb_err w.r.t upper super block. Later when
> > ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as
> > since value to figure out if any error on upper sb has happened since
> > then.
> > 
> > Note, Right now this patch only deals with regular file and directories.
> > Yet to deal with special files like device inodes, socket, fifo etc.
> > 
> > Suggested-by: Jeff Layton 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/overlayfs/file.c  |  1 +
> >  fs/overlayfs/overlayfs.h |  1 +
> >  fs/overlayfs/readdir.c   |  1 +
> >  fs/overlayfs/super.c | 23 +++
> >  fs/overlayfs/util.c  | 13 +
> >  5 files changed, 39 insertions(+)
> > 
> > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > index efccb7c1f9bc..7b58a44dcb71 100644
> > --- a/fs/overlayfs/file.c
> > +++ b/fs/overlayfs/file.c
> > @@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file 
> > *file)
> >     return PTR_ERR(realfile);
> >  
> > 
> >     file->private_data = realfile;
> > +   ovl_init_file_errseq(file);
> >  
> > 
> >     return 0;
> >  }
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index f8880aa2ba0e..47838abbfb3d 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct 
> > dentry *dentry);
> >  bool ovl_is_metacopy_dentry(struct dentry *dentry);
> >  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
> >  int padding);
> > +void ovl_init_file_errseq(struct file *file);
> >  
> > 
> >  static inline bool ovl_is_impuredir(struct super_block *sb,
> >     struct dentry *dentry)
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 01620ebae1bd..0c48f1545483 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct 
> > file *file)
> >     od->is_real = ovl_dir_is_real(file->f_path.dentry);
> >     od->is_upper = OVL_TYPE_UPPER(type);
> >     file->private_data = od;
> > +   ovl_init_file_errseq(file);
> >  
> > 
> >     return 0;
> >  }
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index 290983bcfbb3..d99867983722 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int 
> > *flags, char *data)
> >     return ret;
> >  }
> >  
> > 
> > +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> > *file)
> > +{
> > +   struct ovl_fs *ofs = sb->s_fs_info;
> > +   struct super_block *upper_sb;
> > +   int ret;
> > +
> > +   if (!ovl_upper_mnt(ofs))
> > +   return 0;
> > +
> > +   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +
> > +   if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> > +   return 0;
> > +
> > +   /* Something changed, must use slow path */
> > +   spin_lock(>f_lock);
> > +   ret = errseq_check_and_advance(_sb->s_wb_err, >f_sb_err);
> > +   spin_unlock(>f_lock);
> > +
> > +   return ret;
> > +}
> > +
> >  static const struct super_operations ovl_super_operations = {
> >     .alloc_inode= ovl_alloc_inode,
> >     .free_inode = ovl_free_inode,
> > @@ -400,6 +422,7 @@ static const struct super_operations 
> > ovl_super_operations = {
> >     .statfs = ovl_statfs,
> >     .show_options   = ovl_show_options,
> >     .remount_fs = ovl_remount,
> > +   .errseq_check_advance   = ovl_errseq_check_advance,
> >  };
> >  
> > 
> >  enum {
> > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> > index 23f475627d07..a1742847f3a8 100644
> > --- a/fs/overlayfs/util.c

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Matthew Wilcox
On Wed, Dec 23, 2020 at 06:20:27PM +, Sargun Dhillon wrote:
> I fail to see why this is neccessary if you incorporate error reporting into 
> the 
> sync_fs callback. Why is this separate from that callback? If you pickup 
> Jeff's
> patch that adds the 2nd flag to errseq for "observed", you should be able to
> stash the first errseq seen in the ovl_fs struct, and do the check-and-return
> in there instead instead of adding this new infrastructure.

You still haven't explained why you want to add the "observed" flag.


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Sargun Dhillon
On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> Currently syncfs() and fsync() seem to be two interfaces which check and
> return writeback errors on superblock to user space. fsync() should
> work fine with overlayfs as it relies on underlying filesystem to
> do the check and return error. For example, if ext4 is on upper filesystem,
> then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
> upper file and returns error. So overlayfs does not have to do anything
> special.
> 
> But with syncfs(), error check happens in vfs in syncfs() w.r.t
> overlay_sb->s_wb_err. Given overlayfs is stacked filesystem, it
> does not do actual writeback and all writeback errors are recorded
> on underlying filesystem. So sb->s_wb_err is never updated hence
> syncfs() does not work with overlay.
> 
> Jeff suggested that instead of trying to propagate errors to overlay
> super block, why not simply check for errors against upper filesystem
> super block. I implemented this idea.
> 
> Overlay file has "since" value which needs to be initialized at open
> time. Overlay overrides VFS initialization and re-initializes
> f->f_sb_err w.r.t upper super block. Later when
> ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as
> since value to figure out if any error on upper sb has happened since
> then.
> 
> Note, Right now this patch only deals with regular file and directories.
> Yet to deal with special files like device inodes, socket, fifo etc.
> 
> Suggested-by: Jeff Layton 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/overlayfs/file.c  |  1 +
>  fs/overlayfs/overlayfs.h |  1 +
>  fs/overlayfs/readdir.c   |  1 +
>  fs/overlayfs/super.c | 23 +++
>  fs/overlayfs/util.c  | 13 +
>  5 files changed, 39 insertions(+)
> 
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..7b58a44dcb71 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file 
> *file)
>   return PTR_ERR(realfile);
>  
>   file->private_data = realfile;
> + ovl_init_file_errseq(file);
>  
>   return 0;
>  }
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index f8880aa2ba0e..47838abbfb3d 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct 
> dentry *dentry);
>  bool ovl_is_metacopy_dentry(struct dentry *dentry);
>  char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
>int padding);
> +void ovl_init_file_errseq(struct file *file);
>  
>  static inline bool ovl_is_impuredir(struct super_block *sb,
>   struct dentry *dentry)
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 01620ebae1bd..0c48f1545483 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct file 
> *file)
>   od->is_real = ovl_dir_is_real(file->f_path.dentry);
>   od->is_upper = OVL_TYPE_UPPER(type);
>   file->private_data = od;
> + ovl_init_file_errseq(file);
>  
>   return 0;
>  }
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 290983bcfbb3..d99867983722 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int 
> *flags, char *data)
>   return ret;
>  }
>  
> +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> *file)
> +{
> + struct ovl_fs *ofs = sb->s_fs_info;
> + struct super_block *upper_sb;
> + int ret;
> +
> + if (!ovl_upper_mnt(ofs))
> + return 0;
> +
> + upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +
> + if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> + return 0;
> +
> + /* Something changed, must use slow path */
> + spin_lock(>f_lock);
> + ret = errseq_check_and_advance(_sb->s_wb_err, >f_sb_err);
> + spin_unlock(>f_lock);
> +
> + return ret;
> +}
> +
>  static const struct super_operations ovl_super_operations = {
>   .alloc_inode= ovl_alloc_inode,
>   .free_inode = ovl_free_inode,
> @@ -400,6 +422,7 @@ static const struct super_operations ovl_super_operations 
> = {
>   .statfs = ovl_statfs,
>   .show_options   = ovl_show_options,
>   .remount_fs = ovl_remount,
> + .errseq_check_advance   = ovl_errseq_check_advance,
>  };
>  
>  enum {
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 23f475627d07..a1742847f3a8 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -950,3 +950,16 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct 
> dentry *dentry,
>   kfree(buf);
>   return ERR_PTR(res);
>  }
> +
> +void ovl_init_file_errseq(struct file *file)
> +{
> + struct super_block *sb = 

Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-23 Thread Jeff Layton
On Tue, 2020-12-22 at 12:55 -0500, Vivek Goyal wrote:
> On Tue, Dec 22, 2020 at 05:46:37PM +, Matthew Wilcox wrote:
> > On Tue, Dec 22, 2020 at 11:29:25AM -0500, Vivek Goyal wrote:
> > > On Tue, Dec 22, 2020 at 04:20:27PM +, Matthew Wilcox wrote:
> > > > On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > > > > +static int ovl_errseq_check_advance(struct super_block *sb, struct 
> > > > > file *file)
> > > > > +{
> > > > > + struct ovl_fs *ofs = sb->s_fs_info;
> > > > > + struct super_block *upper_sb;
> > > > > + int ret;
> > > > > +
> > > > > + if (!ovl_upper_mnt(ofs))
> > > > > + return 0;
> > > > > +
> > > > > + upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > > +
> > > > > + if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> > > > > + return 0;
> > > > > +
> > > > > + /* Something changed, must use slow path */
> > > > > + spin_lock(>f_lock);
> > > > > + ret = errseq_check_and_advance(_sb->s_wb_err, 
> > > > > >f_sb_err);
> > > > > + spin_unlock(>f_lock);
> > > > 
> > > > Why are you microoptimising syncfs()?  Are there really applications 
> > > > which
> > > > call syncfs() in a massively parallel manner on the same file 
> > > > descriptor?
> > > 
> > > This is atleast theoritical race. I am not aware which application can
> > > trigger this race. So to me it makes sense to fix the race.
> > > 
> > > Jeff Layton also posted a fix for syncfs().
> > > 
> > > https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlay...@kernel.org/
> > > 
> > > To me it makes sense to fix the race irrespective of the fact if somebody
> > > hit it or not. People end up copying code in other parts of kernel and
> > > and they will atleast copy race free code.
> > 
> > Let me try again.  "Why are you trying to avoid taking the spinlock?"
> 
> Aha.., sorry, I misunderstood your question. I don't have a good answer.
> I just copied the code from Jeff Layton's patch.
> 
> Agreed that cost of taking spin lock will not be significant until
> syncfs() is called at high frequency. Having said that, most of the
> time taking spin lock will not be needed, so avoiding it with
> a simple call to errseq_check() sounds reasonable too.
> 
> I don't have any strong opinions here. I am fine with any of the
> implementation people like.
> 

It is a micro-optimization, but we'll almost always be able to avoid
taking the lock altogether. Errors here should be very, very infrequent.

That said I don't have strong feelings on this either.
-- 
Jeff Layton 



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-22 Thread Vivek Goyal
On Tue, Dec 22, 2020 at 05:46:37PM +, Matthew Wilcox wrote:
> On Tue, Dec 22, 2020 at 11:29:25AM -0500, Vivek Goyal wrote:
> > On Tue, Dec 22, 2020 at 04:20:27PM +, Matthew Wilcox wrote:
> > > On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > > > +static int ovl_errseq_check_advance(struct super_block *sb, struct 
> > > > file *file)
> > > > +{
> > > > +   struct ovl_fs *ofs = sb->s_fs_info;
> > > > +   struct super_block *upper_sb;
> > > > +   int ret;
> > > > +
> > > > +   if (!ovl_upper_mnt(ofs))
> > > > +   return 0;
> > > > +
> > > > +   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > > +
> > > > +   if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> > > > +   return 0;
> > > > +
> > > > +   /* Something changed, must use slow path */
> > > > +   spin_lock(>f_lock);
> > > > +   ret = errseq_check_and_advance(_sb->s_wb_err, 
> > > > >f_sb_err);
> > > > +   spin_unlock(>f_lock);
> > > 
> > > Why are you microoptimising syncfs()?  Are there really applications which
> > > call syncfs() in a massively parallel manner on the same file descriptor?
> > 
> > This is atleast theoritical race. I am not aware which application can
> > trigger this race. So to me it makes sense to fix the race.
> > 
> > Jeff Layton also posted a fix for syncfs().
> > 
> > https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlay...@kernel.org/
> > 
> > To me it makes sense to fix the race irrespective of the fact if somebody
> > hit it or not. People end up copying code in other parts of kernel and
> > and they will atleast copy race free code.
> 
> Let me try again.  "Why are you trying to avoid taking the spinlock?"

Aha.., sorry, I misunderstood your question. I don't have a good answer.
I just copied the code from Jeff Layton's patch.

Agreed that cost of taking spin lock will not be significant until
syncfs() is called at high frequency. Having said that, most of the
time taking spin lock will not be needed, so avoiding it with
a simple call to errseq_check() sounds reasonable too.

I don't have any strong opinions here. I am fine with any of the
implementation people like.

Vivek



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-22 Thread Matthew Wilcox
On Tue, Dec 22, 2020 at 11:29:25AM -0500, Vivek Goyal wrote:
> On Tue, Dec 22, 2020 at 04:20:27PM +, Matthew Wilcox wrote:
> > On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > > +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> > > *file)
> > > +{
> > > + struct ovl_fs *ofs = sb->s_fs_info;
> > > + struct super_block *upper_sb;
> > > + int ret;
> > > +
> > > + if (!ovl_upper_mnt(ofs))
> > > + return 0;
> > > +
> > > + upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > > +
> > > + if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> > > + return 0;
> > > +
> > > + /* Something changed, must use slow path */
> > > + spin_lock(>f_lock);
> > > + ret = errseq_check_and_advance(_sb->s_wb_err, >f_sb_err);
> > > + spin_unlock(>f_lock);
> > 
> > Why are you microoptimising syncfs()?  Are there really applications which
> > call syncfs() in a massively parallel manner on the same file descriptor?
> 
> This is atleast theoritical race. I am not aware which application can
> trigger this race. So to me it makes sense to fix the race.
> 
> Jeff Layton also posted a fix for syncfs().
> 
> https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlay...@kernel.org/
> 
> To me it makes sense to fix the race irrespective of the fact if somebody
> hit it or not. People end up copying code in other parts of kernel and
> and they will atleast copy race free code.

Let me try again.  "Why are you trying to avoid taking the spinlock?"


Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-22 Thread Vivek Goyal
On Tue, Dec 22, 2020 at 04:20:27PM +, Matthew Wilcox wrote:
> On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> > +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> > *file)
> > +{
> > +   struct ovl_fs *ofs = sb->s_fs_info;
> > +   struct super_block *upper_sb;
> > +   int ret;
> > +
> > +   if (!ovl_upper_mnt(ofs))
> > +   return 0;
> > +
> > +   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> > +
> > +   if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> > +   return 0;
> > +
> > +   /* Something changed, must use slow path */
> > +   spin_lock(>f_lock);
> > +   ret = errseq_check_and_advance(_sb->s_wb_err, >f_sb_err);
> > +   spin_unlock(>f_lock);
> 
> Why are you microoptimising syncfs()?  Are there really applications which
> call syncfs() in a massively parallel manner on the same file descriptor?

This is atleast theoritical race. I am not aware which application can
trigger this race. So to me it makes sense to fix the race.

Jeff Layton also posted a fix for syncfs().

https://lore.kernel.org/linux-fsdevel/20201219134804.20034-1-jlay...@kernel.org/

To me it makes sense to fix the race irrespective of the fact if somebody
hit it or not. People end up copying code in other parts of kernel and
and they will atleast copy race free code.

Vivek



Re: [PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-22 Thread Matthew Wilcox
On Mon, Dec 21, 2020 at 02:50:55PM -0500, Vivek Goyal wrote:
> +static int ovl_errseq_check_advance(struct super_block *sb, struct file 
> *file)
> +{
> + struct ovl_fs *ofs = sb->s_fs_info;
> + struct super_block *upper_sb;
> + int ret;
> +
> + if (!ovl_upper_mnt(ofs))
> + return 0;
> +
> + upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
> +
> + if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
> + return 0;
> +
> + /* Something changed, must use slow path */
> + spin_lock(>f_lock);
> + ret = errseq_check_and_advance(_sb->s_wb_err, >f_sb_err);
> + spin_unlock(>f_lock);

Why are you microoptimising syncfs()?  Are there really applications which
call syncfs() in a massively parallel manner on the same file descriptor?


[PATCH 3/3] overlayfs: Report writeback errors on upper

2020-12-21 Thread Vivek Goyal
Currently syncfs() and fsync() seem to be two interfaces which check and
return writeback errors on superblock to user space. fsync() should
work fine with overlayfs as it relies on underlying filesystem to
do the check and return error. For example, if ext4 is on upper filesystem,
then ext4_sync_file() calls file_check_and_advance_wb_err(file) on
upper file and returns error. So overlayfs does not have to do anything
special.

But with syncfs(), error check happens in vfs in syncfs() w.r.t
overlay_sb->s_wb_err. Given overlayfs is stacked filesystem, it
does not do actual writeback and all writeback errors are recorded
on underlying filesystem. So sb->s_wb_err is never updated hence
syncfs() does not work with overlay.

Jeff suggested that instead of trying to propagate errors to overlay
super block, why not simply check for errors against upper filesystem
super block. I implemented this idea.

Overlay file has "since" value which needs to be initialized at open
time. Overlay overrides VFS initialization and re-initializes
f->f_sb_err w.r.t upper super block. Later when
ovl_sb->errseq_check_advance() is called, f->f_sb_err is used as
since value to figure out if any error on upper sb has happened since
then.

Note, Right now this patch only deals with regular file and directories.
Yet to deal with special files like device inodes, socket, fifo etc.

Suggested-by: Jeff Layton 
Signed-off-by: Vivek Goyal 
---
 fs/overlayfs/file.c  |  1 +
 fs/overlayfs/overlayfs.h |  1 +
 fs/overlayfs/readdir.c   |  1 +
 fs/overlayfs/super.c | 23 +++
 fs/overlayfs/util.c  | 13 +
 5 files changed, 39 insertions(+)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..7b58a44dcb71 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -163,6 +163,7 @@ static int ovl_open(struct inode *inode, struct file *file)
return PTR_ERR(realfile);
 
file->private_data = realfile;
+   ovl_init_file_errseq(file);
 
return 0;
 }
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f8880aa2ba0e..47838abbfb3d 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -322,6 +322,7 @@ int ovl_check_metacopy_xattr(struct ovl_fs *ofs, struct 
dentry *dentry);
 bool ovl_is_metacopy_dentry(struct dentry *dentry);
 char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct dentry *dentry,
 int padding);
+void ovl_init_file_errseq(struct file *file);
 
 static inline bool ovl_is_impuredir(struct super_block *sb,
struct dentry *dentry)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 01620ebae1bd..0c48f1545483 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -960,6 +960,7 @@ static int ovl_dir_open(struct inode *inode, struct file 
*file)
od->is_real = ovl_dir_is_real(file->f_path.dentry);
od->is_upper = OVL_TYPE_UPPER(type);
file->private_data = od;
+   ovl_init_file_errseq(file);
 
return 0;
 }
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..d99867983722 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -390,6 +390,28 @@ static int ovl_remount(struct super_block *sb, int *flags, 
char *data)
return ret;
 }
 
+static int ovl_errseq_check_advance(struct super_block *sb, struct file *file)
+{
+   struct ovl_fs *ofs = sb->s_fs_info;
+   struct super_block *upper_sb;
+   int ret;
+
+   if (!ovl_upper_mnt(ofs))
+   return 0;
+
+   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
+
+   if (!errseq_check(_sb->s_wb_err, file->f_sb_err))
+   return 0;
+
+   /* Something changed, must use slow path */
+   spin_lock(>f_lock);
+   ret = errseq_check_and_advance(_sb->s_wb_err, >f_sb_err);
+   spin_unlock(>f_lock);
+
+   return ret;
+}
+
 static const struct super_operations ovl_super_operations = {
.alloc_inode= ovl_alloc_inode,
.free_inode = ovl_free_inode,
@@ -400,6 +422,7 @@ static const struct super_operations ovl_super_operations = 
{
.statfs = ovl_statfs,
.show_options   = ovl_show_options,
.remount_fs = ovl_remount,
+   .errseq_check_advance   = ovl_errseq_check_advance,
 };
 
 enum {
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 23f475627d07..a1742847f3a8 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -950,3 +950,16 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, struct 
dentry *dentry,
kfree(buf);
return ERR_PTR(res);
 }
+
+void ovl_init_file_errseq(struct file *file)
+{
+   struct super_block *sb = file_dentry(file)->d_sb;
+   struct ovl_fs *ofs = sb->s_fs_info;
+   struct super_block *upper_sb;
+
+   if (!ovl_upper_mnt(ofs))
+   return;
+
+   upper_sb = ovl_upper_mnt(ofs)->mnt_sb;
+   file->f_sb_err =