Re: cross-fs copy support

2018-10-02 Thread J. Bruce Fields
On Mon, Oct 01, 2018 at 01:51:09PM -0600, Andreas Dilger wrote:
> On Oct 1, 2018, at 9:49 AM, Eric Sandeen  wrote:
> > Yes, I would expect there to be problems with his modified kernel
> > for a filesystem that supports clone_file_range, because
> > vfs_copy_file_range() will clone if possible, and this should fail across
> > filesystems.
> > 
> > In general, though, I don't know for sure why we don't fall back to
> > do_splice_direct() across filesystems, although the filesystems that
> > implement their own ->copy_file_range ops may have their own,
> > further restrictions within their implementations.
> > 
> > This call /is/ documented in the manpage as only being valid for
> > files on the same filesystem, though:
> > http://man7.org/linux/man-pages/man2/copy_file_range.2.html
> 
> There was a patch to allow cross-mount copy for NFS, but it hasn't landed
> yet.

I thought Christoph Hellwig vetoed it partly because he thought NFS
server-to-server copy is too complicated.  Which perhaps it is, but I
suspect we'll do it anyway because the benefit seems obvious.

--b.


Re: [PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion

2018-01-19 Thread J. Bruce Fields
On Fri, Jan 19, 2018 at 09:36:34AM -0500, Jeff Layton wrote:
> Shrug...we have that problem with the spinlock in place too. The bottom
> line is that reads of this value are not serialized with the increment
> at all.

OK, so this wouldn't even be a new bug.

> I'm not 100% thrilled with this patch, but I think it's probably better
> not to add the i_lock all over the place, even as an interim step in
> cleaning this stuff up.

Makes sense to me.

I've got no comments on the rest of the series, except that I'm all for
it.

Thanks for persisting--it turned out to be more involved than I'd
imagined!

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 02/19] fs: don't take the i_lock in inode_inc_iversion

2018-01-18 Thread J. Bruce Fields
On Tue, Jan 09, 2018 at 09:10:42AM -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> The rationale for taking the i_lock when incrementing this value is
> lost in antiquity. The readers of the field don't take it (at least
> not universally), so my assumption is that it was only done here to
> serialize incrementors.
> 
> If that is indeed the case, then we can drop the i_lock from this
> codepath and treat it as a atomic64_t for the purposes of
> incrementing it. This allows us to use inode_inc_iversion without
> any danger of lock inversion.
> 
> Note that the read side is not fetched atomically with this change.
> The assumption here is that that is not a critical issue since the
> i_version is not fully synchronized with anything else anyway.

So I guess it's theoretically possible that e.g. if you read while it's
incrementing from 2^32-1 to 2^32 you could read 0, 1, or 2^32+1?

If so then you could see an i_version value reused and incorrectly
decide that a file hadn't changed.

But it's such a tiny case, and I think you convert this to atomic64_t
later anyway, so, whatever.

--b.

> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/iversion.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index d09cc3a08740..5ad9eaa3a9b0 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -104,12 +104,13 @@ inode_set_iversion_queried(struct inode *inode, u64 new)
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> - spin_lock(>i_lock);
> - inode->i_version++;
> - spin_unlock(>i_lock);
> + atomic64_t *ivp = (atomic64_t *)>i_version;
> +
> + atomic64_inc(ivp);
>   return true;
>  }
>  
> +
>  /**
>   * inode_inc_iversion - forcibly increment i_version
>   * @inode: inode that needs to be updated
> -- 
> 2.14.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 01/19] fs: new API for handling inode->i_version

2018-01-18 Thread J. Bruce Fields
On Tue, Jan 09, 2018 at 09:10:41AM -0500, Jeff Layton wrote:
> --- /dev/null
> +++ b/include/linux/iversion.h
> @@ -0,0 +1,236 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_IVERSION_H
> +#define _LINUX_IVERSION_H
> +
> +#include 
> +
> +/*
> + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> + * appear different to observers if there was a change to the inode's data or
> + * metadata since it was last queried.
> + *
> + * Observers see the i_version as a 64-bit number that never changes.

I don't understand that sentence.

> If it
> + * remains the same since it was last checked, then nothing has changed in 
> the
> + * inode. If it's different then something has changed. Observers cannot 
> infer
> + * anything about the nature or magnitude of the changes from the value, only
> + * that the inode has changed in some fashion.

As we've discussed before, there may be brief windows where the first
two statements aren't quite correct.  I think that would be worth a
mention if we can keep it concise.  Maybe add something like this?:

It may be impractical for filesystems to keep i_version updates
atomic with respect to the changes that cause them.  They
should, however, guarantee that i_version updates are never
visible before the changes that caused them.  Also, i_version
updates should never be delayed longer than it takes the
original change to reach disk.

Or maybe those details are best left to documentation on the relevant
parts of the api below (maybe inode_maybe_inc_iversion?).

I dunno if it's also worth mentioning that nfsd doesn't actually use the
raw i_version--it mixes it with ctime to prevent i_version reuse after
reboot.  Presumably that doesn't matter to IMA since it doesn't compare
i_version across reboots.

The documentation here is all very helpful, thanks.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

2017-12-18 Thread J. Bruce Fields
On Mon, Dec 18, 2017 at 12:22:20PM -0500, Jeff Layton wrote:
> On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > >  static inline bool
> > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > >  {
> > > - atomic64_t *ivp = (atomic64_t *)>i_version;
> > > + u64 cur, old, new;
> > >  
> > > - atomic64_inc(ivp);
> > > + cur = (u64)atomic64_read(>i_version);
> > > + for (;;) {
> > > + /* If flag is clear then we needn't do anything */
> > > + if (!force && !(cur & I_VERSION_QUERIED))
> > > + return false;
> > 
> > The fast path here misses any memory barrier. Thus it seems this query
> > could be in theory reordered before any store that happened to modify the
> > inode? Or maybe we could race and miss the fact that in fact this i_version
> > has already been queried? But maybe there's some higher level locking that
> > makes sure this is all a non-issue... But in that case it would deserve
> > some comment I guess.
> > 
> 
> There's no higher-level locking. Getting locking out of this codepath is
> a good thing IMO. The larger question here is whether we really care
> about ordering this with anything else.
> 
> The i_version, as implemented today, is not ordered with actual changes
> to the inode. We only take the i_lock today when modifying it, not when
> querying it. It's possible today that you could see the results of a
> change and then do a fetch of the i_version that doesn't show an
> increment vs. a previous change.
> 
> It'd be nice if this were atomic with the actual changes that it
> represents, but I think that would be prohibitively expensive. That may
> be something we need to address. I'm not sure we really want to do it as
> part of this patchset though.

I don't think we need full atomicity, but we *do* need the i_version
increment to be seen as ordered after the inode change.  That should be
relatively inexpensive, yes?

Otherwise an inode change could be overlooked indefinitely, because a
client could cache the old contents with the new i_version value and
never see the i_version change again as long as the inode isn't touched
again.

(If the client instead caches new data with the old inode version, there
may be an unnecessary cache invalidation next time the client queries
the change attribute.  That's a performance bug, but not really a
correctness problem, at least for clients depending only on
close-to-open semantics: they'll only depend on seeing the new change
attribute after the change has been flushed to disk.  The performance
problem should be mitigated by the race being very rare.)

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread J. Bruce Fields
On Fri, May 12, 2017 at 08:22:23AM +1000, NeilBrown wrote:
> On Thu, May 11 2017, J. Bruce Fields wrote:
> > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > +{
> > +   u64 chattr;
> > +
> > +   chattr = inode->i_ctime.tv_sec << 30;
> > +   chattr += inode->i_ctime.tv_nsec;
> > +   chattr += inode->i_version;
> > +   return chattr;
> 
> So if I chmod a file, all clients will need to flush the content from their 
> cache?
> Maybe they already do?  Maybe it is a boring corner case?

Yeah, that's the assumption, maybe it's wrong.  I can't recall
complaints about anyone bitten by that case.

> >  /*
> >   * Fill in the pre_op attr for the wcc data
> >   */
> > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_mtime = inode->i_mtime;
> > fhp->fh_pre_ctime = inode->i_ctime;
> > fhp->fh_pre_size  = inode->i_size;
> > -   fhp->fh_pre_change = inode->i_version;
> > +   fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > fhp->fh_pre_saved = true;
> > }
> >  }
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > printk("nfsd: inode locked twice during operation.\n");
> >  
> > err = fh_getattr(fhp, >fh_post_attr);
> > -   fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > +   fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > if (err) {
> > fhp->fh_post_saved = false;
> > /* Grab the ctime anyway - set_change_info might use it */
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 26780d53a6f9..a09532d4a383 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
> > *stat, struct inode *inode,
> > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > *p++ = 0;
> > } else if (IS_I_VERSION(inode)) {
> > -   p = xdr_encode_hyper(p, inode->i_version);
> > +   p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > } else {
> > *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > *p++ = cpu_to_be32(stat->ctime.tv_nsec);
> 
> It is *really* confusing to find that fh_post_change is only set in nfs3
> code, and only used in nfs4 code.

Yup.

> It is probably time to get a 'version' field in 'struct kstat'.

The pre/post_wcc code doesn't seem to be doing an explicit stat, I
wonder if that matters?

--b.

> That would allow this code to get a little cleaner.
> 
> (to me, this exercise is just a reminder that the NFSv4 change attribute
> is poorly designed ... so it just makes me grumpy).
> 
> NeilBrown
> 
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread J. Bruce Fields
On Fri, May 12, 2017 at 07:01:25AM -0400, Jeff Layton wrote:
> This looks reasonable to me (modulo Jan's comment about casting tv_sec
> to u64).
> 
> To be clear, I think this is mostly orthogonal to the changes that I was
> originally proposing, right? I think we can still benefit from only
> bumping and storing i_version values after they've been queried.

Definitely, yes.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread J. Bruce Fields
On Fri, May 12, 2017 at 10:27:54AM +0200, Jan Kara wrote:
> On Thu 11-05-17 14:59:43, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > > 1) Keep i_version as is, make clients also check for i_ctime.
> > > 
> > > That would be a protocol revision, which we'd definitely rather avoid.
> > > 
> > > But can't we accomplish the same by using something like
> > > 
> > >   ctime * (some constant) + i_version
> > > 
> > > ?
> > > 
> > > >Pro: No on-disk format changes.
> > > >Cons: After a crash, i_version can go backwards (but when file 
> > > > changes
> > > >i_version, i_ctime pair should be still different) or not, data can 
> > > > be
> > > >old or not.
> > > 
> > > This is probably good enough for NFS purposes: typically on an NFS
> > > filesystem, results of a read in the face of a concurrent write open are
> > > undefined.  And writers sync before close.
> > > 
> > > So after a crash with a dirty inode, we're in a situation where an NFS
> > > client still needs to resend some writes, sync, and close.  I'm OK with
> > > things being inconsistent during this window.
> > > 
> > > I do expect things to return to normal once that client's has resent its
> > > writes--hence the worry about actually resuing old values after boot
> > > (such as if i_version regresses on boot and then increments back to the
> > > same value after further writes).  Factoring in ctime fixes that.
> > 
> > So for now I'm thinking of just doing something like the following.
> > 
> > Only nfsd needs it for now, but it could be moved to a vfs helper for
> > statx, or for individual filesystems that want to do something
> > different.  (The NFSv4 client will want to use the server's change
> > attribute instead, I think.  And other filesystems might want to try
> > something more ambitious like Neil's proposal.)
> > 
> > --b.
> > 
> > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> > index 12feac6ee2fd..9636c9a60aba 100644
> > diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> > index f84fe6bf9aee..14f09f1ef605 100644
> > --- a/fs/nfsd/nfsfh.h
> > +++ b/fs/nfsd/nfsfh.h
> > @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_saved = false;
> >  }
> >  
> > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > +{
> > +   u64 chattr;
> > +
> > +   chattr = inode->i_ctime.tv_sec << 30;
> 
> Won't this overflow on 32-bit archs? tv_sec seems to be defined as long?
> Probably you need explicit (u64) cast... Otherwise I'm fine with this.

Whoops, yes.  Or just assign to chattr as a separate step.  I'll fix
that.

--b.

> > +   chattr += inode->i_ctime.tv_nsec;
> > +   chattr += inode->i_version;
> > +   return chattr;
> > +}
> > +
> >  /*
> >   * Fill in the pre_op attr for the wcc data
> >   */
> > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_mtime = inode->i_mtime;
> > fhp->fh_pre_ctime = inode->i_ctime;
> > fhp->fh_pre_size  = inode->i_size;
> > -   fhp->fh_pre_change = inode->i_version;
> > +   fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > fhp->fh_pre_saved = true;
> > }
> >  }
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > printk("nfsd: inode locked twice during operation.\n");
> >  
> > err = fh_getattr(fhp, >fh_post_attr);
> > -   fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > +   fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > if (err) {
> > fhp->fh_post_saved = false;
> > /* Grab the ctime anyway - set_change_info might use it */
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 26780d53a6f9..a09532d4a383 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
> > *stat, struct inode *inode,
> > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > *p++ = 0;
> > } else if (IS_I_VERSION(inode)) {
> > -   p = xdr_encode_hyper(p, inode->i_version);
> > +   p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > } else {
> > *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > *p++ = cpu_to_be32(stat->ctime.tv_nsec);
> -- 
> Jan Kara <j...@suse.com>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-11 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > 1) Keep i_version as is, make clients also check for i_ctime.
> 
> That would be a protocol revision, which we'd definitely rather avoid.
> 
> But can't we accomplish the same by using something like
> 
>   ctime * (some constant) + i_version
> 
> ?
> 
> >Pro: No on-disk format changes.
> >Cons: After a crash, i_version can go backwards (but when file changes
> >i_version, i_ctime pair should be still different) or not, data can be
> >old or not.
> 
> This is probably good enough for NFS purposes: typically on an NFS
> filesystem, results of a read in the face of a concurrent write open are
> undefined.  And writers sync before close.
> 
> So after a crash with a dirty inode, we're in a situation where an NFS
> client still needs to resend some writes, sync, and close.  I'm OK with
> things being inconsistent during this window.
> 
> I do expect things to return to normal once that client's has resent its
> writes--hence the worry about actually resuing old values after boot
> (such as if i_version regresses on boot and then increments back to the
> same value after further writes).  Factoring in ctime fixes that.

So for now I'm thinking of just doing something like the following.

Only nfsd needs it for now, but it could be moved to a vfs helper for
statx, or for individual filesystems that want to do something
different.  (The NFSv4 client will want to use the server's change
attribute instead, I think.  And other filesystems might want to try
something more ambitious like Neil's proposal.)

--b.

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 12feac6ee2fd..9636c9a60aba 100644
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index f84fe6bf9aee..14f09f1ef605 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
fhp->fh_pre_saved = false;
 }
 
+static inline u64 nfsd4_change_attribute(struct inode *inode)
+{
+   u64 chattr;
+
+   chattr = inode->i_ctime.tv_sec << 30;
+   chattr += inode->i_ctime.tv_nsec;
+   chattr += inode->i_version;
+   return chattr;
+}
+
 /*
  * Fill in the pre_op attr for the wcc data
  */
@@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
fhp->fh_pre_mtime = inode->i_mtime;
fhp->fh_pre_ctime = inode->i_ctime;
fhp->fh_pre_size  = inode->i_size;
-   fhp->fh_pre_change = inode->i_version;
+   fhp->fh_pre_change = nfsd4_change_attribute(inode);
fhp->fh_pre_saved = true;
}
 }
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
printk("nfsd: inode locked twice during operation.\n");
 
err = fh_getattr(fhp, >fh_post_attr);
-   fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
+   fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
if (err) {
fhp->fh_post_saved = false;
/* Grab the ctime anyway - set_change_info might use it */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 26780d53a6f9..a09532d4a383 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
*stat, struct inode *inode,
*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
*p++ = 0;
} else if (IS_I_VERSION(inode)) {
-   p = xdr_encode_hyper(p, inode->i_version);
+   p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
} else {
*p++ = cpu_to_be32(stat->ctime.tv_sec);
*p++ = cpu_to_be32(stat->ctime.tv_nsec);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> 1) Keep i_version as is, make clients also check for i_ctime.

That would be a protocol revision, which we'd definitely rather avoid.

But can't we accomplish the same by using something like

ctime * (some constant) + i_version

?

>Pro: No on-disk format changes.
>Cons: After a crash, i_version can go backwards (but when file changes
>i_version, i_ctime pair should be still different) or not, data can be
>old or not.

This is probably good enough for NFS purposes: typically on an NFS
filesystem, results of a read in the face of a concurrent write open are
undefined.  And writers sync before close.

So after a crash with a dirty inode, we're in a situation where an NFS
client still needs to resend some writes, sync, and close.  I'm OK with
things being inconsistent during this window.

I do expect things to return to normal once that client's has resent its
writes--hence the worry about actually resuing old values after boot
(such as if i_version regresses on boot and then increments back to the
same value after further writes).  Factoring in ctime fixes that.

> 2) Fsync when reporting i_version.
>Pro: No on-disk format changes, strong consistency of i_version and
> data.
>Cons: Difficult to implement for filesystems due to locking constrains.
>  High performance overhead or i_version reporting.

Sounds painful.

> 3) Some variant of crash counter.
>Pro: i_version cannot go backwards.
>Cons: Requires on-disk format changes. After a crash data can be old
>  (however i_version increased).

Also, some unnecessary invalidations.  Which maybe can be mostly avoided
by some variation of Neil's scheme.

It looks to me like option (1) is doable now and introduces no
regressions compared to the current situation.  (2) and (3) are more
copmlicated and involve some tradeoffs.

Also, we can implement (1) and switch to (2) or (3) later.  We'd want to
do it without reported i_versions decreasing on kernel upgrade, but
there are multiple ways of handling that.  (Worst case, just restrict
the change to newly created filesystems.)

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 11:43:32AM +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
> 
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to 
> >> > > > be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
> >> > > > shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
> >> > > > possible
> >> > > > i_version numbers we gave away but did not write to disk but 
> >> > > > still...).
> >> > > > Thoughts?
> >> > 
> >> > How hard is this for filesystems to support?  Do they need an on-disk
> >> > format change to keep track of the crash counter?  Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> > 
> >> 
> >> Yeah, I imagine we'd need a on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead.  So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
> 
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not ever
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

I was sort of ignoring the distinction between concatenate(A,B) and
A*m+B, but, sure, multiplying's probably better.

> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields.  Maybe there are enough bits.

i_version and the NFSv4 change attribute are 64 bits which gives us a
fair amount of flexibility.

> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb.  Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
> 
> If you want to change a file with an even i_version, you subtract
>   crash-count*large-number
> to the i_version, then set lsb.  This is written to stable storage before
> the change.
> 
> If a file has not been changed for a while, you can add
>   crash-count*large-number
> and clear lsb.
> 
> The lsb of the i_version would be for internal use only.  It would not
> be visible outside the filesystem.
> 
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before an changes, and the
> unknown is when does it make sense to switch to 'even'.

I'm not sure how to model the costs.  Something like that might work.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 10:41:37AM +1100, Dave Chinner wrote:
> On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were 
> > > > > > > reboots
> > > > > > >   between the two i_version checks.
> > > > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them?  What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > > 
> > > > > > 
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll 
> > > > > > mark
> > > > > > the inode dirty and I think that will end up with the new i_version 
> > > > > > at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > > 
> > > > > So you think the filesystem can provide the atomicity?  In more 
> > > > > detail:
> > > > > 
> > > > 
> > > > Sorry, I hit send too quickly. That should have read:
> > > > 
> > > > "Yeah, there could be atomicity issues there."
> > > > 
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > > 
> > > Just to complement what Dave said from ext4 side - similarly as with XFS
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until that you can see arbitrary combination of data & i_version after the
> > > crash. We do take care to keep data and metadata in sync only when there
> > > are security implications to that (like exposing uninitialized disk 
> > > blocks)
> > > and if not, we are as lazy as we can to improve performance...
> > > 
> > > 
> > 
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version getattr
> > codepath. It's not pretty, but I don't see a real alternative.
> 
> I think that's even more problematic. ->getattr currently runs
> completely unlocked for performance reasons - it's racy w.r.t. to
> ongoing modifications to begin with, so /nothing/ that is returned
> to userspace via stat/statx can be guaranteed to be "coherent".
> Linus will be very unhappy if you make his git workload (which is
> /very/ stat heavy) run slower by adding any sort of locking in this
> hot path.
> 
> Even if we did put an fsync() into ->getattr() (and dealt with all
> the locking issues that entails), by the time the statx syscall
> returns to userspace the i_version value may not match the
> data/metadata in the inode(*).  IOWs, by the time i_version gets
> to userspace, it is out of date and any use of it for data
> versioning from userspace is going to be prone to race conditions.

A slightly out-of-date i_version is fine, I think.  It's just the
reverse we want to avoid.  E.g., assuming an i_version-supporting
statux, if somebody could called statx then read, and got the new
i_version followed by the old data, that would cause problems.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > Because if above is acceptable we could make reported i_version to be a 
> > > > sum
> > > > of "superblock crash counter" and "inode i_version". We increment
> > > > "superblock crash counter" whenever we detect unclean filesystem 
> > > > shutdown.
> > > > That way after a crash we are guaranteed each inode will report new
> > > > i_version (the sum would probably have to look like "superblock crash
> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > i_version numbers we gave away but did not write to disk but still...).
> > > > Thoughts?
> > 
> > How hard is this for filesystems to support?  Do they need an on-disk
> > format change to keep track of the crash counter?  Maybe not, maybe the
> > high bits of the i_version counters are all they need.
> > 
> 
> Yeah, I imagine we'd need a on-disk change for this unless there's
> something already present that we could use in place of a crash counter.

We could consider using the current time instead.  So, put the current
time (or time of last boot, or this inode's ctime, or something) in the
high bits of the change attribute, and keep the low bits as a counter.

Then as long as we trust our clock, we're no longer at risk of reusing
an i_version value.

> > I guess we just want to have some back-of-the-envelope estimates of
> > maximum number of i_version increments possible between crashes and
> > maximum number of crashes possible over lifetime of a filesystem, to
> > decide how to split up the bits.
> > 
> > I wonder if we could get away with using the new crash counter only for
> > *new* values of the i_version?  After a crash, use the on disk i_version
> > as is, and put off using the new crash counter until the next time the
> > file's modified.
> > 
> 
> That sounds difficult to get right. Suppose I have an inode that has not
> been updated in a long time. Someone writes to it and then queries the
> i_version. How do I know whether there were crashes since the last time
> I updated it? Or am I misunderstanding what you're proposing here?

I believe Jan was suggesting that we keep the i_version as-is on disk
but combine with with the crash counter when it's queried.

I was suggesting instead that on write, when we bump the i_version, we
replace the simple increment by something that increments *and* sticks
the current crash counter (or maybe just a time) in the high bits.  And
that combination will be what goes in i_version and is written to disk.

That guarantees that we never reuse an old value when we increment.

It also avoids having to invalidate absolutely every cache, even of
completely static files.

Clients could still see the change attribute go backwards, though, if
they query a file with dirty data and the server crashes before it
writes out the new change attribute.

Also, they could fail to detect data that reverted after boot if they
cache data while the file's dirty on the server side, and the server
crash preserves the i_version update but not the new data.

Presumably both are possible already for ctime (depending on the
filesystem), and I'm not convinced they're a big problem.

In both cases the reading client is caching while somebody's still
writing, and as long as the writer comes back and finishes its job,
readers will thereafter see the right thing.  So I think it's adequate
for close-to-open.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Tue, Apr 04, 2017 at 10:34:14PM +1000, Dave Chinner wrote:
> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> > What filesystems can or cannot easily do obviously differs. Ext4 has a
> > recovery flag set in superblock on RW mount/remount and cleared on
> > umount/RO remount.
> 
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem sucessfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro.

I'd certainly rather not invalidate caches on *every* boot.

On the other hand, if the only cases involve the root filesystem, then
from the point of view of NFS, we probably don't care much.

> > This flag being set on mount would imply incrementing the crash
> > counter. It should be pretty easy for each filesystem to implement
> > such flag and the counter but I agree it requires an on-disk
> > format change.
> 
> Yup, anything we want that is persistent and consistent across
> filesystems will need on-disk format changes. Hence we need a solid
> specification first, not to mention tests to validate correct
> behaviour across all filesystems in xfstests...

For xfstests we'll need a way to get i_version (patch it into statx, I
guess?).  Ideally we'd have a way to test behavior across crashes too,
any advice there?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > Hum, so are we fine if i_version just changes (increases) for all inodes
> > after a server crash? If I understand its use right, it would mean
> > invalidation of all client's caches but that is not such a big deal given
> > how frequent server crashes should be, right?

Even if it's rare, it may be really painful when all your clients are
forced to throw out and repopulate their caches after a crash.  But,
yes, maybe we can live with it.

> > Because if above is acceptable we could make reported i_version to be a sum
> > of "superblock crash counter" and "inode i_version". We increment
> > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > That way after a crash we are guaranteed each inode will report new
> > i_version (the sum would probably have to look like "superblock crash
> > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > i_version numbers we gave away but did not write to disk but still...).
> > Thoughts?

How hard is this for filesystems to support?  Do they need an on-disk
format change to keep track of the crash counter?  Maybe not, maybe the
high bits of the i_version counters are all they need.

> That does sound like a good idea. This is a 64 bit value, so we should
> be able to carve out some upper bits for a crash counter without risking
> wrapping.
> 
> The other constraint here is that we'd like any later version of the
> counter to be larger than any earlier value that was handed out. I think
> this idea would still satisfy that.

I guess we just want to have some back-of-the-envelope estimates of
maximum number of i_version increments possible between crashes and
maximum number of crashes possible over lifetime of a filesystem, to
decide how to split up the bits.

I wonder if we could get away with using the new crash counter only for
*new* values of the i_version?  After a crash, use the on disk i_version
as is, and put off using the new crash counter until the next time the
file's modified.

That would still eliminate the risk of accidental reuse of an old
i_version value.  It still leaves some cases where the client could fail
to notice an update indefinitely.  All these cases I think have to
assume that a writer made some changes that it failed to ever sync, so
as long as we care only about close-to-open semantics perhaps those
cases don't matter.

I wonder if repeated crashes can lead to any odd corner cases.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > >   between the two i_version checks.
> > > > - I don't know how realistic this is--we may need to figure out
> > > >   if there's a weaker guarantee that's still useful.  Do
> > > >   filesystems actually make ctime/mtime/i_version changes
> > > >   atomically with the changes that caused them?  What if a
> > > >   change attribute is exposed to an NFS client but doesn't make
> > > >   it to disk, and then that value is reused after reboot?
> > > > 
> > > 
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> > 
> > So you think the filesystem can provide the atomicity?  In more detail:
> > 
> 
> Sorry, I hit send too quickly. That should have read:
> 
> "Yeah, there could be atomicity issues there."
> 
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?

No idea.  Anyway, I'd like to figure out some reasonable requirement
that we can document.

> 
> > - if I write to a file, a simultaneous reader should see either
> >   (old data, old i_version) or (new data, new i_version), not a
> >   combination of the two.
> > - ditto for metadata modifications.
> > - the same should be true if there's a crash.
> > 
> > (If that's not possible, then I think we could live with a brief window
> > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > 
> > > That said, I suppose it is possible for us to bump the counter, hand
> > > that new counter value out to a NFS client and then the box crashes
> > > before it makes it to the journal.
> > > 
> > > Not sure how big a problem that really is.
> > 
> > The other case I was wondering about may have been unclear.  Represent
> > the state of a file by a (data, i_version) pair.  Say:
> > 
> > - file is modified from (F, V) to (F', V+1).
> > - client reads and caches (F', V+1).
> > - server crashes before writeback, so disk still has (F, V).
> > - after restart, someone else modifies file to (F'', V+1).
> > - original client revalidates its cache, sees V+1, concludes
> >   file data is still F'.
> > 
> > This may not cause a real problem for clients depending only on
> > traditional NFS close-to-open semantics.
> > 
> > 
> 
> No, I think that is a legitimate problem.
> 
> That said, after F'', the mtime would almost certainly be different
> from the time after F', and that would likely be enough to prevent
> confusion in NFS clients.

Oh, good point.  So, may be worth saying that anyone wanting to make
sense of these across reboot should compare times as well (maybe that
should be in nfs rfc's too).  I think that should be ctime not mtime,
though?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful.  Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them?  What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
> > 
> 
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

So you think the filesystem can provide the atomicity?  In more detail:

- if I write to a file, a simultaneous reader should see either
  (old data, old i_version) or (new data, new i_version), not a
  combination of the two.
- ditto for metadata modifications.
- the same should be true if there's a crash.

(If that's not possible, then I think we could live with a brief window
of (new data, old i_version) as long as it doesn't persist beyond sync?)

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.
> 
> Not sure how big a problem that really is.

The other case I was wondering about may have been unclear.  Represent
the state of a file by a (data, i_version) pair.  Say:

- file is modified from (F, V) to (F', V+1).
- client reads and caches (F', V+1).
- server crashes before writeback, so disk still has (F, V).
- after restart, someone else modifies file to (F'', V+1).
- original client revalidates its cache, sees V+1, concludes
  file data is still F'.

This may not cause a real problem for clients depending only on
traditional NFS close-to-open semantics.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:37:04PM -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > >   - NFS doesn't actually require that it increases, but I think it
> > > should.  I assume 64 bits means we don't need a discussion of
> > > wraparound.
> > 
> > I thought NFS spec required that you be able to recognize old change
> > attributes so that they can be discarded. I could be wrong here though.
> > I'd have to go back and look through the spec to be sure.
> 
> https://tools.ietf.org/html/rfc7862#section-10

So, I'm suggesting we implement this one:

NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value
  MUST monotonically increase for every atomic change to the file
  attributes, data, or directory contents.

It may be a slight lie--after your patches we wouldn't actually increase
"for every atomic change".  I think that's OK.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - NFS doesn't actually require that it increases, but I think it
> >   should.  I assume 64 bits means we don't need a discussion of
> >   wraparound.
> 
> I thought NFS spec required that you be able to recognize old change
> attributes so that they can be discarded. I could be wrong here though.
> I'd have to go back and look through the spec to be sure.

https://tools.ietf.org/html/rfc7862#section-10

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > To me, the interesting question is whether this allows us to turn on
> > i_version updates by default on xfs and ext4.
> 
> XFS v5 file systems have it on by default.

Great, thanks.

> Although we'll still need to agree on the exact semantics of i_version
> before it's going to be useful.

Once it's figured out maybe we should write it up for a manpage that
could be used if statx starts exposing it to userspace.

A first attempt:

- It's a u64.

- It works for regular files and directories.  (What about symlinks or
  other special types?)

- It changes between two checks if and only if there were intervening
  data or metadata changes.  The change will always be an increase, but
  the amount of the increase is meaningless.
- NFS doesn't actually require that it increases, but I think it
  should.  I assume 64 bits means we don't need a discussion of
  wraparound.
- AFS wants an actual counter: if you get i_version X, then
  write twice, then get i_version X+2, you're allowed to assume
  your writes were the only modifications.  Let's ignore this
  for now.  In the future if someone explains how to count
  operations, then we can extend the interface to tell the
  caller it can get those extra semantics.

- It's durable; the above comparison still works if there were reboots
  between the two i_version checks.
- I don't know how realistic this is--we may need to figure out
  if there's a weaker guarantee that's still useful.  Do
  filesystems actually make ctime/mtime/i_version changes
  atomically with the changes that caused them?  What if a
  change attribute is exposed to an NFS client but doesn't make
  it to disk, and then that value is reused after reboot?

Am I missing any issues?

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-20 Thread J. Bruce Fields
On Thu, Dec 22, 2016 at 09:42:04AM -0500, Jeff Layton wrote:
> On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote:
> > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > > 
> > > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > > this, these filesystems must log the inode to disk whenever the
> > > i_version counter changes. That has a non-zero performance impact,
> > > especially on write-heavy workloads, because we end up dirtying the
> > > inode metadata on every write, not just when the times change. [1]
> > 
> > Do you have numbers to justify these changes?
> 
> I have numbers. As to whether they justify the changes, I'm not sure.
> This helps a lot on a (admittedly nonsensical) 1-byte write workload. On
> XFS, with this fio jobfile:

To me, the interesting question is whether this allows us to turn on
i_version updates by default on xfs and ext4.

When Josef looked at doing that previously he withdrew the patch due to
performance regressions.  I think the most useful thread started here:


http://lkml.kernel.org/r/1337092396-3272-1-git-send-email-jo...@redhat.com

Skimming quickly  I think the regression was also in the small-write
case.  So apparently that was thought to reveal a real problem?

So if you've mostly eliminated that regression, then that's good
motivation for your patches.  (Though I think in addition to comparing
the patched and unpatched i_version case, we need to compare to the
unpatched not-i_version case.  I'm not clear whether you did that.)

--b.

> 
> 8<--
> [global]
> direct=0
> size=2g
> filesize=512m
> bsrange=1-1
> timeout=60
> numjobs=1
> directory=/mnt/scratch
> 
> [f1]
> filename=randwrite
> rw=randwrite
> 8<--
> 
> Unpatched kernel:
>   WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, 
> mint=6msec, maxt=6msec
> 
> Patched kernel:
>   WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, 
> mint=6msec, maxt=6msec
> 
> So quite a difference there and it's pretty consistent across runs. If I
> change the jobfile to have "direct=1" and "bsrange=4k-4k", then any
> variation between the two doesn't seem to be significant (numbers vary
> as much between runs on the same kernels and are roughly the same).
> 
> Playing with buffered I/O sizes between 1 byte and 4k shows that as the
> I/O sizes get larger, this makes less difference (which is what I'd
> expect).
> 
> Previous testing with ext4 shows roughly the same results. btrfs shows
> some benefit here but significantly less than with ext4 or xfs. Not sure
> why that is yet -- maybe CoW effects?
> 
> That said, I don't have a great test rig for this. I'm using VMs with a
> dedicated LVM volume that's on a random SSD I had laying around. It
> could use testing on a wider set of configurations and workloads.
> 
> I was also hoping that others may have workloads that they think might
> be (postively or negatively) affected by these changes. If you can think
> of any in particular, then I'm interested to hear about them.
> 
> -- 
> Jeff Layton 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-08 Thread J. Bruce Fields
On Fri, Mar 03, 2017 at 07:53:57PM -0500, Jeff Layton wrote:
> On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote:
> > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > > counter, by exploiting the fact that we don't need to increment it
> > > if no one is looking at it. We can also clean up the code to prepare
> > > to eventually expose this value via statx().
> > > 
> > > The inode->i_version field is supposed to be a value that changes
> > > whenever there is any data or metadata change to the inode. Some
> > > filesystems use it internally to detect directory changes during
> > > readdir. knfsd will use it if the filesystem has MS_I_VERSION
> > > set. IMA will also use it (though it's not clear to me that that
> > > works 100% -- no check for MS_I_VERSION there).
> > 
> > I'm a little confused about the internal uses for directories.  Is it
> > necessarily kept on disk?
> 
> No, they aren't necessarily stored on disk, and in fact they aren't on
> most (maybe all?) of those filesystems. It's just a convenient place to
> store a dir change value that is subsequently checked in readdir
> operations.
> 
> That's part of the "fun" of untangling this mess. ;)
> 
> > In cases it's not, then there's not the same
> > performance issue, right? 
> 
> Right, I don't think it's really a big deal for most of those and I'm
> not terribly concerned about the i_version counter on directory change
> operations. An extra atomic op on a directory change seems unlikely to
> hurt anything.
> 
> The main purpose of this is to improve the situation with small writes.
> This should also help pave the way for fs' like NFS and Ceph that
> implement a version counter but may not necessarily bump it on every
> change.
> 
> I think once we have things more consistent, we'll be able to consider
> exposing the i_version counter to userland via statx.
> 
> > Is there any risk these patches make
> > performance slightly worse in that case?
> > 
> 
> Maybe, but I think that risk is pretty low. Note that I haven't measured
> that specifically here, so I could be completely wrong.
> 
> If it is a problem, we could consider unioning this thing with a non-
> atomic field for the directory change cases, but that would add some
> complexity and I'm not sure it's worth it. I'd want to measure it first.

Makes sense to me.  I agree that it's unlikely to be a problem, just
wanted to make sure I understood

Anyway, I'm a great fan of the basic idea, I hope we can get this done.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-03 Thread J. Bruce Fields
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> tl;dr: I think we can greatly reduce the cost of the inode->i_version
> counter, by exploiting the fact that we don't need to increment it
> if no one is looking at it. We can also clean up the code to prepare
> to eventually expose this value via statx().
> 
> The inode->i_version field is supposed to be a value that changes
> whenever there is any data or metadata change to the inode. Some
> filesystems use it internally to detect directory changes during
> readdir. knfsd will use it if the filesystem has MS_I_VERSION
> set. IMA will also use it (though it's not clear to me that that
> works 100% -- no check for MS_I_VERSION there).

I'm a little confused about the internal uses for directories.  Is it
necessarily kept on disk?  In cases it's not, then there's not the same
performance issue, right?  Is there any risk these patches make
performance slightly worse in that case?

> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes.

On those filesystems that's done for both files and directories, right?

--b.

> That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]
> 
> It turns out though that none of these users of i_version require that
> i_version change on every change to the file. The only real requirement
> is that it be different if _something_ changed since the last time we
> queried for it. [2]
> 
> So, if we simply keep track of when something queries the value, we
> can avoid bumping the counter and that metadata update.
> 
> This patchset implements this:
> 
> It starts with some small cleanup patches to just remove any mention of
> the i_version counter in filesystems that don't actually use it.
> 
> Then, we add a new set of static inlines for managing the counter. The
> initial version should work identically to what we have now. Then, all
> of the remaining filesystems that use i_version are converted to the new
> inlines.
> 
> Once that's in place, we switch to a new implementation that allows us
> to track readers of i_version counter, and only bump it when it's
> necessary or convenient (i.e. we're going to disk anyway).
> 
> The final patch switches from a scheme that uses the i_lock to serialize
> the counter updates during write to an atomic64_t. That's a wash
> performance-wise in my testing, but I like not having to take the i_lock
> down where it could end up nested inside other locks.
> 
> With this, we reduce inode metadata updates across all 3 filesystems
> down to roughly the frequency of the timestamp granularity, particularly
> when it's not being queried (the vastly common case).
> 
> The pessimal workload here is 1 byte writes, and it helps that
> significantly. Of course, that's not a real-world workload.
> 
> A tiobench-example.fio workload also shows some modest performance
> gains, and I've gotten mails from the kernel test robot that show some
> significant performance gains on some microbenchmarks (case-msync-mt in
> the vm-scalability testsuite to be specific).
> 
> I'm happy to run other workloads if anyone can suggest them.
> 
> At this point, the patchset works and does what it's expected to do in
> my own testing. It seems like it's at least a modest performance win
> across all 3 major disk-based filesystems. It may also encourage others
> to implement i_version as well since it reduces that cost.
> 
> Is this an avenue that's worthwhile to pursue?
> 
> Note that I think we may have other changes coming in the future that
> will make this sort of cleanup necessary anyway. I'd like to plug in the
> Ceph change attribute here eventually, and that will require something
> like this anyway.
> 
> Thoughts, comments and suggestions are welcome...
> 
> ---
> 
> [1]: On ext4 it must be turned on with the i_version mount option,
>  mostly due to fears of incurring this impact, AFAICT.
> 
> [2]: NFS also recommends that it appear to increase in value over time, so
>  that clients can discard metadata updates that are older than ones
>  they've already seen.
> 
> Jeff Layton (30):
>   lustre: don't set f_version in ll_readdir
>   ecryptfs: remove unnecessary i_version bump
>   ceph: remove the bump of i_version
>   f2fs: don't bother setting i_version
>   hpfs: don't bother with the i_version counter
>   jfs: remove initialization of i_version counter
>   nilfs2: remove inode->i_version initialization
>   orangefs: remove initialization of i_version
>   reiserfs: remove unneeded i_version bump
>   ntfs: remove i_version handling
>   fs: new API for handling i_version
>   fat: convert to new i_version API
>   affs: convert to new i_version API
>   afs: convert to new i_version API
>   btrfs: convert to new i_version API
>   exofs: switch to new i_version API
>   

Re: [RFC PATCH v1 11/30] fs: new API for handling i_version

2017-03-03 Thread J. Bruce Fields
The patch ordering is a little annoying as I'd like to be able to be
able to verify the implementation at the same time these new interfaces
are added, but, I don't know, I don't have a better idea.

Anyway, various nits:

On Wed, Dec 21, 2016 at 12:03:28PM -0500, Jeff Layton wrote:
> We already have inode_inc_iversion. Add inode_set_iversion,
> inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.
> 
> These should encompass how i_version is currently accessed and
> manipulated so that we can later change the implementation.
> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/fs.h | 109 
> ++---
>  1 file changed, 104 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2ba074328894..075c915fe2b1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1962,18 +1962,117 @@ static inline void inode_dec_link_count(struct inode 
> *inode)
>  }
>  
>  /**
> + * inode_set_iversion - set i_version to a particular value
> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set the inode's i_version. Should generally be called when initializing
> + * a new inode.
> + */
> +static inline void
> +inode_set_iversion(struct inode *inode, const u64 new)
> +{
> + inode->i_version = new;
> +}
> +
> +/**
> + * inode_inc_iversion_locked - increment i_version while protected
> + * @inode: inode to be updated
> + *
> + * Increment the i_version field in the inode. This version is usable
> + * when there is some other sort of lock in play that would prevent
> + * concurrent accessors.
> + */
> +static inline void
> +inode_inc_iversion_locked(struct inode *inode)
> +{
> + inode->i_version++;
> +}
> +
> +/**
> + * inode_set_iversion_read - set i_version to a particular value and flag
> + *set flag to indicate that it has been viewed

s/flag set flag/set flag/.

> + * @inode: inode to set
> + * @new: new i_version value to set
> + *
> + * Set the inode's i_version, and flag the inode as if it has been viewed.
> + * Should generally be called when initializinga new inode in memory from
> + * from disk.

s/from from/from/.

> + */
> +static inline void
> +inode_set_iversion_read(struct inode *inode, const u64 new)
> +{
> + inode_set_iversion(inode, new);
> +}
> +
> +/**
>   * inode_inc_iversion - increments i_version
>   * @inode: inode that need to be updated
>   *
>   * Every time the inode is modified, the i_version field will be incremented.
> - * The filesystem has to be mounted with i_version flag
> + * The filesystem has to be mounted with MS_I_VERSION flag.

I'm not sure why that comment's there.  Filesystems can choose to
increment i_version without requiring the mount option if they want,
can't they?

> + */
> +static inline bool

Document what the return value means?

> +inode_inc_iversion(struct inode *inode)
> +{
> + spin_lock(>i_lock);
> + inode_inc_iversion_locked(inode);
> + spin_unlock(>i_lock);
> + return true;
> +}
> +
> +/**
> + * inode_get_iversion_raw - read i_version without "perturbing" it
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter for an inode without registering it as a
> + * read. Should typically be used by local filesystems that need to store an
> + * i_version on disk.
> + */
> +static inline u64
> +inode_get_iversion_raw(const struct inode *inode)
> +{
> + return inode->i_version;
> +}
> +
> +/**
> + * inode_get_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison.

I'm not sure what "for later comparison" means.  This is the "read"
operation that actually flags the i_version as read, which you'd use
(for example) to implement NFS getattr?  I wonder if there's a better
way to say that.

> + */
> +static inline u64
> +inode_get_iversion(const struct inode *inode)
> +{
> + return inode_get_iversion_raw(inode);
> +}
> +
> +/**
> + * inode_cmp_iversion - check whether the i_version counter has changed
> + * @inode: inode to check
> + * @old: old value to check against its i_version
> + *
> + * Compare an i_version counter with a previous one. Returns 0 if they are
> + * the same or non-zero if they are different.

Does this flag the i_version as read?  What's it for?  (OK, I guess I'll
find out after a few more patches...).

--b.

>   */
> +static inline s64
> +inode_cmp_iversion(const struct inode *inode, const u64 old)
> +{
> + return (s64)inode->i_version - (s64)old;
> +}
>  
> -static inline void inode_inc_iversion(struct inode *inode)
> +/**
> + * inode_iversion_need_inc - is the i_version in need of being incremented?
> + * @inode: inode to check
> + *
> + * Returns whether the inode->i_version counter needs incrementing on the 
> 

Re: [PATCH] exportfs: be careful to only return expected errors.

2016-10-06 Thread J. Bruce Fields
On Thu, Oct 06, 2016 at 05:39:24PM +1100, NeilBrown wrote:
> On Thu, Aug 04 2016, NeilBrown wrote:
> 
> >
> >
> > When nfsd calls fh_to_dentry, it expect ESTALE or ENOMEM as errors.
> > In particular it can be tempting to return ENOENT, but this is not
> > handled well by nfsd.
> >
> > Rather than requiring strict adherence to error code code filesystems,
> > treat all unexpected error codes the same as ESTALE.  This is safest.
> >
> > Signed-off-by: NeilBrown 
> > ---
> >
> > I didn't add a dprintk for unexpected error messages, partly
> > because dprintk isn't usable in exportfs.  I could have used pr_debug()
> > but I really didn't see much value.
> >
> > This has been tested together with the btrfs change, and it restores
> > correct functionality.
> >
> > Thanks,
> > NeilBrown
> 
> 
> Hi Bruce,
>  I notice that this patch isn't in -next, but the btrfs change which
>  motivated it is.
>  That means, unless there is some other work around somewhere, btrfs
>  might start having problems with nfs export.

Whoops, I wonder what happened.  Looking back  Oh, I think I set it
aside pending a response from Christoph, but I don't think that's really
necessary.

>  Can we get this patch into 4.9-rc??
> 
>  Or has someone fixed it a different way?

No, I'll just apply.

I need to send a pull request soon, I just have to make up my mind on
COPY first.

--b.

> Thanks,
> NeilBrown
> 
> 
> >
> >  fs/exportfs/expfs.c  | 10 ++
> >  include/linux/exportfs.h | 13 +++--
> >  2 files changed, 13 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> > index 207ba8d627ca..a4b531be9168 100644
> > --- a/fs/exportfs/expfs.c
> > +++ b/fs/exportfs/expfs.c
> > @@ -428,10 +428,10 @@ struct dentry *exportfs_decode_fh(struct vfsmount 
> > *mnt, struct fid *fid,
> > if (!nop || !nop->fh_to_dentry)
> > return ERR_PTR(-ESTALE);
> > result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
> > -   if (!result)
> > -   result = ERR_PTR(-ESTALE);
> > -   if (IS_ERR(result))
> > -   return result;
> > +   if (PTR_ERR(result) == -ENOMEM)
> > +   return ERR_CAST(result);
> > +   if (IS_ERR_OR_NULL(result))
> > +   return ERR_PTR(-ESTALE);
> >  
> > if (d_is_dir(result)) {
> > /*
> > @@ -541,6 +541,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, 
> > struct fid *fid,
> >  
> >   err_result:
> > dput(result);
> > +   if (err != -ENOMEM)
> > +   err = -ESTALE;
> > return ERR_PTR(err);
> >  }
> >  EXPORT_SYMBOL_GPL(exportfs_decode_fh);
> > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> > index b03c0625fa6e..5ab958cdc50b 100644
> > --- a/include/linux/exportfs.h
> > +++ b/include/linux/exportfs.h
> > @@ -157,12 +157,13 @@ struct fid {
> >   *@fh_to_dentry is given a  super_block (@sb) and a file handle
> >   *fragment (@fh, @fh_len). It should return a  dentry which 
> > refers
> >   *to the same file that the file handle fragment refers to.  If it 
> > cannot,
> > - *it should return a %NULL pointer if the file was found but no 
> > acceptable
> > - * were available, or an %ERR_PTR error code indicating why it
> > - *couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry 
> > can be
> > - *returned including, if necessary, a new dentry created with 
> > d_alloc_root.
> > - *The caller can then find any other extant dentries by following the
> > - *d_alias links.
> > + *it should return a %NULL pointer if the file cannot be found, or an
> > + *%ERR_PTR error code of %ENOMEM if a memory allocation failure 
> > occurred.
> > + *Any other error code is treated like %NULL, and will cause an 
> > %ESTALE error
> > + *for callers of exportfs_decode_fh().
> > + *Any suitable dentry can be returned including, if necessary, a new 
> > dentry
> > + *created with d_alloc_root.  The caller can then find any other extant
> > + *dentries by following the d_alias links.
> >   *
> >   * fh_to_parent:
> >   *Same as @fh_to_dentry, except that it returns a pointer to the parent
> > -- 
> > 2.9.2


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [lkp] [nfsd] b44061d0b9: BUG: Dentry ffff880027d7c540{i=1846f,n=0a} still in use (1) [unmount of btrfs vda]

2016-08-10 Thread J. Bruce Fields
On Wed, Aug 10, 2016 at 11:56:15AM -0700, Linus Torvalds wrote:
> On Wed, Aug 10, 2016 at 11:46 AM, Josef Bacik  wrote:
> >
> > So my naive fix would be something like this
> 
> Bruce? Josef's patch looks ObviouslyCorrect(tm) to me now that I look
> at it - all the other callers of fh_compose() also seem to just drop
> the dentry unconditionally, knowing that fh_compose() took a ref to it
> if needed.
> 
> In fact, the only thing I'd do differently would be to not even put
> the comment there at all, since this call site isn't any different
> from any of the others. If anything, it could go on fh_compose() if we
> want to add comments.

Yep, makes sense to me too.

OK with me if you want to take it, otherwise I'll run it through my
usual tests and send you a pull request probably today or tomorrow.

Thanks, Josef!  (And kernel test robot folks--the speed with which
they're catching stuff, and bisecting down to individual commits, is
really amazing to me.)

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] exportfs: be careful to only return expected errors.

2016-08-04 Thread J. Bruce Fields
On Thu, Aug 04, 2016 at 05:47:19AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote:
> > 
> > 
> > When nfsd calls fh_to_dentry, it expect ESTALE or ENOMEM as errors.
> > In particular it can be tempting to return ENOENT, but this is not
> > handled well by nfsd.
> > 
> > Rather than requiring strict adherence to error code code filesystems,
> > treat all unexpected error codes the same as ESTALE.  This is safest.
> > 
> > Signed-off-by: NeilBrown 
> > ---
> > 
> > I didn't add a dprintk for unexpected error messages, partly
> > because dprintk isn't usable in exportfs.  I could have used pr_debug()
> > but I really didn't see much value.
> > 
> > This has been tested together with the btrfs change, and it restores
> > correct functionality.
> 
> I don't really like all this magic which is partially historic.  I think
> we should instead allow the fs to return any error from the export
> operations,

What errors other than ENOENT and ENOMEM do you think are reasonable?

ENOENT is going to screw up both nfsd and open_by_fhandle_at, which are
the only callers.

> and forbid returning NULL entirely.  Then the actualy caller
> (nfsd) can sort out which errors it wants to send over the wire.

The needs of those two callers don't look very different to me, and I
can't recall seeing a correct use of an error other than ESTALE or
ENOMEM, so I've been thinking of it more of a question of how to best
handle a misbehaving filesystem.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] Btrfs: be more precise on errors when getting an inode from disk

2016-07-22 Thread J. Bruce Fields
On Fri, Jul 22, 2016 at 12:40:26PM +1000, NeilBrown wrote:
> On Fri, Jul 22 2016, J. Bruce Fields wrote:
> 
> > On Fri, Jul 22, 2016 at 11:08:17AM +1000, NeilBrown wrote:
> >> On Fri, Jun 10 2016, fdman...@kernel.org wrote:
> >> 
> >> > From: Filipe Manana <fdman...@suse.com>
> >> >
> >> > When we attempt to read an inode from disk, we end up always returning an
> >> > -ESTALE error to the caller regardless of the actual failure reason, 
> >> > which
> >> > can be an out of memory problem (when allocating a path), some error 
> >> > found
> >> > when reading from the fs/subvolume btree (like a genuine IO error) or the
> >> > inode does not exists. So lets start returning the real error code to the
> >> > callers so that they don't treat all -ESTALE errors as meaning that the
> >> > inode does not exists (such as during orphan cleanup). This will also be
> >> > needed for a subsequent patch in the same series dealing with a special
> >> > fsync case.
> >> >
> >> > Signed-off-by: Filipe Manana <fdman...@suse.com>
> >> 
> >> SNIP
> >> 
> >> > @@ -5594,7 +5602,8 @@ struct inode *btrfs_iget(struct super_block *s, 
> >> > struct btrfs_key *location,
> >> >  } else {
> >> >  unlock_new_inode(inode);
> >> >  iput(inode);
> >> > -inode = ERR_PTR(-ESTALE);
> >> > +ASSERT(ret < 0);
> >> > +inode = ERR_PTR(ret < 0 ? ret : -ESTALE);
> >> >  }
> >> 
> >> Just a heads-up.  This change breaks NFS :-(
> >> 
> >> The change in error code percolates up the call chain:
> >> 
> >>  nfs4_pufh->fh_verify->nfsd_set_fh_dentry->exportfs_decode_fh
> >> ->btrfs_fh_to_dentry->ntrfs_get_dentry->btrfs_iget
> >> 
> >> and nfsd returns NFS4ERR_NOENT to the client instead of NFS4ERR_STALE,
> >> and the client doesn't handle that quite the same way.
> >> 
> >> This doesn't mean that the change is wrong, but it could mean we need to
> >> fix something else in the path to sanitize the error code.
> >> 
> >> nfsd_set_fh_dentry already has
> >> 
> >>error = nfserr_stale;
> >>if (PTR_ERR(exp) == -ENOENT)
> >>return error;
> >> 
> >>if (IS_ERR(exp))
> >>return nfserrno(PTR_ERR(exp));
> >> 
> >> for a different error case, so duplicating that would work, but I doubt
> >> it is best.  At the very least we should check for valid errors, not
> >> specific invalid ones.
> >> 
> >> Bruce: do you have an opinion where we should make sure that PUTFH (and
> >> various other requests) returns a valid error code?
> >
> > Uh, I guess not.  Maybe exportfs_decode_fh?
> >
> > Though my kneejerk reaction is to be cranky and wonder why btrfs
> > suddenly needs a different convention for decode_fh().
> >
> 
> I can certainly agree with that perspective, though it would be
> appropriate in that case to make sure we document the requirements for
> fh_to_dentry (the current spelling for 'decode_fh').  So I went looking
> for documentation and found:
> 
>  * fh_to_dentry:
>  *@fh_to_dentry is given a  super_block (@sb) and a file handle
>  *fragment (@fh, @fh_len). It should return a  dentry which refers
>  *to the same file that the file handle fragment refers to.  If it cannot,
>  *it should return a %NULL pointer if the file was found but no acceptable
>  * were available, or an %ERR_PTR error code indicating why it
>  *couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
>  *returned including, if necessary, a new dentry created with 
> d_alloc_root.
>  *The caller can then find any other extant dentries by following the
>  *d_alias links.
>  *
> 
> So the new btrfs code is actually conformant!!
> That documentation dates back to 2002 when I wrote it
> And it looks like ENOENT wasn't handled correctly then :-(
> 
> I suspect anything that isn't ENOMEM should be converted to ESTALE.
> ENOMEM causes the client to be asked to retry the request later.
> 
> Does this look reasonable to you?
> (Adding Christof as he as contributed a lot to exportfs)
> 
> If there is agreement I'll test and post a proper patch.

I can live with it.  It bothers me that we're losing pot

Re: [PATCH 1/2] Btrfs: be more precise on errors when getting an inode from disk

2016-07-21 Thread J. Bruce Fields
On Fri, Jul 22, 2016 at 11:08:17AM +1000, NeilBrown wrote:
> On Fri, Jun 10 2016, fdman...@kernel.org wrote:
> 
> > From: Filipe Manana 
> >
> > When we attempt to read an inode from disk, we end up always returning an
> > -ESTALE error to the caller regardless of the actual failure reason, which
> > can be an out of memory problem (when allocating a path), some error found
> > when reading from the fs/subvolume btree (like a genuine IO error) or the
> > inode does not exists. So lets start returning the real error code to the
> > callers so that they don't treat all -ESTALE errors as meaning that the
> > inode does not exists (such as during orphan cleanup). This will also be
> > needed for a subsequent patch in the same series dealing with a special
> > fsync case.
> >
> > Signed-off-by: Filipe Manana 
> 
> SNIP
> 
> > @@ -5594,7 +5602,8 @@ struct inode *btrfs_iget(struct super_block *s, 
> > struct btrfs_key *location,
> > } else {
> > unlock_new_inode(inode);
> > iput(inode);
> > -   inode = ERR_PTR(-ESTALE);
> > +   ASSERT(ret < 0);
> > +   inode = ERR_PTR(ret < 0 ? ret : -ESTALE);
> > }
> 
> Just a heads-up.  This change breaks NFS :-(
> 
> The change in error code percolates up the call chain:
> 
>  nfs4_pufh->fh_verify->nfsd_set_fh_dentry->exportfs_decode_fh
> ->btrfs_fh_to_dentry->ntrfs_get_dentry->btrfs_iget
> 
> and nfsd returns NFS4ERR_NOENT to the client instead of NFS4ERR_STALE,
> and the client doesn't handle that quite the same way.
> 
> This doesn't mean that the change is wrong, but it could mean we need to
> fix something else in the path to sanitize the error code.
> 
> nfsd_set_fh_dentry already has
> 
>   error = nfserr_stale;
>   if (PTR_ERR(exp) == -ENOENT)
>   return error;
> 
>   if (IS_ERR(exp))
>   return nfserrno(PTR_ERR(exp));
> 
> for a different error case, so duplicating that would work, but I doubt
> it is best.  At the very least we should check for valid errors, not
> specific invalid ones.
> 
> Bruce: do you have an opinion where we should make sure that PUTFH (and
> various other requests) returns a valid error code?

Uh, I guess not.  Maybe exportfs_decode_fh?

Though my kneejerk reaction is to be cranky and wonder why btrfs
suddenly needs a different convention for decode_fh().

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread J. Bruce Fields
On Fri, Apr 01, 2016 at 11:33:00AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 06:34:17PM -0400, J. Bruce Fields wrote:
> > I haven't looked at the code, but I assume a JUKEBOX-returning write to
> > an absent file brings into cache the bits necessary to perform the
> > write, but stops short of actually doing the write.
> 
> Not exactly, as all subsequent read/write/truncate requests will
> EJUKEBOX until the absent file has been brought back onto disk. Once
> that is done, the next operation attempt will proceed.
> 
> > That allows
> > handling the retried write quickly without doing the wrong thing in the
> > case the retry never comes.
> 
> Essentially. But if a retry never comes it means there's either a
> bug in the client NFS implementation or the client crashed,

NFS clients are under no obligation to retry operations after JUKEBOX.
And I'd expect them not to in the case the calling process was
interrupted, for example.

> > I guess it doesn't matter as much in practice, since the only way you're
> > likely to notice that fallocate unexpectedly succeeded would be if it
> > caused you to hit ENOSPC elsewhere.  Is that right?  Still, it seems a
> > little weird.
> 
> s/succeeded/failed/ and that statement is right.

Sorry, I didn't explain clearly.

The case I was worrying about was the case were the on-the-wire ALLOCATE
call returns JUKEBOX, but the server allocates anyway.

That behavior violates the spec as I understand it.

The client therefore assumes there was no allocation, when in fact there
was.

So, technically a bug, but I wondered if it's likely to bite anyone.
One of the only ways it seems someone would notice would be if it caused
the filesystem to run out of space earlier than I expected.  But perhaps
that's unlikely.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread J. Bruce Fields
On Fri, Apr 01, 2016 at 09:20:23AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 01:47:50PM -0600, Andreas Dilger wrote:
> > On Mar 31, 2016, at 12:08 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > > 
> > > On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> > >> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > >>> On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > >>>> On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > >>>>> Or is it ok that fallocate could block, potentially for a long time as
> > >>>>> we stream cows through the page cache (or however unshare works
> > >>>>> internally)?  Those same programs might not be expecting fallocate to
> > >>>>> take a long time.
> > >>>> 
> > >>>> Yes, it's perfectly fine for fallocate to block for long periods of
> > >>>> time. See what gfs2 does during preallocation of blocks - it ends up
> > >>>> calling sb_issue_zerout() because it doesn't have unwritten
> > >>>> extents, and hence can block for long periods of time
> > >>> 
> > >>> gfs2 fallocate is an implementation that will cause all but the most
> > >>> trivial users real pain.  Even the initial XFS implementation just
> > >>> marking the transactions synchronous made it unusable for all kinds
> > >>> of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> > >>> to gfs2 will probab;ly hand your connection for extended periods of
> > >>> time.
> > >>> 
> > >>> If we need to support something like what gfs2 does we should have a
> > >>> separate flag for it.
> > >> 
> > >> Using fallocate() for preallocation was always intended to
> > >> be a faster, more efficient method allocating zeroed space
> > >> than having userspace write blocks of data. Faster, more efficient
> > >> does not mean instantaneous, and gfs2 using sb_issue_zerout() means
> > >> that if the hardware has zeroing offloads (deterministic trim, write
> > >> same, etc) it will use them, and that will be much faster than
> > >> writing zeros from userspace.
> > >> 
> > >> IMO, what gfs2 is definitely within the intended usage of
> > >> fallocate() for accelerating the preallocation of blocks.
> > >> 
> > >> Yes, it may not be optimal for things like NFS servers which haven't
> > >> considered that a fallocate based offload operation might take some
> > >> time to execute, but that's not a problem with fallocate. i.e.
> > >> that's a problem with the nfs server ALLOCATE implementation not
> > >> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> > >> and timeouts while the operation is run
> > > 
> > > That's an interesting idea, but I don't think it's really legal.  I take
> > > JUKEBOX to mean "sorry, I'm failing this operation for now, try again
> > > later and it might succeed", not "OK, I'm working on it, try again and
> > > you may find out I've done it".
> > > 
> > > So if the client gets a JUKEBOX error but the server goes ahead and does
> > > the operation anyway, that'd be unexpected.
> > 
> > Well, the tape continued to be mounted in the background and/or the file
> > restored from the tape into the filesystem...
> 
> Right, and SGI have been shipping a DMAPI-aware Linux NFS server for
> many years, using the above NFSERR_JUKEBOX behaviour for operations
> that may block for a long time due to the need to pull stuff into
> the filesytsem from the slow backing store. Best explanation is in
> the relevant commit in the last published XFS+DMAPI branch from SGI,
> for example:
> 
> http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=28b171cf2b64167826474efbb82ad9d471a05f75

I haven't looked at the code, but I assume a JUKEBOX-returning write to
an absent file brings into cache the bits necessary to perform the
write, but stops short of actually doing the write.  That allows
handling the retried write quickly without doing the wrong thing in the
case the retry never comes.

Implementing fallocate by returning JUKEBOX while still continuing the
allocation in the background is a bit different.

I guess it doesn't matter as much in practice, since the only way you're
likely to notice that fallocate unexpectedly succeeded would be if it
caused you to hit ENOSPC elsewhere.  Is that right?  Still, it seems a
little weird.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fallocate mode flag for "unshare blocks"?

2016-03-31 Thread J. Bruce Fields
On Thu, Mar 31, 2016 at 10:18:50PM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 12:54:40AM -0700, Christoph Hellwig wrote:
> > On Thu, Mar 31, 2016 at 12:18:13PM +1100, Dave Chinner wrote:
> > > On Wed, Mar 30, 2016 at 11:27:55AM -0700, Darrick J. Wong wrote:
> > > > Or is it ok that fallocate could block, potentially for a long time as
> > > > we stream cows through the page cache (or however unshare works
> > > > internally)?  Those same programs might not be expecting fallocate to
> > > > take a long time.
> > > 
> > > Yes, it's perfectly fine for fallocate to block for long periods of
> > > time. See what gfs2 does during preallocation of blocks - it ends up
> > > calling sb_issue_zerout() because it doesn't have unwritten
> > > extents, and hence can block for long periods of time
> > 
> > gfs2 fallocate is an implementation that will cause all but the most
> > trivial users real pain.  Even the initial XFS implementation just
> > marking the transactions synchronous made it unusable for all kinds
> > of applications, and this is much worse.  E.g. a NFS ALLOCATE operation
> > to gfs2 will probab;ly hand your connection for extended periods of
> > time.
> > 
> > If we need to support something like what gfs2 does we should have a
> > separate flag for it.
> 
> Using fallocate() for preallocation was always intended to
> be a faster, more efficient method allocating zeroed space
> than having userspace write blocks of data. Faster, more efficient
> does not mean instantaneous, and gfs2 using sb_issue_zerout() means
> that if the hardware has zeroing offloads (deterministic trim, write
> same, etc) it will use them, and that will be much faster than
> writing zeros from userspace.
> 
> IMO, what gfs2 is definitely within the intended usage of
> fallocate() for accelerating the preallocation of blocks.
> 
> Yes, it may not be optimal for things like NFS servers which haven't
> considered that a fallocate based offload operation might take some
> time to execute, but that's not a problem with fallocate. i.e.
> that's a problem with the nfs server ALLOCATE implementation not
> being prepared to return NFSERR_JUKEBOX to prevent client side hangs
> and timeouts while the operation is run

That's an interesting idea, but I don't think it's really legal.  I take
JUKEBOX to mean "sorry, I'm failing this operation for now, try again
later and it might succeed", not "OK, I'm working on it, try again and
you may find out I've done it".

So if the client gets a JUKEBOX error but the server goes ahead and does
the operation anyway, that'd be unexpected.

I suppose it's comparable to the case where a slow fallocate is
interrupted--would it be legal to return EINTR in that case and leave
the application to sort out whether some part of the allocation had
already happened?  Would it be legal to continue the fallocate under the
covers even after returning EINTR?

But anyway my first inclination is to say that the NFS FALLOCATE
protocol just wasn't designed to handle long-running fallocates, and if
we really need that then we need to give it a way to either report
partial results or to report results asynchronously.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vfs: move btrfs clone ioctls to common code

2015-11-30 Thread J. Bruce Fields
On Thu, Nov 26, 2015 at 07:50:54PM +0100, Christoph Hellwig wrote:
> This patch set moves the existing btrfs clone ioctls that other file
> system have started to implement to common code, and allows the NFS
> server to export this functionality to remote systems.
> 
> This work is based originally on my NFS CLONE prototype, which reused
> code from Anna Schumaker's NFS COPY prototype, as well as various
> updates from Peng Tao to this code.

Looks good to me.  (In particular: ACK to the locks.c and nfsd patches.
But, disclaimer, I haven't tried to test clone.)

--b.

> 
> The patches are also available as a git branch and on gitweb:
> 
>   git://git.infradead.org/users/hch/pnfs.git clone-for-viro
>   
> http://git.infradead.org/users/hch/pnfs.git/shortlog/refs/heads/clone-for-viro
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] locks: new locks_mandatory_area calling convention

2015-11-30 Thread J. Bruce Fields
On Thu, Nov 26, 2015 at 07:50:56PM +0100, Christoph Hellwig wrote:
> Pass a loff_t end for the last byte instead of the 32-bit count
> parameter to allow full file clones even on 32-bit architectures.
> While we're at it also drop the pointless inode argument and simplify
> the read/write selection.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/locks.c | 22 +-
>  fs/read_write.c|  5 ++---
>  include/linux/fs.h | 28 +---
>  3 files changed, 24 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 0d2b326..d503669 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file)
>  
>  /**
>   * locks_mandatory_area - Check for a conflicting lock
> - * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ
> - *   for shared
> - * @inode:  the file to check
>   * @filp:   how the file was opened (if it was)
> - * @offset: start of area to check
> - * @count:  length of area to check
> + * @start:   first byte in the file to check
> + * @end: lastbyte in the file to check
> + * @write:   %true if checking for write access
>   *
>   * Searches the inode's list of locks to find any POSIX locks which conflict.
> - * This function is called from rw_verify_area() and
> - * locks_verify_truncate().
>   */
> -int locks_mandatory_area(int read_write, struct inode *inode,
> -  struct file *filp, loff_t offset,
> -  size_t count)
> +int locks_mandatory_area(struct file *filp, loff_t start, loff_t end,
> + bool write)
>  {
> + struct inode *inode = file_inode(filp);
>   struct file_lock fl;
>   int error;
>   bool sleep = false;
> @@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode 
> *inode,
>   fl.fl_flags = FL_POSIX | FL_ACCESS;
>   if (filp && !(filp->f_flags & O_NONBLOCK))
>   sleep = true;
> - fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK;
> - fl.fl_start = offset;
> - fl.fl_end = offset + count - 1;
> + fl.fl_type = write ? F_WRLCK : F_RDLCK;
> + fl.fl_start = start;
> + fl.fl_end = end;
>  
>   for (;;) {
>   if (filp) {
> diff --git a/fs/read_write.c b/fs/read_write.c
> index c81ef39..48157dd 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, 
> const loff_t *ppos, size_t
>   }
>  
>   if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
> - retval = locks_mandatory_area(
> - read_write == READ ? FLOCK_VERIFY_READ : 
> FLOCK_VERIFY_WRITE,
> - inode, file, pos, count);
> + retval = locks_mandatory_area(file, pos, pos + count - 1,
> + read_write == READ ? false : true);
>   if (retval < 0)
>   return retval;
>   }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 870a76e..e640f791 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj;
>  
>  #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)
>  
> -#define FLOCK_VERIFY_READ  1
> -#define FLOCK_VERIFY_WRITE 2
> -
>  #ifdef CONFIG_FILE_LOCKING
>  extern int locks_mandatory_locked(struct file *);
> -extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, 
> size_t);
> +extern int locks_mandatory_area(struct file *, loff_t, loff_t, bool);
>  
>  /*
>   * Candidates for mandatory locking have the setgid bit set
> @@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode 
> *inode,
>   struct file *filp,
>   loff_t size)
>  {
> - if (inode->i_flctx && mandatory_lock(inode))
> - return locks_mandatory_area(
> - FLOCK_VERIFY_WRITE, inode, filp,
> - size < inode->i_size ? size : inode->i_size,
> - (size < inode->i_size ? inode->i_size - size
> -  : size - inode->i_size)
> - );
> - return 0;
> + if (!inode->i_flctx || !mandatory_lock(inode))
> + return 0;
> +
> + if (size < inode->i_size) {
> + return locks_mandatory_area(filp, size, inode->i_size - 1,
> + true);
> + } else {
> + return locks_mandatory_area(filp, inode->i_size, size - 1,
> + true);

I feel like these callers would be just slightly more self-documenting
if that last parameter was F_WRLCK instead of true.

But I could live with it either way, patch looks like an
improvement--ACK.

--b.

> + }
>  }
>  
>  static inline int break_lease(struct inode *inode, unsigned int mode)
> @@ -2144,9 +2143,8 @@ static inline int 

Re: [PATCH/RFC] make btrfs subvol mounts appear in /proc/mounts

2015-11-02 Thread J. Bruce Fields
On Wed, Oct 28, 2015 at 07:25:10AM +0900, Neil Brown wrote:
> 
> If you create a subvolume in btrfs and access it (by name) without
> mounting it, then the subvolume looks like a separate mount to some
> extent, returning a different st_dev to stat(), but it doesn't look like
> a separate mount in that it isn't listed in /proc/mounts. This
> inconsistency can confuse tools.
> 
> This patch causes these subvolumes to become separate mounts by using
> the VFS' automount functionality, much like NFS uses automount when it
> discovered mountpoints on the server.
> 
> The VFS currently makes it impossible to auto-mount a directory on to itself
> (i.e. a bind mount).  For NFS this isn't a problem as a new superblock
> is created for the child filesystem so there are two separate dentries
> (and inodes) for the one directory: one in the parent filesystem, one in
> the child (note that the two superblocks share a common connection to
> the server so there is still a lot of commonality).
> 
> BTRFS has chosen instead to use a single superblock for all subvolumes.

Naive question: was there a reason for that choice?

--b.

> This results in a single dentry for the subvol-root.  A dentry which
> must be auto-mounted on itself.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 5/4] copy_file_range.2: New page documenting copy_file_range()

2015-10-26 Thread J. Bruce Fields
On Mon, Oct 26, 2015 at 12:19:33PM +, Pádraig Brady wrote:
> On 26/10/15 03:39, Christoph Hellwig wrote:
> > On Sat, Oct 24, 2015 at 01:02:21PM +0100, P??draig Brady wrote:
> >> I'm a bit worried about the sparse expansion and default reflinking
> >> which might preclude cp(1) from using this call in most cases, but I will
> >> test and try to use it. coreutils has heuristics for determining if files
> >> are remote, which we might use to restrict to that use case.
> > 
> > Can you explain why reflinking and hole expansion are an issue if done
> > locally and not if done remotely?  I'd really like to make the call as
> > usable as possible for everyone, but we really need clear sem�ntics for
> > that.
> 
> Fair point on local vs remote. I was just assuming that remote
> copy offload would not do reflinking on the backend, or at
> least wasn't an exposed option over the remote interface.

The server could definitely do a reflink.  More generally, from the
description of the NFS COPY operation:

https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-39#page-64

If the copy completes successfully, either synchronously or
asynchronously, the data copied from the source file to the
destination file MUST appear identical to the NFS client.
However, the NFS server's on disk representation of the data in
the source file and destination file MAY differ.  For example,
the NFS server might encrypt, compress, deduplicate, or
otherwise represent the on disk data in the source and
destination file differently.

> I get the impression that you think reflinking should be hidden
> from the user, i.e. cp(1) should not have had the --reflink option
> (for the last 6 years)?  I'm not convinced of that, and even so
> I think lower level interfaces would benefit from finer grained options.
> This would be especially useful since there is no general interface
> to reflink at present. I was happy with the reflink control options,
> thinking the extra control could allow cp to use this by default.

Maybe that's a case for Christoph's "clone" operation.

I agree with him that it makes sense to allow the filesystem to
implement "copy" using reflink or similar tricks under the covers.  And
that in fact it's difficult to imagine how you'd prevent that in the
presence of layers of filesystem or block protocols underneath.

That "cp" flag seems strange to me, but if "cp" wants to take advantage
of a copy system call while continuing to make something like that
distinction then I suppose it could fallocate the destination range file
after the copy.

--b.

> > Also note that Annas current series allows for hole filling - any decent
> > implementation should not do them, but that's really a quality of
> > implementation and not an interface issue.
> 
> I think you're saying the default `cp --sparse=auto` operation
> could rely on copy_file_range(...complete file...), while
> cp --sparse={always,never} would have to iterate over the
> file, punching or filling holes as appropriate. I thought
> Anna indicated differently wrt splice filling holes by default.
> 
> TBH I'm not clear on the semantics of the current implementation,
> so need to test the above in various cases.
> 
> thanks,
> Pádraig.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

2015-10-19 Thread J. Bruce Fields
On Sun, Oct 18, 2015 at 11:30:13AM -0700, Christoph Hellwig wrote:
> Just commenting on the man page here as the comment is about sematics.
> All the infrastructure in the patch looks reasonable to me, but this
> is something we need to get right.
> 
> > +.B COPY_FR_REFLINK
> > +Create a lightweight "reflink", where data is not copied until
> > +one of the files is modified.
> > +.PP
> > +The default behavior
> > +.RI ( flags
> > +== 0) is to perform a full data copy of the requested range.
> > +.SH RETURN VALUE
> > +Upon successful completion,
> > +.BR copy_file_range ()
> > +will return the number of bytes copied between files.
> > +This could be less than the length originally requested.
> 
> As mentioned in the previous discussion I fundamentally disagree with
> the way your word the flags here.
> 
> flags = 0 gives you the data from source at dest, period.  How it's
> implemented is up to the file system as a user cannot observe how data
> actually is stored underneath.
> 
> Additionaly I think the 'clone' option with it's stronger guarantees
> should be a separate system call.  So for now just have no supported
> flag and leave it up to the file system and storage device how to
> implement it.

So, continue to include a "flags" field but just error out if it's
anything but zero for now?

Sounds fine by me.  We can always implement the other stuff later.

--b.

> 
> For the future a COPY_FALLOC flag taht guaranatees you do not get ENOSPC
> on the copied range will be very useful, but given the complexity I
> think it's not something we should add now.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)

2015-06-26 Thread J. Bruce Fields
On Thu, Jun 25, 2015 at 06:12:57PM -0400, Theodore Ts'o wrote:
 On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
  Looks OK to me.  As I say I'd expect i_version_seen == true to end up
  being the common case in a lot of v4 workloads, so I'm more skeptical of
  the claim of a performance improvement in the v4 case.
 
 Well, so long as we require i_version to be committed to disk on every
 single disk write, we're going to be trading off:
 
   * client-side performance of the advanced NFSv4 cacheing for reads
   * server-side performance for writes
   * data robustness in case of the server crashing and the client-side 
 cache
 getting out of sync with the server after the crash
 
 I don't see any way around that.  (So for example, with lazy mtime
 updates we wouldn't be updating the inode after every single
 non-allocating write; enabling i_version updates will trash that
 optimization.)
 
 I just want to reduce to a bare minimum the performance hit in the
 case where NFSv4 exports are not being used (since that is true in a
 very *large* number of ext4 deployments --- i.e., every single Android
 handset using ext4 :-), such that we can leave i_version updates
 turned on by default.

Definitely understood.  I think it's a good idea.

  Could maintaining the new flag be a significant drag in itself?  If not,
  then I guess we're not making things any worse there, so fine.
 
 I don't think so; it's a bit in the in-memory inode, so I don't think
 that should be an issue.

OK!

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)

2015-06-25 Thread J. Bruce Fields
On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
 On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
  Moving the discussion to fsdevel.
  
  Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
  generic 'noiversion' option cannot be used to achieve that. It is
  processed before it reaches btrfs superblock callback, where
  MS_I_VERSION is forced.
  
  The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
  to which I object.
 
 I was talking to Mingming about this on today's ext4 conference call,
 and one of the reasons why ext4 turns off i_version update by default
 is because it does a real number on our performance as well --- and
 furthermore, the only real user of the field from what we can tell is
 NFSv4, which not all that many ext4 users actually care about.
 
 This has caused pain for the nfsv4 folks since it means that they need
 to tell people to use a special mount option for ext4 if they are
 actually using this for nfsv4, and I suspect they won't be all that
 eager to hear that btrfs is going to go the same way.

Yes, thanks for looking into this!

 This however got us thinking --- even in if NFSv4 is depending on
 i_version, it doesn't actually _look_ at that field all that often.

Most clients will query it on every write.  (I just took a quick look at
the code and I believe the Linux client's requesting it immediately
after every write, except in the O_DIRECT and delegated cases.)

 It's only going to look at it in a response to a client's getattr
 call, and that in turn is used to so the client can do its local disk
 cache invalidation if anby of the data blocks of the inode has changed.
 
 So what if we have a per-inode flag which don't update I_VERSION,
 which is off by default, but after the i_version has been updated at
 least once, is set, so the i_version field won't be updated again ---
 at least until something has actually looked at the i_version field,
 when the don't update I_VERSOIN flag will get cleared again.
 
 So basically, if we know there are no microphones in the forest, we
 don't need to make the tree fall.  However, if someone has sampled the
 i_version field, then the next time the inode gets updated, we will
 update the i_version field so the NFSv4 client can hear the sound of
 the tree crashing to the forst floor and so it can invalidate its
 local cache of the file.  :-)
 
 This should significantly improve the performance of using the
 i_version field if the file system is being exported via NFSv4, and if
 NFSv4 is not in use, no one will be looking at the i_version field, so
 the performance impact will be very slight, and thus we could enable
 i_version updates by default for btrfs and ext4.
 
 And this should make the distribution folks happy, since it will unify
 the behavior of all file systems, and make life easier for users who
 won't need to set certain magic mount options depending on what file
 system they are using and whether they are using NFSv4 or not.
 
 Does this sound reasonable?

Just to make sure I understand, the logic is something like:

to read the i_version:

inode-i_version_seen = true;
return inode-i_version

to update the i_version:

/*
 * If nobody's seen this value of i_version then we can
 * keep using it, otherwise we need a new one:
 */
if (inode-i_version_seen)
inode-i_version++;
inode-i_version_seen = false;

Looks OK to me.  As I say I'd expect i_version_seen == true to end up
being the common case in a lot of v4 workloads, so I'm more skeptical of
the claim of a performance improvement in the v4 case.

Could maintaining the new flag be a significant drag in itself?  If not,
then I guess we're not making things any worse there, so fine.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-14 Thread J. Bruce Fields
On Tue, Apr 14, 2015 at 11:22:41AM -0700, Zach Brown wrote:
 On Tue, Apr 14, 2015 at 02:19:11PM -0400, J. Bruce Fields wrote:
  On Tue, Apr 14, 2015 at 01:16:13PM -0400, Anna Schumaker wrote:
   On 04/14/2015 12:53 PM, Christoph Hellwig wrote:
On Sat, Apr 11, 2015 at 09:04:02AM -0400, Jeff Layton wrote:
Yuck! How the heck do you clean up the mess if that happens? I
guess you're just stuck redoing the copy with normal READ/WRITE?
   
Maybe we need to have the interface return a hard error in that
case and not try to give back any sort of offset?

The NFSv4.2 COPY interface is a train wreck.  At least for Linux I'd
expect us to simply ignore it and only implement my new CLONE
operation with sane semantics.  That is unless someone can show some
real life use case for the inter server copy, in which case we'll
have to deal with that mess.  But getting that one right at the VFS
level will be a nightmare anyway.

Make this a vote from me to not support partial copies and just
return and error in that case.
   
   Agreed.  Looking at the v4.2 spec, COPY does take ca_consecutive and a
   ca_synchronous flags that let the client state if the copy should be
   done consecutively or synchronously.  I expected to always set
   consecutive to true for the Linux client.
  
  That's supposed to mean results are well-defined in the partial-copy
  case, but I think Christoph's suggesting eliminating the partial-copy
  case entirely?
  
  Which would be fine with me.
  
  It might actually have been me advocating for partial copies.  But that
  was only because a partial-copy-handling-loop seemed simpler to me than
  progress callbacks if we were going to support long-running copies.
  
  I'm happy enough not to have it at all.
 
 Ah, OK, that's great news.
 
 I thought at one point we were worried about very long running RPCs on
 the server.  Are we not worried about that now?
 
 Is the client expected to cut the work up into arbitrarily managable
 chunks?  Is the server expected to fail COPY/CLONE requests that it
 thinks would take way too long?  Something else?

Christoph is proposing a CLONE rpc that's required to be atomic:


https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-35#section-15.13
The CLONE operation is atomic, that is either all changes or no
changes are seen by the client or other clients.

So that couldn't be really long-running (or the server is nuts).

So that'd mean Anna would rip out the server-side copy loop and we'd
initially just support btrfs or whatever.

I mean the server-side copy loop may also be useful but I'm all for
wiring up the obvious case first.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-14 Thread J. Bruce Fields
On Tue, Apr 14, 2015 at 01:16:13PM -0400, Anna Schumaker wrote:
 On 04/14/2015 12:53 PM, Christoph Hellwig wrote:
  On Sat, Apr 11, 2015 at 09:04:02AM -0400, Jeff Layton wrote:
  Yuck! How the heck do you clean up the mess if that happens? I
  guess you're just stuck redoing the copy with normal READ/WRITE?
 
  Maybe we need to have the interface return a hard error in that
  case and not try to give back any sort of offset?
  
  The NFSv4.2 COPY interface is a train wreck.  At least for Linux I'd
  expect us to simply ignore it and only implement my new CLONE
  operation with sane semantics.  That is unless someone can show some
  real life use case for the inter server copy, in which case we'll
  have to deal with that mess.  But getting that one right at the VFS
  level will be a nightmare anyway.
  
  Make this a vote from me to not support partial copies and just
  return and error in that case.
 
 Agreed.  Looking at the v4.2 spec, COPY does take ca_consecutive and a
 ca_synchronous flags that let the client state if the copy should be
 done consecutively or synchronously.  I expected to always set
 consecutive to true for the Linux client.

That's supposed to mean results are well-defined in the partial-copy
case, but I think Christoph's suggesting eliminating the partial-copy
case entirely?

Which would be fine with me.

It might actually have been me advocating for partial copies.  But that
was only because a partial-copy-handling-loop seemed simpler to me than
progress callbacks if we were going to support long-running copies.

I'm happy enough not to have it at all.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v2 0/5] add support for a lazytime mount option

2014-11-24 Thread J. Bruce Fields
On Mon, Nov 24, 2014 at 06:57:27AM -0500, Theodore Ts'o wrote:
 If we want to be paranoid, we handle i_version updates non-lazily; I
 can see arguments in favor of that.
 
 Ext4 only enables MS_I_VERSION if the user asks for it explicitly, so
 it wouldn't cause me any problems.  However, xfs and btrfs enables it
 by default, so that means xfs and btrfs wouldn't see the benefits of
 lazytime (if you're going to have to push I_VERSION to disk, you might
 as well update the [acm]time while you're at it).  I've always thought
 that we *should* do is to only enable it if nfsv4 is serving the file
 system, and not otherwise, though, which would also give us
 consistency across all the file systems.

I guess you need to worry about the case where you shutdown nfsd, modify
a file, then restart nfsd--you don't want a client to miss the
modification in that case.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/9] dcache: d_obtain_alias callers don't all want DISCONNECTED

2014-02-24 Thread J. Bruce Fields
On Thu, Feb 20, 2014 at 05:44:14PM -0800, Christoph Hellwig wrote:
   
  -   return d_obtain_alias(inode);
  +   return d_obtain_alias_root(inode);
 
 Can we call this d_obtain_root or similar, please?

Yes, I like d_obtain_root better, done.

I'll send out the updated series sometime.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol

2014-02-18 Thread J. Bruce Fields
On Fri, Feb 14, 2014 at 09:45:24PM -0500, J. Bruce Fields wrote:
 On Fri, Feb 14, 2014 at 05:40:55PM -0800, Eric W. Biederman wrote:
  J. Bruce Fields bfie...@fieldses.org writes:
  
   On Fri, Feb 14, 2014 at 01:43:48PM -0500, Josef Bacik wrote:
   A user was running into errors from an NFS export of a subvolume that 
   had a
   default subvol set.  When we mount a default subvol we will use 
   d_obtain_alias()
   to find an existing dentry for the subvolume in the case that the root 
   subvol
   has already been mounted, or a dummy one is allocated in the case that 
   the root
   subvol has not already been mounted.  This allows us to connect the 
   dentry later
   on if we wander into the path.  However if we don't ever wander into the 
   path we
   will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It 
   doesn't
   appear to cause any problems but it is annoying nonetheless, so simply 
   unset
   DCACHE_DISCONNECTED in the get_default_root case and switch 
   btrfs_lookup() to
   use d_materialise_unique() instead which will make everything play nicely
   together and reconnect stuff if we wander into the defaul subvol path 
   from a
   different way.  With this patch I'm no longer getting the NFS errors when
   exporting a volume that has been mounted with a default subvol set.  
   Thanks,
  
   Looks obviously correct, but based on a quick grep, there are four
   d_obtain_alias callers outside export methods:
  
 - btrfs/super.c:get_default_root()
 - fs/ceph/super.c:open_root_dentry()
 - fs/nfs/getroot.c:nfs_get_root()
 - fs/nilfs2/super.c:nilfs_get_root_dentry()
  
   It'd be nice to give them a common d_obtain_alias variant instead of
   making them all clear this by hand.
  
  I am in favor of one small fix at a time, so that progress is made and
  fixing something just for btrfs seems reasonable for the short term.
  
   Of those nilfs2 also uses d_splice_alias.  I think that problem would
   best be solved by fixing d_splice_alias not to require a
   DCACHE_DISCONNECTED dentry; IS_ROOT() on its own should be fine.
  
  You mean by renaming d_splice_alias d_materialise_unique?
  
  Or is there a useful distinction you see that should be preserved
  between the two methods?
  
  Right now my inclination is that everyone should just use
  d_materialise_unique and we should kill d_splice_alias.
 
 Probably.  One remaining distinction:
 
   - In the local filesystem case if you discover a directory is
 already aliased elsewhere, you have a corrupted filesystem and
 want to error out the lookup.  (Didn't you propose a patch to
 do something like that before?)
   - In the distributed filesystem this is perfectly normal and we
 want to do our best to fix up our local cache to represent
 remote reality.

The following keeps the d_splice_alias/d_materialise_unique distinction
and (hopefully) fixes Josef's bug, and does a little cleanup (including
your suggested DISCONNECTED-CONNECTING rename).

Any better idea for the naming of d_materialise_unique?

I also didn't try to merge the implementations--the merged
d_splice_alias/d_materialise_unique was a little uglier than I expected.
I'll keep messing around with it though.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/9] dcache: d_splice_alias mustn't create directory aliases

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

Currently if d_splice_alias finds a directory with an alias that is not
IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory.

Duplicate directory dentries are unacceptable; it is better just to
error out.

(In the case of a local filesystem the most likely case is filesystem
corruption: for example, perhaps two directories point to the same child
directory, and the other parent has already been found and cached.)

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 fs/dcache.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fd50e52..4550227 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2672,6 +2672,9 @@ static void __d_materialise_dentry(struct dentry *dentry, 
struct dentry *anon)
  * DCACHE_DISCONNECTED), then d_move that in place of the given dentry
  * and return it, else simply d_add the inode to the dentry and return NULL.
  *
+ * If a non-IS_ROOT directory is found, the filesystem is corrupt, and
+ * we should error out: directories can't have multiple aliases.
+ *
  * This is needed in the lookup routine of any filesystem that is exportable
  * (via knfsd) so that we can build dcache paths to directories effectively.
  *
@@ -2692,9 +2695,13 @@ struct dentry *d_splice_alias(struct inode *inode, 
struct dentry *dentry)
 
if (inode  S_ISDIR(inode-i_mode)) {
spin_lock(inode-i_lock);
-   new = __d_find_alias(inode, 1);
+   new = __d_find_any_alias(inode);
if (new) {
-   BUG_ON(!(new-d_flags  DCACHE_DISCONNECTED));
+   if (!IS_ROOT(new) || !(new-d_flags  
DCACHE_DISCONNECTED)) {
+   spin_unlock(inode-i_lock);
+   dput(new);
+   return ERR_PTR(-EIO);
+   }
write_seqlock(rename_lock);
__d_materialise_dentry(dentry, new);
write_sequnlock(rename_lock);
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/9] dcache: close d_move race in d_splice_alias

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

d_splice_alias will d_move an IS_ROOT() directory dentry into place if
one exists.  This should be safe as long as the dentry remains IS_ROOT,
but I can't see what guarantees that: once we drop the i_lock all we
hold here is the i_mutex on an unrelated parent directory.

Instead copy the logic of d_materialise_unique.

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 fs/dcache.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 332b58c..fd50e52 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2695,9 +2695,14 @@ struct dentry *d_splice_alias(struct inode *inode, 
struct dentry *dentry)
new = __d_find_alias(inode, 1);
if (new) {
BUG_ON(!(new-d_flags  DCACHE_DISCONNECTED));
+   write_seqlock(rename_lock);
+   __d_materialise_dentry(dentry, new);
+   write_sequnlock(rename_lock);
+   __d_drop(new);
+   _d_rehash(new);
+   spin_unlock(new-d_lock);
spin_unlock(inode-i_lock);
security_d_instantiate(new, inode);
-   d_move(new, dentry);
iput(inode);
} else {
/* already taking inode-i_lock, so d_add() by hand */
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/9] dcache: d_find_alias needn't recheck IS_ROOT DCACHE_DISCONNECTED

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

If we get to this point and discover the dentry is not a root dentry, or
not DCACHE_DISCONNECTED--great, we always prefer that anyway.

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 fs/dcache.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index efe3d3b..448ef98 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -753,12 +753,9 @@ again:
alias = discon_alias;
spin_lock(alias-d_lock);
if (S_ISDIR(inode-i_mode) || !d_unhashed(alias)) {
-   if (IS_ROOT(alias) 
-   (alias-d_flags  DCACHE_DISCONNECTED)) {
-   __dget_dlock(alias);
-   spin_unlock(alias-d_lock);
-   return alias;
-   }
+   __dget_dlock(alias);
+   spin_unlock(alias-d_lock);
+   return alias;
}
spin_unlock(alias-d_lock);
goto again;
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/9] dcache: d_splice_alias should ignore DCACHE_DISCONNECTED

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

Any IS_ROOT() alias should be safe to use; there's nothing special about
DCACHE_DISCONNECTED dentries.

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 fs/dcache.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4550227..e3e4618 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2668,9 +2668,9 @@ static void __d_materialise_dentry(struct dentry *dentry, 
struct dentry *anon)
  * @inode:  the inode which may have a disconnected dentry
  * @dentry: a negative dentry which we want to point to the inode.
  *
- * If inode is a directory and has a 'disconnected' dentry (i.e. IS_ROOT and
- * DCACHE_DISCONNECTED), then d_move that in place of the given dentry
- * and return it, else simply d_add the inode to the dentry and return NULL.
+ * If inode is a directory and has an IS_ROOT alias, then d_move that in
+ * place of the given dentry and return it, else simply d_add the inode
+ * to the dentry and return NULL.
  *
  * If a non-IS_ROOT directory is found, the filesystem is corrupt, and
  * we should error out: directories can't have multiple aliases.
@@ -2697,7 +2697,7 @@ struct dentry *d_splice_alias(struct inode *inode, struct 
dentry *dentry)
spin_lock(inode-i_lock);
new = __d_find_any_alias(inode);
if (new) {
-   if (!IS_ROOT(new) || !(new-d_flags  
DCACHE_DISCONNECTED)) {
+   if (!IS_ROOT(new)) {
spin_unlock(inode-i_lock);
dput(new);
return ERR_PTR(-EIO);
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/9] dcache: move d_splice_alias

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

Just a trivial move to locate it near (similar) d_materialise_unique
code and save some forward references in a following patch.

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 fs/dcache.c | 104 ++--
 1 file changed, 52 insertions(+), 52 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce..332b58c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1905,58 +1905,6 @@ struct dentry *d_obtain_alias(struct inode *inode)
 EXPORT_SYMBOL(d_obtain_alias);
 
 /**
- * d_splice_alias - splice a disconnected dentry into the tree if one exists
- * @inode:  the inode which may have a disconnected dentry
- * @dentry: a negative dentry which we want to point to the inode.
- *
- * If inode is a directory and has a 'disconnected' dentry (i.e. IS_ROOT and
- * DCACHE_DISCONNECTED), then d_move that in place of the given dentry
- * and return it, else simply d_add the inode to the dentry and return NULL.
- *
- * This is needed in the lookup routine of any filesystem that is exportable
- * (via knfsd) so that we can build dcache paths to directories effectively.
- *
- * If a dentry was found and moved, then it is returned.  Otherwise NULL
- * is returned.  This matches the expected return value of -lookup.
- *
- * Cluster filesystems may call this function with a negative, hashed dentry.
- * In that case, we know that the inode will be a regular file, and also this
- * will only occur during atomic_open. So we need to check for the dentry
- * being already hashed only in the final case.
- */
-struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
-{
-   struct dentry *new = NULL;
-
-   if (IS_ERR(inode))
-   return ERR_CAST(inode);
-
-   if (inode  S_ISDIR(inode-i_mode)) {
-   spin_lock(inode-i_lock);
-   new = __d_find_alias(inode, 1);
-   if (new) {
-   BUG_ON(!(new-d_flags  DCACHE_DISCONNECTED));
-   spin_unlock(inode-i_lock);
-   security_d_instantiate(new, inode);
-   d_move(new, dentry);
-   iput(inode);
-   } else {
-   /* already taking inode-i_lock, so d_add() by hand */
-   __d_instantiate(dentry, inode);
-   spin_unlock(inode-i_lock);
-   security_d_instantiate(dentry, inode);
-   d_rehash(dentry);
-   }
-   } else {
-   d_instantiate(dentry, inode);
-   if (d_unhashed(dentry))
-   d_rehash(dentry);
-   }
-   return new;
-}
-EXPORT_SYMBOL(d_splice_alias);
-
-/**
  * d_add_ci - lookup or allocate new dentry with case-exact name
  * @inode:  the inode case-insensitive lookup has found
  * @dentry: the negative dentry that was passed to the parent's lookup func
@@ -2716,6 +2664,58 @@ static void __d_materialise_dentry(struct dentry 
*dentry, struct dentry *anon)
 }
 
 /**
+ * d_splice_alias - splice a disconnected dentry into the tree if one exists
+ * @inode:  the inode which may have a disconnected dentry
+ * @dentry: a negative dentry which we want to point to the inode.
+ *
+ * If inode is a directory and has a 'disconnected' dentry (i.e. IS_ROOT and
+ * DCACHE_DISCONNECTED), then d_move that in place of the given dentry
+ * and return it, else simply d_add the inode to the dentry and return NULL.
+ *
+ * This is needed in the lookup routine of any filesystem that is exportable
+ * (via knfsd) so that we can build dcache paths to directories effectively.
+ *
+ * If a dentry was found and moved, then it is returned.  Otherwise NULL
+ * is returned.  This matches the expected return value of -lookup.
+ *
+ * Cluster filesystems may call this function with a negative, hashed dentry.
+ * In that case, we know that the inode will be a regular file, and also this
+ * will only occur during atomic_open. So we need to check for the dentry
+ * being already hashed only in the final case.
+ */
+struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
+{
+   struct dentry *new = NULL;
+
+   if (IS_ERR(inode))
+   return ERR_CAST(inode);
+
+   if (inode  S_ISDIR(inode-i_mode)) {
+   spin_lock(inode-i_lock);
+   new = __d_find_alias(inode, 1);
+   if (new) {
+   BUG_ON(!(new-d_flags  DCACHE_DISCONNECTED));
+   spin_unlock(inode-i_lock);
+   security_d_instantiate(new, inode);
+   d_move(new, dentry);
+   iput(inode);
+   } else {
+   /* already taking inode-i_lock, so d_add() by hand */
+   __d_instantiate(dentry, inode);
+   spin_unlock(inode-i_lock);
+   security_d_instantiate(dentry, inode

[PATCH 9/9] dcache: rename DCACHE_DISCONNECTED - DCACHE_CONNECTING

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

The DCACHE_DISCONNECTED flag was intended *only* to mark dentries which
were looked up by filehandle and are currently in the process of being
hooked up to the rest of dcache.

Over time people have also confused it with IS_ROOT, using it to mark
directories that are permanently disconnected just because for example
they're the root of some filesystem.

Perhaps a different name would have helped. (Thanks to Eric Biederman
for the suggestion of DCACHE_CONNECTING.)

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 Documentation/filesystems/nfs/Exporting |  4 ++--
 drivers/staging/lustre/lustre/llite/namei.c |  4 ++--
 fs/dcache.c |  8 
 fs/exportfs/expfs.c | 16 
 fs/fat/namei_vfat.c |  4 ++--
 fs/nfsd/nfsfh.c |  4 ++--
 fs/ocfs2/dcache.c   |  2 +-
 fs/ocfs2/namei.c|  2 +-
 include/linux/dcache.h  |  4 ++--
 9 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/Documentation/filesystems/nfs/Exporting 
b/Documentation/filesystems/nfs/Exporting
index 9b7de5b..6d3b272 100644
--- a/Documentation/filesystems/nfs/Exporting
+++ b/Documentation/filesystems/nfs/Exporting
@@ -52,7 +52,7 @@ the dcache that are not needed for normal filesystem access.
 
 To implement these features, the dcache has:
 
-a/ A dentry flag DCACHE_DISCONNECTED which is set on
+a/ A dentry flag DCACHE_CONNECTING which is set on
any dentry that might not be part of the proper prefix.
This is set when anonymous dentries are created, and cleared when a
dentry is noticed to be a child of a dentry which is in the proper
@@ -69,7 +69,7 @@ c/ Helper routines to allocate anonymous dentries, and to 
help attach
 d_obtain_alias(inode) will return a dentry for the given inode.
   If the inode already has a dentry, one of those is returned.
   If it doesn't, a new anonymous (IS_ROOT and
-DCACHE_DISCONNECTED) dentry is allocated and attached.
+DCACHE_CONNECTING) dentry is allocated and attached.
   In the case of a directory, care is taken that only one dentry
   can ever be attached.
 d_splice_alias(inode, dentry) or d_materialise_unique(dentry, inode)
diff --git a/drivers/staging/lustre/lustre/llite/namei.c 
b/drivers/staging/lustre/lustre/llite/namei.c
index fc8d264..f27 100644
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -368,7 +368,7 @@ static struct dentry *ll_find_alias(struct inode *inode, 
struct dentry *dentry)
LASSERT(alias != dentry);
 
spin_lock(alias-d_lock);
-   if (alias-d_flags  DCACHE_DISCONNECTED)
+   if (alias-d_flags  DCACHE_CONNECTING)
/* LASSERT(last_discon == NULL); LU-405, bz 20055 */
discon_alias = alias;
else if (alias-d_parent == dentry-d_parent 
@@ -395,7 +395,7 @@ static struct dentry *ll_find_alias(struct inode *inode, 
struct dentry *dentry)
 
 /*
  * Similar to d_splice_alias(), but lustre treats invalid alias
- * similar to DCACHE_DISCONNECTED, and tries to use it anyway.
+ * similar to DCACHE_CONNECTING, and tries to use it anyway.
  */
 struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
 {
diff --git a/fs/dcache.c b/fs/dcache.c
index 448ef98..0d5e1a5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -726,7 +726,7 @@ EXPORT_SYMBOL(dget_parent);
  * it can be unhashed only if it has no children, or if it is the root
  * of a filesystem.
  *
- * If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
+ * If the inode has an IS_ROOT, DCACHE_CONNECTING alias, then prefer
  * any other hashed alias over that one.
  */
 static struct dentry *__d_find_alias(struct inode *inode)
@@ -739,7 +739,7 @@ again:
spin_lock(alias-d_lock);
if (S_ISDIR(inode-i_mode) || !d_unhashed(alias)) {
if (IS_ROOT(alias) 
-   (alias-d_flags  DCACHE_DISCONNECTED)) {
+   (alias-d_flags  DCACHE_CONNECTING)) {
discon_alias = alias;
} else {
__dget_dlock(alias);
@@ -1860,7 +1860,7 @@ struct dentry *__d_obtain_alias(struct inode *inode, int 
disconnected)
add_flags = d_flags_for_inode(inode);
 
if (disconnected)
-   add_flags |= DCACHE_DISCONNECTED;
+   add_flags |= DCACHE_CONNECTING;
 
spin_lock(tmp-d_lock);
tmp-d_inode = inode;
@@ -1883,7 +1883,7 @@ struct dentry *__d_obtain_alias(struct inode *inode, int 
disconnected)
 }
 
 /**
- * d_obtain_alias - find or allocate a DISCONNECTED dentry for a given inode
+ * d_obtain_alias - find or allocate a CONNECTING dentry for a given inode

[PATCH 6/9] dcache: remove unused d_find_alias parameter

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 fs/dcache.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3a1057a..efe3d3b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -719,8 +719,6 @@ EXPORT_SYMBOL(dget_parent);
 /**
  * d_find_alias - grab a hashed alias of inode
  * @inode: inode in question
- * @want_discon:  flag, used by d_splice_alias, to request
- *  that only a DISCONNECTED alias be returned.
  *
  * If inode has a hashed alias, or is a directory and has any alias,
  * acquire the reference to alias and return it. Otherwise return NULL.
@@ -729,10 +727,9 @@ EXPORT_SYMBOL(dget_parent);
  * of a filesystem.
  *
  * If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
- * any other hashed alias over that one unless @want_discon is set,
- * in which case only return an IS_ROOT, DCACHE_DISCONNECTED alias.
+ * any other hashed alias over that one.
  */
-static struct dentry *__d_find_alias(struct inode *inode, int want_discon)
+static struct dentry *__d_find_alias(struct inode *inode)
 {
struct dentry *alias, *discon_alias;
 
@@ -744,7 +741,7 @@ again:
if (IS_ROOT(alias) 
(alias-d_flags  DCACHE_DISCONNECTED)) {
discon_alias = alias;
-   } else if (!want_discon) {
+   } else {
__dget_dlock(alias);
spin_unlock(alias-d_lock);
return alias;
@@ -775,7 +772,7 @@ struct dentry *d_find_alias(struct inode *inode)
 
if (!hlist_empty(inode-i_dentry)) {
spin_lock(inode-i_lock);
-   de = __d_find_alias(inode, 0);
+   de = __d_find_alias(inode);
spin_unlock(inode-i_lock);
}
return de;
@@ -2784,7 +2781,7 @@ struct dentry *d_materialise_unique(struct dentry 
*dentry, struct inode *inode)
struct dentry *alias;
 
/* Does an aliased dentry already exist? */
-   alias = __d_find_alias(inode, 0);
+   alias = __d_find_alias(inode);
if (alias) {
actual = alias;
write_seqlock(rename_lock);
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/9] dcache: d_obtain_alias callers don't all want DISCONNECTED

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

There are a few d_obtain_alias callers that are using it to get the
root of a filesystem which may already have an alias somewhere else.

This is not the same as the filehandle-lookup case, and none of them
actually need DCACHE_DISCONNECTED set.

In the btrfs case this was causing a spurious printk from
nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
dentry.

Which isn't really a serious problem, but it would really be clearer if
we reserved DCACHE_DISCONNECTED for those cases where it's actually
needed.

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 fs/btrfs/super.c   |  2 +-
 fs/ceph/super.c|  2 +-
 fs/dcache.c| 69 +++---
 fs/nfs/getroot.c   |  2 +-
 fs/nilfs2/super.c  |  2 +-
 include/linux/dcache.h |  1 +
 6 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97cc241..07ce96b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -925,7 +925,7 @@ setup_root:
return dget(sb-s_root);
}
 
-   return d_obtain_alias(inode);
+   return d_obtain_alias_root(inode);
 }
 
 static int btrfs_fill_super(struct super_block *sb,
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 2df963f..9666950 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -732,7 +732,7 @@ static struct dentry *open_root_dentry(struct 
ceph_fs_client *fsc,
goto out;
}
} else {
-   root = d_obtain_alias(inode);
+   root = d_obtain_alias_root(inode);
}
ceph_init_dentry(root);
dout(open_root_inode success, root dentry is %p\n, root);
diff --git a/fs/dcache.c b/fs/dcache.c
index e3e4618..3a1057a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1832,25 +1832,7 @@ struct dentry *d_find_any_alias(struct inode *inode)
 }
 EXPORT_SYMBOL(d_find_any_alias);
 
-/**
- * d_obtain_alias - find or allocate a dentry for a given inode
- * @inode: inode to allocate the dentry for
- *
- * Obtain a dentry for an inode resulting from NFS filehandle conversion or
- * similar open by handle operations.  The returned dentry may be anonymous,
- * or may have a full name (if the inode was already in the cache).
- *
- * When called on a directory inode, we must ensure that the inode only ever
- * has one dentry.  If a dentry is found, that is returned instead of
- * allocating a new one.
- *
- * On successful return, the reference to the inode has been transferred
- * to the dentry.  In case of an error the reference on the inode is released.
- * To make it easier to use in export operations a %NULL or IS_ERR inode may
- * be passed in and will be the error will be propagate to the return value,
- * with a %NULL @inode replaced by ERR_PTR(-ESTALE).
- */
-struct dentry *d_obtain_alias(struct inode *inode)
+struct dentry *__d_obtain_alias(struct inode *inode, int disconnected)
 {
static const struct qstr anonstring = QSTR_INIT(/, 1);
struct dentry *tmp;
@@ -1881,7 +1863,10 @@ struct dentry *d_obtain_alias(struct inode *inode)
}
 
/* attach a disconnected dentry */
-   add_flags = d_flags_for_inode(inode) | DCACHE_DISCONNECTED;
+   add_flags = d_flags_for_inode(inode);
+
+   if (disconnected)
+   add_flags |= DCACHE_DISCONNECTED;
 
spin_lock(tmp-d_lock);
tmp-d_inode = inode;
@@ -1902,9 +1887,53 @@ struct dentry *d_obtain_alias(struct inode *inode)
iput(inode);
return res;
 }
+
+/**
+ * d_obtain_alias - find or allocate a DISCONNECTED dentry for a given inode
+ * @inode: inode to allocate the dentry for
+ *
+ * Obtain a dentry for an inode resulting from NFS filehandle conversion or
+ * similar open by handle operations.  The returned dentry may be anonymous,
+ * or may have a full name (if the inode was already in the cache).
+ *
+ * When called on a directory inode, we must ensure that the inode only ever
+ * has one dentry.  If a dentry is found, that is returned instead of
+ * allocating a new one.
+ *
+ * On successful return, the reference to the inode has been transferred
+ * to the dentry.  In case of an error the reference on the inode is released.
+ * To make it easier to use in export operations a %NULL or IS_ERR inode may
+ * be passed in and the error will be propagated to the return value,
+ * with a %NULL @inode replaced by ERR_PTR(-ESTALE).
+ */
+struct dentry *d_obtain_alias(struct inode *inode)
+{
+   return __d_obtain_alias(inode, 1);
+}
 EXPORT_SYMBOL(d_obtain_alias);
 
 /**
+ * d_obtain_alias_root - find or allocate a dentry for a given inode
+ * @inode: inode to allocate the dentry for
+ *
+ * Obtain an IS_ROOT dentry for the root of a filesystem.
+ *
+ * We must ensure that directory inodes only ever have one dentry.  If a
+ * dentry is found, that is returned instead

[PATCH 8/9] exportfs: update Exporting documentation

2014-02-18 Thread J. Bruce Fields
From: J. Bruce Fields bfie...@redhat.com

Minor documentation updates:
- refer to d_obtain_alias rather than d_alloc_anon
- explain when to use d_splice_alias and when
  d_materialise_unique.
- cut some details of d_splice_alias/d_materialise_unique
  implementation.

Signed-off-by: J. Bruce Fields bfie...@redhat.com
---
 Documentation/filesystems/nfs/Exporting | 38 -
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/Documentation/filesystems/nfs/Exporting 
b/Documentation/filesystems/nfs/Exporting
index e543b1a..9b7de5b 100644
--- a/Documentation/filesystems/nfs/Exporting
+++ b/Documentation/filesystems/nfs/Exporting
@@ -66,23 +66,31 @@ b/ A per-superblock list s_anon of dentries which are the 
roots of
 
 c/ Helper routines to allocate anonymous dentries, and to help attach
loose directory dentries at lookup time. They are:
-d_alloc_anon(inode) will return a dentry for the given inode.
+d_obtain_alias(inode) will return a dentry for the given inode.
   If the inode already has a dentry, one of those is returned.
   If it doesn't, a new anonymous (IS_ROOT and
 DCACHE_DISCONNECTED) dentry is allocated and attached.
   In the case of a directory, care is taken that only one dentry
   can ever be attached.
-d_splice_alias(inode, dentry) will make sure that there is a
-  dentry with the same name and parent as the given dentry, and
-  which refers to the given inode.
-  If the inode is a directory and already has a dentry, then that
-  dentry is d_moved over the given dentry.
-  If the passed dentry gets attached, care is taken that this is
-  mutually exclusive to a d_alloc_anon operation.
-  If the passed dentry is used, NULL is returned, else the used
-  dentry is returned.  This corresponds to the calling pattern of
-  -lookup.
-  
+d_splice_alias(inode, dentry) or d_materialise_unique(dentry, inode)
+  will introduce a new dentry into the tree; either the passed-in
+  dentry or a preexisting alias for the given inode (such as an
+  anonymous one created by d_obtain_alias), if appropriate.  The two
+  functions differ in their handling of directories with preexisting
+  aliases:
+d_splice_alias will use any existing IS_ROOT dentry, but it will
+ return -EIO rather than try to move a dentry with a different
+ parent.  This is appropriate for local filesystems, which
+ should never see such an alias unless the filesystem is
+ corrupted somehow (for example, if two on-disk directory
+ entries refer to the same directory.)
+   d_obtain_alias will attempt to move any dentry.  This is
+ appropriate for distributed filesystems, where finding a
+ directory other than where we last cached it may be a normal
+ consequence of concurrent operations on other hosts.
+  Both functions return NULL when the passed-in dentry is used,
+  following the calling convention of -lookup.
+
  
 Filesystem Issues
 -
@@ -120,12 +128,12 @@ struct which has the following members:
 
   fh_to_dentry (mandatory)
 Given a filehandle fragment, this should find the implied object and
-create a dentry for it (possibly with d_alloc_anon).
+create a dentry for it (possibly with d_obtain_alias).
 
   fh_to_parent (optional but strongly recommended)
 Given a filehandle fragment, this should find the parent of the
-implied object and create a dentry for it (possibly with d_alloc_anon).
-May fail if the filehandle fragment is too small.
+implied object and create a dentry for it (possibly with
+d_obtain_alias).  May fail if the filehandle fragment is too small.
 
   get_parent (optional but strongly recommended)
 When given a dentry for a directory, this should return  a dentry for
-- 
1.8.5.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol

2014-02-14 Thread J. Bruce Fields
On Fri, Feb 14, 2014 at 01:43:48PM -0500, Josef Bacik wrote:
 A user was running into errors from an NFS export of a subvolume that had a
 default subvol set.  When we mount a default subvol we will use 
 d_obtain_alias()
 to find an existing dentry for the subvolume in the case that the root subvol
 has already been mounted, or a dummy one is allocated in the case that the 
 root
 subvol has not already been mounted.  This allows us to connect the dentry 
 later
 on if we wander into the path.  However if we don't ever wander into the path 
 we
 will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It 
 doesn't
 appear to cause any problems but it is annoying nonetheless, so simply unset
 DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
 use d_materialise_unique() instead which will make everything play nicely
 together and reconnect stuff if we wander into the defaul subvol path from a
 different way.  With this patch I'm no longer getting the NFS errors when
 exporting a volume that has been mounted with a default subvol set.  Thanks,

Looks obviously correct, but based on a quick grep, there are four
d_obtain_alias callers outside export methods:

- btrfs/super.c:get_default_root()
- fs/ceph/super.c:open_root_dentry()
- fs/nfs/getroot.c:nfs_get_root()
- fs/nilfs2/super.c:nilfs_get_root_dentry()

It'd be nice to give them a common d_obtain_alias variant instead of
making them all clear this by hand.

Of those nilfs2 also uses d_splice_alias.  I think that problem would
best be solved by fixing d_splice_alias not to require a
DCACHE_DISCONNECTED dentry; IS_ROOT() on its own should be fine.

--b.

 
 cc: bfie...@fieldses.org
 cc: ebied...@xmission.com
 Signed-off-by: Josef Bacik jba...@fb.com
 ---
  fs/btrfs/inode.c | 2 +-
  fs/btrfs/super.c | 9 -
  2 files changed, 9 insertions(+), 2 deletions(-)
 
 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index 197edee..8dba152 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -5157,7 +5157,7 @@ static struct dentry *btrfs_lookup(struct inode *dir, 
 struct dentry *dentry,
   return ERR_CAST(inode);
   }
  
 - return d_splice_alias(inode, dentry);
 + return d_materialise_unique(dentry, inode);
  }
  
  unsigned char btrfs_filetype_table[] = {
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 147ca1d..dc0a315 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -855,6 +855,7 @@ static struct dentry *get_default_root(struct super_block 
 *sb,
   struct btrfs_path *path;
   struct btrfs_key location;
   struct inode *inode;
 + struct dentry *dentry;
   u64 dir_id;
   int new = 0;
  
 @@ -925,7 +926,13 @@ setup_root:
   return dget(sb-s_root);
   }
  
 - return d_obtain_alias(inode);
 + dentry = d_obtain_alias(inode);
 + if (!IS_ERR(dentry)) {
 + spin_lock(dentry-d_lock);
 + dentry-d_flags = ~DCACHE_DISCONNECTED;
 + spin_unlock(dentry-d_lock);
 + }
 + return dentry;
  }
  
  static int btrfs_fill_super(struct super_block *sb,
 -- 
 1.8.3.1
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol

2014-02-14 Thread J. Bruce Fields
On Fri, Feb 14, 2014 at 05:40:55PM -0800, Eric W. Biederman wrote:
 J. Bruce Fields bfie...@fieldses.org writes:
 
  On Fri, Feb 14, 2014 at 01:43:48PM -0500, Josef Bacik wrote:
  A user was running into errors from an NFS export of a subvolume that had a
  default subvol set.  When we mount a default subvol we will use 
  d_obtain_alias()
  to find an existing dentry for the subvolume in the case that the root 
  subvol
  has already been mounted, or a dummy one is allocated in the case that the 
  root
  subvol has not already been mounted.  This allows us to connect the dentry 
  later
  on if we wander into the path.  However if we don't ever wander into the 
  path we
  will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It 
  doesn't
  appear to cause any problems but it is annoying nonetheless, so simply 
  unset
  DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() 
  to
  use d_materialise_unique() instead which will make everything play nicely
  together and reconnect stuff if we wander into the defaul subvol path from 
  a
  different way.  With this patch I'm no longer getting the NFS errors when
  exporting a volume that has been mounted with a default subvol set.  
  Thanks,
 
  Looks obviously correct, but based on a quick grep, there are four
  d_obtain_alias callers outside export methods:
 
  - btrfs/super.c:get_default_root()
  - fs/ceph/super.c:open_root_dentry()
  - fs/nfs/getroot.c:nfs_get_root()
  - fs/nilfs2/super.c:nilfs_get_root_dentry()
 
  It'd be nice to give them a common d_obtain_alias variant instead of
  making them all clear this by hand.
 
 I am in favor of one small fix at a time, so that progress is made and
 fixing something just for btrfs seems reasonable for the short term.
 
  Of those nilfs2 also uses d_splice_alias.  I think that problem would
  best be solved by fixing d_splice_alias not to require a
  DCACHE_DISCONNECTED dentry; IS_ROOT() on its own should be fine.
 
 You mean by renaming d_splice_alias d_materialise_unique?
 
 Or is there a useful distinction you see that should be preserved
 between the two methods?
 
 Right now my inclination is that everyone should just use
 d_materialise_unique and we should kill d_splice_alias.

Probably.  One remaining distinction:

- In the local filesystem case if you discover a directory is
  already aliased elsewhere, you have a corrupted filesystem and
  want to error out the lookup.  (Didn't you propose a patch to
  do something like that before?)
- In the distributed filesystem this is perfectly normal and we
  want to do our best to fix up our local cache to represent
  remote reality.

 And by everyone I mean all file systems that are either distributed
 (implementing d_revalidate) or exportable by knfsd.
 
 One of the interesting things that d_materialise_unique does is get the
 lazy rename case correct for a distributed filesystem.
 check_submounts_and_drop can drop a directory when it is found not to be
 accessible by that name, but later when we look it up
 d_materialise_uniuqe will resuscciate the existing dentry.

OK.  I'm not sure I understand how that helps.

Ugly untested draft follows.

--b.

diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce..b4572fa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1905,58 +1905,6 @@ struct dentry *d_obtain_alias(struct inode *inode)
 EXPORT_SYMBOL(d_obtain_alias);
 
 /**
- * d_splice_alias - splice a disconnected dentry into the tree if one exists
- * @inode:  the inode which may have a disconnected dentry
- * @dentry: a negative dentry which we want to point to the inode.
- *
- * If inode is a directory and has a 'disconnected' dentry (i.e. IS_ROOT and
- * DCACHE_DISCONNECTED), then d_move that in place of the given dentry
- * and return it, else simply d_add the inode to the dentry and return NULL.
- *
- * This is needed in the lookup routine of any filesystem that is exportable
- * (via knfsd) so that we can build dcache paths to directories effectively.
- *
- * If a dentry was found and moved, then it is returned.  Otherwise NULL
- * is returned.  This matches the expected return value of -lookup.
- *
- * Cluster filesystems may call this function with a negative, hashed dentry.
- * In that case, we know that the inode will be a regular file, and also this
- * will only occur during atomic_open. So we need to check for the dentry
- * being already hashed only in the final case.
- */
-struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
-{
-   struct dentry *new = NULL;
-
-   if (IS_ERR(inode))
-   return ERR_CAST(inode);
-
-   if (inode  S_ISDIR(inode-i_mode)) {
-   spin_lock(inode-i_lock);
-   new = __d_find_alias(inode, 1);
-   if (new) {
-   BUG_ON(!(new-d_flags  DCACHE_DISCONNECTED));
-   spin_unlock(inode-i_lock

Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support

2013-05-15 Thread J. Bruce Fields
On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote:
 This crude patch illustrates the simplest plumbing involved in
 supporting sys_call_range with the NFS COPY operation that's pending in
 the 4.2 draft spec.
 
 The patch is based on a previous prototype that used the COPY op to
 implement sys_copyfileat which created a new file (based on the ocfs2
 reflink ioctl).  By contrast, this copies file contents between existing
 files.
 
 There's still a lot of implementation and testing to do, but this can
 get discussion going.

I'm using:

git://github.com/loghyr/NFSv4.2

as my reference for the draft protocol.

On a quick skim, one thing this is missing before it complies is a
client implementation of CB_OFFLOAD: If a client desires an
intra-server file copy, then it MUST support the COPY and CB_OFFLOAD
operations.

The server doesn't have to implement CB_OFFLOAD, though, so we should
ditch these todo's:

 +/*
 + * XXX:
 + *  - do something with stateids :)
 + *  - implement callback results and OFFLOAD_ABORT
 + *  - inter-server copies?
 + */

...

 + /* don't support async callbacks yet */

...

lest someone go try to implement them for no reason.  (Stranger things
have happened.)

Nits, possibly to ignore for now:

 + copy-u.ok.cr_callback_id_length = 0;
 +
 + return status;
 +}
 +
  /* This routine never returns NFS_OK!  If there are no other errors, it
   * will return NFSERR_SAME or NFSERR_NOT_SAME depending on whether the
   * attributes matched.  VERIFY is implemented by mapping NFSERR_SAME
 @@ -1798,6 +1829,10 @@ static struct nfsd4_operation nfsd4_ops[] = {
   .op_get_currentstateid = (stateid_getter)nfsd4_get_freestateid,
   .op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
   },
 + [OP_COPY] = {
 + .op_func = (nfsd4op_func)nfsd4_copy,
 + .op_name = OP_COPY,
 + },

There's some more boilerplate to fill in (see other ops).

 +static __be32
  nfsd4_decode_noop(struct nfsd4_compoundargs *argp, void *p)
  {
   return nfs_ok;
 @@ -1557,6 +1577,7 @@ static nfsd4_dec nfsd41_dec_ops[] = {
   [OP_WANT_DELEGATION]= (nfsd4_dec)nfsd4_decode_notsupp,
   [OP_DESTROY_CLIENTID]   = (nfsd4_dec)nfsd4_decode_destroy_clientid,
   [OP_RECLAIM_COMPLETE]   = (nfsd4_dec)nfsd4_decode_reclaim_complete,
 + [OP_COPY]   = (nfsd4_dec)nfsd4_decode_copy,

And this should be made 4.2-specific.

  };
  
  struct nfsd4_minorversion_ops {
 @@ -3394,6 +3415,27 @@ nfsd4_encode_test_stateid(struct nfsd4_compoundres 
 *resp, __be32 nfserr,
  }
  
  static __be32
 +nfsd4_encode_copy(struct nfsd4_compoundres *resp, __be32 nfserr,
 + struct nfsd4_copy *copy)
 +{
 + __be32 *p;
 +
 + if (!nfserr) {
 + RESERVE_SPACE(4);
 + WRITE32(copy-u.ok.cr_callback_id_length);
 + ADJUST_ARGS();
 + if (copy-u.ok.cr_callback_id_length == 1)
 + nfsd4_encode_stateid(resp, copy-u.ok.cr_callback_id);
 + } else {
 + RESERVE_SPACE(8);
 + WRITE64(copy-u.cr_bytes_copied);
 + ADJUST_ARGS();
 + }
 +
 + return nfserr;
 +}
 +
 +static __be32
  nfsd4_encode_noop(struct nfsd4_compoundres *resp, __be32 nfserr, void *p)
  {
   return nfserr;
 @@ -3465,6 +3507,7 @@ static nfsd4_enc nfsd4_enc_ops[] = {
   [OP_WANT_DELEGATION]= (nfsd4_enc)nfsd4_encode_noop,
   [OP_DESTROY_CLIENTID]   = (nfsd4_enc)nfsd4_encode_noop,
   [OP_RECLAIM_COMPLETE]   = (nfsd4_enc)nfsd4_encode_noop,
 + [OP_COPY]   = (nfsd4_enc)nfsd4_encode_copy,
  };
  
  /*
 diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
 index 84ce601..0c1b427 100644
 --- a/fs/nfsd/vfs.c
 +++ b/fs/nfsd/vfs.c
 @@ -28,6 +28,8 @@
  #include asm/uaccess.h
  #include linux/exportfs.h
  #include linux/writeback.h
 +#include linux/fs_struct.h
 +#include linux/kmod.h
  
  #ifdef CONFIG_NFSD_V3
  #include xdr3.h
 @@ -621,6 +623,45 @@ int nfsd4_is_junction(struct dentry *dentry)
   return 0;
   return 1;
  }
 +
 +__be32
 +nfsd_copy_range(struct svc_rqst *rqstp, struct svc_fh *fhp_in, u64 pos_in,
 + struct svc_fh *fhp_out, u64 pos_out, u64 count)
 +{
 + struct file *filp_in = NULL;
 + struct file *filp_out = NULL;
 + int err;
 +
 + /* XXX verify pos and count within sane limits? */
 +
 + err = nfsd_open(rqstp, fhp_in, S_IFREG, NFSD_MAY_READ, filp_in);
 + if (err)
 + goto out;
 +
 + err = nfsd_open(rqstp, fhp_out, S_IFREG, NFSD_MAY_WRITE, filp_out);
 + if (err)
 + goto out;

Looking at the xdr... the COPY operation takes stateid's, which nfsd can
use to look up files, so the opens shouldn't be required.

--b.

 +
 + err = vfs_copy_range(filp_in, pos_in, filp_out, pos_out, count);
 + /* fall back if .copy_range isn't supported */
 +
 + if (!err  EX_ISSYNC(fhp_out-fh_export))
 + err = vfs_fsync_range(filp_out, pos_out, pos_out + count-1, 

Re: [RFC v0 4/4] nfs, nfsd: rough sys_copy_range and COPY support

2013-05-15 Thread J. Bruce Fields
On Wed, May 15, 2013 at 08:21:54PM +, Myklebust, Trond wrote:
 On Wed, 2013-05-15 at 16:19 -0400, J. Bruce Fields wrote:
  On Tue, May 14, 2013 at 02:15:26PM -0700, Zach Brown wrote:
   This crude patch illustrates the simplest plumbing involved in
   supporting sys_call_range with the NFS COPY operation that's pending in
   the 4.2 draft spec.
   
   The patch is based on a previous prototype that used the COPY op to
   implement sys_copyfileat which created a new file (based on the ocfs2
   reflink ioctl).  By contrast, this copies file contents between existing
   files.
   
   There's still a lot of implementation and testing to do, but this can
   get discussion going.
  
  I'm using:
  
  git://github.com/loghyr/NFSv4.2
  
  as my reference for the draft protocol.
  
  On a quick skim, one thing this is missing before it complies is a
  client implementation of CB_OFFLOAD: If a client desires an
  intra-server file copy, then it MUST support the COPY and CB_OFFLOAD
  operations.
 
 Note that Bryan is currently working on updating the NFS implementation
 to match the draft protocol.

OK, good.--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [PATCH 2/2] Btrfs: move over to use -update_time

2012-04-09 Thread J. Bruce Fields
On Wed, Apr 04, 2012 at 02:16:22PM -0400, Josef Bacik wrote:
 On Wed, Apr 04, 2012 at 09:12:57PM +0300, Kasatkin, Dmitry wrote:
  On Wed, Apr 4, 2012 at 8:47 PM, Mimi Zohar zo...@linux.vnet.ibm.com wrote:
   On Wed, 2012-04-04 at 13:43 -0400, Josef Bacik wrote:
   On Wed, Apr 04, 2012 at 08:24:19PM +0300, Kasatkin, Dmitry wrote:
Hello,
   
Mimi and I working on IMA/EVM (security/integrity) and it uses
i_version for checking if file content has been changed.
extX file systems support i_version updates with mounting file system
with iversion option or via kernel command line parameter
i_version
   
It seems iversion option is not recognized when mounting btrfs.
I see this patchset deals with i_version update as well..
Can you please give an advice how to use i_version with btrfs?
   
  
   Oh good somebody uses this?  We actually have a -sequence thing we use 
   for
   this, the grand idea was to make it smarter about telling nfs when 
   something
   changed, but if you guys use i_version we could probably get rid of our 
   in-core
   sequence and use the normal inodes i_version and then just store it in 
   our
   sequence field on disk.  I'll do it without a mount option tho so it 
   just works,
   does that sound good to you?  Thanks,
  
  Hello,
  
  Thank you for the answer...
  But can you a bit clarify...
  
  Looking to file_update_time() I see that it does:
  
  if (IS_I_VERSION(inode))
  sync_it |= S_VERSION;
  
  Basically it should be (inode-i_sb-s_flags  MS_I_VERSION)
  
  use of i_version is controlled by iversion mount flag.
  for ext4 I see in parse_options():
  
  case Opt_i_version:
  set_opt(sb, I_VERSION);
  sb-s_flags |= MS_I_VERSION;
  break;
  
  
  But who sets MS_I_VERSION in s_flags on btrfs?
  
 
 Nobody yet, I'm going to send a patch shortly that will support this.  Thanks,

Great.  It would also be far preferable if it was just always on (at
least by default) rather than requiring a mount option.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-09 Thread J. Bruce Fields
On Wed, Dec 08, 2010 at 09:41:33PM -0700, Andreas Dilger wrote:
 On 2010-12-08, at 16:07, Neil Brown wrote:
  On Mon, 6 Dec 2010 11:48:45 -0500 J. Bruce Fields bfie...@redhat.com
  wrote:
  
  On Fri, Dec 03, 2010 at 04:01:44PM -0700, Andreas Dilger wrote:
  Any chance we can add a -get_fsid(sb, inode) method to
  export_operations (or something simiar), that allows the
  filesystem to generate an FSID based on the volume and
  inode that is being exported?
  
  No objection from here.
  
  My standard objection here is that you cannot guarantee that the
  fsid is 100% guarantied to be unique across all filesystems in
  the system (including filesystems mounted from dm snapshots of
  filesystems that are currently mounted).  NFSd needs this uniqueness.
 
 Sure, but you also cannot guarantee that the devno is constant across 
 reboots, yet NFS continues to use this much-less-constant value...
 
  This is only really an objection if user-space cannot over-ride
  the fsid provided by the filesystem.
 
 Agreed.  It definitely makes sense to allow this, for whatever strange 
 circumstances might arise.  However, defaulting to using the filesystem UUID 
 definitely makes the most sense, and looking at the nfs-utils mountd code, it 
 seems that this is already standard behaviour for local block devices 
 (excluding btrfs filesystems).
 
  I'd be very happy to see an interface to user-space whereby
  user-space can get a reasonably unique fsid for a given
  filesystem.
 
 Hmm, maybe I'm missing something, but why does userspace need to be able to 
 get this value?  I would think that nfsd gets it from the filesystem directly 
 in the kernel, but if a uuid= option is present in the exports file that is 
 preferentially used over the value from the filesystem.

Well, the kernel can't distinguish the case of an explicit uuid=
option in /etc/exports from one that was (as is the normal default)
generated automatically by mountd.  Maybe not a big deal.

The uuid seems like a useful thing to have access to from userspace
anyway, for userspace nfs servers if for no other reason:

 That said, I think Aneesh's open_by_handle patchset also made the UUID 
 visible in /proc/pid/mountinfo, after the filesystems stored it in
 sb-s_uuid at mount time.  That _should_ make it visible for non-block 
 mountpoints as well, assuming they fill in s_uuid.
 
  Whether this is an export_operations method or some field in the
  'struct super' which gets copied out doesn't matter to me.
 
 Since Aneesh has already developed patches, is there any objection to using 
 those (last sent to linux-fsdevel on 2010-10-29):
 
 [PATCH -V22 12/14] vfs: Export file system uuid via /proc/pid/mountinfo
 [PATCH -V22 13/14] ext3: Copy fs UUID to superblock.
 [PATCH -V22 14/14] ext4: Copy fs UUID to superblock

I can't see anything wrong with that.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-08 Thread J. Bruce Fields
On Wed, Dec 08, 2010 at 10:16:29AM -0700, Andreas Dilger wrote:
 On 2010-12-07, at 10:02, Trond Myklebust wrote:
 
  On Tue, 2010-12-07 at 17:51 +0100, Christoph Hellwig wrote:
  It's just as stable as a real dev_t in the times of hotplug and udev.
  As long as you don't touch anything including not upgrading the kernel
  it's remain stable, otherwise it will break.  That's why modern
  nfs-utils default to using the uuid-based filehandle schemes instead of
  the dev_t based ones.  At least that's what I told - I really hope it's
  using the real UUIDs from the filesystem and not the horrible fsid hack
  that was once added - for some filesystems like XFS that field does not
  actually have any relation to the UUID historically.  And while we
  could have changed that it's too late now that nfs was hacked into
  abusing that field.
  
  IIRC, NFS uses the full true uuid for NFSv3 and NFSv4 filehandles, but
  they won't fit into the NFSv2 32-byte filehandles, so there is an
  '8-byte fsid' and '4-byte fsid + inode number' workaround for that...
  
  See the mk_fsid() helper in fs/nfsd/nfsfh.h
 
 It looks like mk_fsid() is only actually using the UUID if it is specified in 
 the /etc/exports file (AFAICS, this depends on ex_uuid being set from a 
 uuid=... option).

No, if you look at the nfs-utils source you'll find mountd sets a uuid
by default (in utils/mountd/cache.c:uuid_by_path()).

 There was a patch in the open_by_handle() patch series that added an s_uuid 
 field to the superblock, that could be used if no uuid= option is specified 
 in the /etc/exports file.

Agreed that doing this in the kernel would probably be simpler.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-07 Thread J. Bruce Fields
On Tue, Dec 07, 2010 at 05:52:13PM +0100, hch wrote:
 On Fri, Dec 03, 2010 at 05:45:26PM -0500, J. Bruce Fields wrote:
  We're using statfs64.fs_fsid for this; I believe that's both stable
  across reboots and distinguishes between subvolumes, so that's OK.
 
 It's a field that doesn't have any useful specification and basically
 contains random garbage that a filesystem put into it.  Using it is a
 very bad idea.

I meant the above statement to apply only to btrfs; and nfs-utils is
using fs_fsid only in the case where the filesystem type is btrfs.  So
I believe the current code does work.

But I agree that constructing filehandles differently based on a
strcmp() of the filesystem type is not a sustainable design, to say the
least.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread J. Bruce Fields
On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
 So now that I've actually looked at everything, it looks like the semantics 
 are
 all right for subvolumes
 
 1) readdir - we return the root id in d_ino, which is unique across the fs

Though Michael Vrable pointed out an apparent collision with normal
inode numbers on the parent filesystem?

 2) stat - we return 256 for all subvolumes, because that is their inode number
 3) dev_t - we setup an anon super for all volumes, so they all get their own
 dev_t, which is set properly for all of their children, see below
 
 [r...@test1244 btrfs-test]# stat .
   File: `.'
   Size: 20  Blocks: 8  IO Block: 4096   directory
 Device: 15h/21d Inode: 256 Links: 1
 Access: (0555/dr-xr-xr-x)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:41.931679393 -0500
 Modify: 2010-12-03 15:35:20.405679493 -0500
 Change: 2010-12-03 15:35:20.405679493 -0500
 
 [r...@test1244 btrfs-test]# stat foo
   File: `foo'
   Size: 12  Blocks: 0  IO Block: 4096   directory
 Device: 19h/25d Inode: 256 Links: 1
 Access: (0700/drwx--)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:17.501679393 -0500
 Modify: 2010-12-03 15:35:59.150680051 -0500
 Change: 2010-12-03 15:35:59.150680051 -0500
 
 [r...@test1244 btrfs-test]# stat foo/foobar 
   File: `foo/foobar'
   Size: 0   Blocks: 0  IO Block: 4096   regular empty file
 Device: 19h/25d Inode: 257 Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 Access: 2010-12-03 15:35:59.150680051 -0500
 Modify: 2010-12-03 15:35:59.150680051 -0500
 Change: 2010-12-03 15:35:59.150680051 -0500
 
 So as far as the user is concerned, everything should come out right.  
 Obviously
 we had to do the NFS trickery still because as far as VFS is concerned the
 subvolumes are all on the same mount.  So the question is this (and really 
 this
 is directed at Christoph and Bruce and anybody else who may care), is this 
 good
 enough, or do we want to have a seperate vfsmount for each subvolume?  Thanks,

For nfsd's purposes, we need to be able find out about filesystems in
two different ways:

1. Lookup by filehandle: we need to be able to identify which
subvolume we're dealing with from a filehandle.
2. Lookup by path: we need to notice when we cross into a
subvolume.

Looks like #1 already works.  Not #2: the current nfsd code just checks
for mountpoints.  We could modify nfsd to also check whether dev_t
changed each time it did a lookup.  I suppose it would work, though it's
annoying to have to do it just for the case of btrfs.

As far as I can tell, crossing into a subvolume is like crossing a
mountpoint in every way except for the lack of a separate vfsmount.  I'd
worry that the inconsistency will end up requiring more special cases
down the road, but I don't have any in mind.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-03 Thread J. Bruce Fields
On Fri, Dec 03, 2010 at 05:29:24PM -0500, Chris Mason wrote:
 Excerpts from Dave Chinner's message of 2010-12-03 17:27:56 -0500:
  On Fri, Dec 03, 2010 at 04:45:27PM -0500, Josef Bacik wrote:
   On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
Hello,

Various people have complained about how BTRFS deals with subvolumes 
recently,
specifically the fact that they all have the same inode number, and 
there's no
discrete seperation from one subvolume to another.  Christoph asked 
that I lay
out a basic design document of how we want subvolumes to work so we can 
hash
everything out now, fix what is broken, and then move forward with a 
design that
everybody is more or less happy with.  I apologize in advance for how 
freaking
long this email is going to be.  I assume that most people are generally
familiar with how BTRFS works, so I'm not going to bother explaining in 
great
detail some stuff.

  
are things that cannot be fixed now.  Some of these changes will require
incompat format changes, but it's either we fix it now, or later on 
down the
road when BTRFS starts getting used in production really find out how 
many
things our current scheme breaks and then have to do the changes then.  
Thanks,

   
   So now that I've actually looked at everything, it looks like the 
   semantics are
   all right for subvolumes
   
   1) readdir - we return the root id in d_ino, which is unique across the fs
   2) stat - we return 256 for all subvolumes, because that is their inode 
   number
   3) dev_t - we setup an anon super for all volumes, so they all get their 
   own
   dev_t, which is set properly for all of their children, see below
  
  A property of NFS fileshandles is that they must be stable across
  server reboots. Is this anon dev_t used as part of the NFS
  filehandle and if so how can you guarantee that it is stable?
 
 It isn't today, that's something we'll have to address.

We're using statfs64.fs_fsid for this; I believe that's both stable
across reboots and distinguishes between subvolumes, so that's OK.

(That said, since fs_fsid doesn't work for other filesystems, we depend
on an explicit check for a filesystem type of btrfs, which is
awful--btrfs won't always be the only filesystem that wants to do this
kind of thing, etc.)

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 09:21:36AM -0500, Josef Bacik wrote:
 Hello,
 
 Various people have complained about how BTRFS deals with subvolumes recently,
 specifically the fact that they all have the same inode number, and there's no
 discrete seperation from one subvolume to another.  Christoph asked that I lay
 out a basic design document of how we want subvolumes to work so we can hash
 everything out now, fix what is broken, and then move forward with a design 
 that
 everybody is more or less happy with.  I apologize in advance for how freaking
 long this email is going to be.  I assume that most people are generally
 familiar with how BTRFS works, so I'm not going to bother explaining in great
 detail some stuff.
 
 === What are subvolumes? ===
 
 They are just another tree.  In BTRFS we have various b-trees to describe the
 filesystem.  A few of them are filesystem wide, such as the extent tree, chunk
 tree, root tree etc.  The tree's that hold the actual filesystem data, that is
 inodes and such, are kept in their own b-tree.  This is how subvolumes and
 snapshots appear on disk, they are simply new b-trees with all of the file 
 data
 contained within them.
 
 === What do subvolumes look like? ===
 
 All the user sees are directories.  They act like any other directory acts, 
 with
 a few exceptions
 
 1) You cannot hardlink between subvolumes.  This is because subvolumes have
 their own inode numbers and such, think of them as seperate mounts in this 
 case,
 you cannot hardlink between two mounts because the link needs to point to the
 same on disk inode, which is impossible between two different filesystems.  
 The
 same is true for subvolumes, they have their own trees with their own inodes 
 and
 inode numbers, so it's impossible to hardlink between them.

OK, so I'm unclear: would it be possible for nfsd to export subvolumes
independently?

For that to work, we need to be able to take an inode that we just
looked up by filehandle, and see which subvolume it belongs in.  So if
two subvolumes can point to the same inode, it doesn't work, but if
st_dev is different between them, e.g., that'd be fine.  Sounds like
you're seeing the latter is possible, good!

 
 1a) In case it wasn't clear from above, each subvolume has their own inode
 numbers, so you can have the same inode numbers used between two different
 subvolumes, since they are two different trees.
 
 2) Obviously you can't just rm -rf subvolumes.  Because they are roots there's
 extra metadata to keep track of them, so you have to use one of our ioctls to
 delete subvolumes/snapshots.
 
 But permissions and everything else they are the same.
 
 There is one tricky thing.  When you create a subvolume, the directory inode
 that is created in the parent subvolume has the inode number of 256.

Is that the right way to say this?  Doing a quick test, the inode
numbers that a readdir of the parent directory returns *are* distinct.
It's just the inode number that you get when you stat that is different.

Which is all fine and normal, *if* you treat this as a real mountpoint
with its own vfsmount, st_dev, etc.

 === How do we want subvolumes to work from a user perspective? ===
 
 1) Users need to be able to create their own subvolumes.  The permission
 semantics will be absolutely the same as creating directories, so I don't 
 think
 this is too tricky.  We want this because you can only take snapshots of
 subvolumes, and so it is important that users be able to create their own
 discrete snapshottable targets.
 
 2) Users need to be able to snapshot their subvolumes.  This is basically the
 same as #1, but it bears repeating.
 
 3) Subvolumes shouldn't need to be specifically mounted.  This is also
 important, we don't want users to have to go around mounting their subvolumes 
 up
 manually one-by-one.  Today users just cd into subvolumes and it works, just
 like cd'ing into a directory.

And the separate nfsd exports is another thing I'd really love to see
work: currently you can export a subtree of a filesystem if you want,
but it's trivial to escape the subtree by guessing filehandles.  So this
gives us an easy way for administrators to create secure separate
exports without having to manage entirely separate volumes.

If subvolumes got real mountpoints and so on, this would be easy.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
 Oh well crud, I was hoping that I could leave the inode numbers as 256 for
 everything, but I forgot about readdir.  So the inode item in the parent would
 have to have a unique inode number that would get spit out in readdir, but 
 then
 if we stat'ed the directory we'd get 256 for the inode number.  Oh well,
 incompat flag it is then.

I think you're already fine:

# mkdir TMP
# dd if=/dev/zero of=TMP-image bs=1M count=512
# mkfs.btrfs TMP-image
# mount -oloop TMP-image TMP/
# btrfs subvolume create sub-a
# btrfs subvolume create sub-b
../readdir-inos .
. 256 256
.. 256 4130609
sub-a 256 256
sub-b 257 256

Where readdir-inos is my silly test program below, and the first number is from
readdir, the second from stat.

?

--b.

#include stdio.h
#include err.h
#include sys/types.h
#include sys/stat.h
#include unistd.h
#include dirent.h

/* demonstrate that for mountpoints, readdir ino of mounted-on
 * directory, stat returns ino of mounted directory. */

int main(int argc, char *argv[])
{
struct dirent *de;
int ret;
DIR *d;

if (argc != 2)
errx(1, usage: %s directory, argv[0]);
ret = chdir(argv[1]);
if (ret)
errx(1, chdir /);
d = opendir(.);
if (!d)
errx(1, opendir .);
while (de = readdir(d)) {
struct stat st;

ret = stat(de-d_name, st);
if (ret)
errx(1, stat %s, de-d_name);
printf(%s %d %d\n, de-d_name, de-d_ino, st.st_ino);
}
}

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What to do about subvolumes?

2010-12-01 Thread J. Bruce Fields
On Wed, Dec 01, 2010 at 03:09:52PM -0500, Josef Bacik wrote:
 On Wed, Dec 01, 2010 at 03:00:08PM -0500, J. Bruce Fields wrote:
  On Wed, Dec 01, 2010 at 02:54:33PM -0500, Josef Bacik wrote:
   Oh well crud, I was hoping that I could leave the inode numbers as 256 for
   everything, but I forgot about readdir.  So the inode item in the parent 
   would
   have to have a unique inode number that would get spit out in readdir, 
   but then
   if we stat'ed the directory we'd get 256 for the inode number.  Oh well,
   incompat flag it is then.
  
  I think you're already fine:
  
  # mkdir TMP
  # dd if=/dev/zero of=TMP-image bs=1M count=512
  # mkfs.btrfs TMP-image
  # mount -oloop TMP-image TMP/
  # btrfs subvolume create sub-a
  # btrfs subvolume create sub-b
  ../readdir-inos .
  . 256 256
  .. 256 4130609
  sub-a 256 256
  sub-b 257 256
  
  Where readdir-inos is my silly test program below, and the first number is 
  from
  readdir, the second from stat.
 
 
 Heh as soon as I typed my email I went and actually looked at the code, looks
 like for readdir we fill in the root id, which will be unique, so hotdamn we 
 are
 good and I don't have to use a stupid incompat flag.  Thanks for checking that
 :),

My only complaint was just about how you said this:

When you create a subvolume, the directory inode that is
created in the parent subvolume has the inode number of 256

If you revise that you might want to clarify.  (Maybe Every subvolume
has a root directory inode with inode number 256?)

The way you've stated it sounds like you're talking about the
readdir-returned number, which would normally come from the inode that
has been covered up by the mount, and which really is an inode in the
parent filesystem

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Improve support for exporting btrfs subvolumes.

2010-06-23 Thread J. Bruce Fields
On Thu, Jun 17, 2010 at 02:54:01PM +1000, Neil Brown wrote:
 
 If you export two subvolumes of a btrfs filesystem, they will both be
 given the same uuid so lookups will be confused.
 blkid cannot differentiate the two, so we must use the fsid from
 statfs64 to identify the filesystem.
 
 We cannot tell if blkid or statfs is best without knowing internal
 details of the filesystem in question, so we need to encode specific
 knowledge of btrfs in mountd.  This is unfortunate.
 
 To ensure smooth handling of this and possible future changes in uuid
 generation, we add infrastructure for multiple different uuids to be
 recognised on old filehandles, but only the preferred on is used on
 new filehandles.

Could you just contatenate the two (or hash them somehow)?  Or does that
just use up too much space in the filehandle?

--b.

 
 Signed-off-by: NeilBrown ne...@suse.de
 
 --
 This is a substantially revised version of a patch I posted a while
 ago.
 I tried to find a way to do it would hard coding knowledge of btrfs in
 nfs-utils, but it isn't possible.  For some filesystems, f_fsid is
 best, for some it is worst.  No way to tell the difference.
 
 This patch add infrastructure so that if we find a better way to get a
 good uuid (e.g. a new syscall), we can slot it in for new filehandles,
 but old filehandles using the old uuid will still work.
 
 I believe this is ready for inclusion upstream.
 Thanks,
 NeilBrown
 
 
 diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
 index caef5b2..85cd829 100644
 --- a/utils/mountd/cache.c
 +++ b/utils/mountd/cache.c
 @@ -170,13 +170,16 @@ void auth_unix_gid(FILE *f)
  #if USE_BLKID
  static const char *get_uuid_blkdev(char *path)
  {
 + /* We set *safe if we know that we need the
 +  * fsid from statfs too.
 +  */
   static blkid_cache cache = NULL;
   struct stat stb;
   char *devname;
   blkid_tag_iterate iter;
   blkid_dev dev;
   const char *type;
 - const char *val = NULL;
 + const char *val, *uuid = NULL;
  
   if (cache == NULL)
   blkid_get_cache(cache, NULL);
 @@ -193,42 +196,29 @@ static const char *get_uuid_blkdev(char *path)
   iter = blkid_tag_iterate_begin(dev);
   if (!iter)
   return NULL;
 - while (blkid_tag_next(iter, type, val) == 0)
 + while (blkid_tag_next(iter, type, val) == 0) {
   if (strcmp(type, UUID) == 0)
 + uuid = val;
 + if (strcmp(type, TYPE) == 0 
 + strcmp(val, btrfs) == 0) {
 + uuid = NULL;
   break;
 + }
 + }
   blkid_tag_iterate_end(iter);
 - return val;
 + return uuid;
  }
  #else
  #define get_uuid_blkdev(path) (NULL)
  #endif
  
 -int get_uuid(char *path, char *uuid, int uuidlen, char *u)
 +int get_uuid(const char *val, int uuidlen, char *u)
  {
   /* extract hex digits from uuidstr and compose a uuid
* of the given length (max 16), xoring bytes to make
 -  * a smaller uuid.  Then compare with uuid
 +  * a smaller uuid.
*/
   int i = 0;
 - const char *val = NULL;
 - char fsid_val[17];
 -
 - if (path) {
 - val = get_uuid_blkdev(path);
 - if (!val) {
 - struct statfs64 st;
 -
 - if (statfs64(path, st))
 - return 0;
 - if (!st.f_fsid.__val[0]  !st.f_fsid.__val[1])
 - return 0;
 - snprintf(fsid_val, 17, %08x%08x,
 -  st.f_fsid.__val[0], st.f_fsid.__val[1]);
 - val = fsid_val;
 - }
 - } else {
 - val = uuid;
 - }
   
   memset(u, 0, uuidlen);
   for ( ; *val ; val++) {
 @@ -252,6 +242,60 @@ int get_uuid(char *path, char *uuid, int uuidlen, char 
 *u)
   return 1;
  }
  
 +int uuid_by_path(char *path, int type, int uuidlen, char *uuid)
 +{
 + /* get a uuid for the filesystem found at 'path'.
 +  * There are several possible ways of generating the
 +  * uuids (types).
 +  * Type 0 is used for new filehandles, while other types
 +  * may be used to interpret old filehandle - to ensure smooth
 +  * forward migration.
 +  * We return 1 if a uuid was found (and it might be worth 
 +  * trying the next type) or 0 if no more uuid types can be
 +  * extracted.
 +  */
 +
 + /* Possible sources of uuid are
 +  * - blkid uuid
 +  * - statfs64 uuid
 +  *
 +  * On some filesystems (e.g. vfat) the statfs64 uuid is simply an
 +  * encoding of the device that the filesystem is mounted from, so
 +  * it we be very bad to use that (as device numbers change).  blkid
 +  * must be preferred.
 +  * On other filesystems (e.g. btrfs) the statfs64 uuid contains
 +  * important info that the blkid uuid cannot contain:  This happens
 +  * when multiple subvolumes are 

Re: [PATCH] Improve support for exporting btrfs subvolumes.

2010-06-23 Thread J. Bruce Fields
On Thu, Jun 24, 2010 at 07:31:57AM +1000, Neil Brown wrote:
 On Wed, 23 Jun 2010 14:28:38 -0400
 J. Bruce Fields bfie...@fieldses.org wrote:
 
  On Thu, Jun 17, 2010 at 02:54:01PM +1000, Neil Brown wrote:
   
   If you export two subvolumes of a btrfs filesystem, they will both be
   given the same uuid so lookups will be confused.
   blkid cannot differentiate the two, so we must use the fsid from
   statfs64 to identify the filesystem.
   
   We cannot tell if blkid or statfs is best without knowing internal
   details of the filesystem in question, so we need to encode specific
   knowledge of btrfs in mountd.  This is unfortunate.
   
   To ensure smooth handling of this and possible future changes in uuid
   generation, we add infrastructure for multiple different uuids to be
   recognised on old filehandles, but only the preferred on is used on
   new filehandles.
  
  Could you just contatenate the two (or hash them somehow)?  Or does that
  just use up too much space in the filehandle?
 
 I did consider xoring them together but came to the conclusion that would
 actually be a regression.
 If you look down at the comment that I included in uuid_by_path, you will see
 that some filesystems (e.g. VFAT) just use the major/minor device number for
 the f_fsid from statfs.  As you know that is not necessarily stable over
 reboots, while the UUID that blkid gives is.
 So if we always adding the two uuids somehow, it would be an improvement for
 btrfs, no change for e.g. ext3/XFS, and a regression for VFAT (and others I
 think).  Not good.

OK, got it.

--b.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html