Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-20 Thread NeilBrown
On Sun, May 20 2018, Kent Overstreet wrote:

> Jens - this series does the rest of the conversions that Christoph wanted, and
> drops bioset_create().
>
> Only lightly tested, but the changes are pretty mechanical. Based on your
> for-next tree.
>
> It's also in the for-jens branch at https://evilpiepirate.org/git/bcachefs.git
>
> Kent Overstreet (12):
>   block: convert bounce, q->bio_split to bioset_init()/mempool_init()
>   drbd: convert to bioset_init()/mempool_init()
>   pktcdvd: convert to bioset_init()/mempool_init()
>   lightnvm: convert to bioset_init()/mempool_init()
>   bcache: convert to bioset_init()/mempool_init()
>   md: convert to bioset_init()/mempool_init()

Hi Kent,
 this conversion looks really good, thanks for Ccing me on it.
 However as Shaohua Li is now the maintainer of md, it probably should
 have gone to him as well.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH v4 01/19] fs: new API for handling inode->i_version

2017-12-22 Thread NeilBrown
On Fri, Dec 22 2017, Jeff Layton wrote:

> From: Jeff Layton <jlay...@redhat.com>
>
> Add a documentation blob that explains what the i_version field is, how
> it is expected to work, and how it is currently implemented by various
> filesystems.
>
> We already have inode_inc_iversion. Add several other functions for
> manipulating and accessing the i_version counter. For now, the
> implementation is trivial and basically works the way that all of the
> open-coded i_version accesses work today.
>
> Future patches will convert existing users of i_version to use the new
> API, and then convert the backend implementation to do things more
> efficiently.
>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  fs/btrfs/file.c  |   1 +
>  fs/btrfs/inode.c |   1 +
>  fs/btrfs/ioctl.c |   1 +
>  fs/btrfs/xattr.c |   1 +
>  fs/ext4/inode.c  |   1 +
>  fs/ext4/namei.c  |   1 +
>  fs/inode.c   |   1 +
>  include/linux/fs.h   |  15 
>  include/linux/iversion.h | 205 +++
>  9 files changed, 212 insertions(+), 15 deletions(-)
>  create mode 100644 include/linux/iversion.h
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index eb1bac7c8553..c95d7b2efefb 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/iversion.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e1a7f3cb5be9..27f008b33fc1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/iversion.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 2ef8acaac688..aa452c9e2eff 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/iversion.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
> index 2c7e53f9ff1b..5258c1714830 100644
> --- a/fs/btrfs/xattr.c
> +++ b/fs/btrfs/xattr.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/iversion.h>
>  #include "ctree.h"
>  #include "btrfs_inode.h"
>  #include "transaction.h"
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7df2c5644e59..fa5d8bc52d2d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -39,6 +39,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/iversion.h>
>  
>  #include "ext4_jbd2.h"
>  #include "xattr.h"
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 798b3ac680db..bcf0dff517be 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/iversion.h>
>  #include "ext4.h"
>  #include "ext4_jbd2.h"
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 03102d6ef044..19e72f500f71 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -18,6 +18,7 @@
>  #include  /* for inode_has_buffers */
>  #include 
>  #include 
> +#include <linux/iversion.h>
>  #include 
>  #include "internal.h"
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 511fbaabf624..76382c24e9d0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2036,21 +2036,6 @@ static inline void inode_dec_link_count(struct inode *inode)
>   mark_inode_dirty(inode);
>  }
>  
> -/**
> - * inode_inc_iversion - increments i_version
> - * @inode: inode that need to be updated
> - *
> - * Every time the inode is modified, the i_version field will be incremented.
> - * The filesystem has to be mounted with i_version flag
> - */
> -
> -static inline void inode_inc_iversion(struct inode *inode)
> -{
> -   spin_lock(&inode->i_lock);
> -   inode->i_version++;
> -   spin_unlock(&inode->i_lock);
> -}
> -
>  enum file_time_flags {
>   S_ATIME = 1,
>   S_MTIME = 2,
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> new file mode 100644
> index ..bb50d27c71f9
> --- /dev/null
> +++ b/include/linux/iversion.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_IVERSION_H
> +#define _LINUX_IVERSION_H
> +
> +#include 
> +
> +/*
> + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> + * knfsd, but is also use

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-11 Thread NeilBrown
On Thu, May 11 2017, J. Bruce Fields wrote:

> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
>> On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
>> > 1) Keep i_version as is, make clients also check for i_ctime.
>> 
>> That would be a protocol revision, which we'd definitely rather avoid.
>> 
>> But can't we accomplish the same by using something like
>> 
>>  ctime * (some constant) + i_version
>> 
>> ?
>> 
>> >Pro: No on-disk format changes.
>> >Cons: After a crash, i_version can go backwards (but when file changes
>> >i_version, i_ctime pair should be still different) or not, data can be
>> >old or not.
>> 
>> This is probably good enough for NFS purposes: typically on an NFS
>> filesystem, results of a read in the face of a concurrent write open are
>> undefined.  And writers sync before close.
>> 
>> So after a crash with a dirty inode, we're in a situation where an NFS
>> client still needs to resend some writes, sync, and close.  I'm OK with
>> things being inconsistent during this window.
>> 
>> I do expect things to return to normal once that client has resent its
>> writes--hence the worry about actually reusing old values after boot
>> (such as if i_version regresses on boot and then increments back to the
>> same value after further writes).  Factoring in ctime fixes that.
>
> So for now I'm thinking of just doing something like the following.
>
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different.  (The NFSv4 client will want to use the server's change
> attribute instead, I think.  And other filesystems might want to try
> something more ambitious like Neil's proposal.)
>
> --b.
>
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_saved = false;
>  }
>  
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;
> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;

So if I chmod a file, all clients will need to flush the content from their 
cache?
Maybe they already do?  Maybe it is a boring corner case?

> +}
> +
>  /*
>   * Fill in the pre_op attr for the wcc data
>   */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_mtime = inode->i_mtime;
>   fhp->fh_pre_ctime = inode->i_ctime;
>   fhp->fh_pre_size  = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
>   fhp->fh_pre_saved = true;
>   }
>  }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>   printk("nfsd: inode locked twice during operation.\n");
>  
>   err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
>   if (err) {
>   fhp->fh_post_saved = false;
>   /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
>   *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
>   *p++ = 0;
>   } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
>   } else {
>   *p++ = cpu_to_be32(stat->ctime.tv_sec);
>   *p++ = cpu_to_be32(stat->ctime.tv_nsec);

It is *really* confusing to find that fh_post_change is only set in nfs3
code, and only used in nfs4 code.
It is probably time to get a 'version' field in 'struct kstat'.
That would allow this code to get a little cleaner.

(to me, this exercise is just a reminder that the NFSv4 change attribute
is poorly designed ... so it just makes me grumpy).

NeilBrown


> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-09 Thread NeilBrown
On Tue, May 09 2017, Jeff Layton wrote:

> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
>
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
>
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
>
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
>
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

I like that this is a separate lib/*.c - nicely structured too.

Reviewed-by: NeilBrown <ne...@suse.com>

Thanks,
NeilBrown


> ---
>  include/linux/errseq.h |  19 +
>  lib/Makefile   |   2 +-
>  lib/errseq.c   | 199 +
>  3 files changed, 219 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/errseq.h
>  create mode 100644 lib/errseq.c
>
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> new file mode 100644
> index ..0d2555f310cd
> --- /dev/null
> +++ b/include/linux/errseq.h
> @@ -0,0 +1,19 @@
> +#ifndef _LINUX_ERRSEQ_H
> +#define _LINUX_ERRSEQ_H
> +
> +/* See lib/errseq.c for more info */
> +
> +typedef u32  errseq_t;
> +
> +void __errseq_set(errseq_t *eseq, int err);
> +static inline void errseq_set(errseq_t *eseq, int err)
> +{
> + /* Optimize for the common case of no error */
> + if (unlikely(err))
> + __errseq_set(eseq, err);
> +}
> +
> +errseq_t errseq_sample(errseq_t *eseq);
> +int errseq_check(errseq_t *eseq, errseq_t since);
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..2423afef40f7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
>gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
>bsearch.o find_bit.o llist.o memweight.o kfifo.o \
>percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
> -  once.o refcount.o
> +  once.o refcount.o errseq.o
>  obj-y += string_helpers.o
>  obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
>  obj-y += hexdump.o
> diff --git a/lib/errseq.c b/lib/errseq.c
> new file mode 100644
> index ..0f8b4ed460f0
> --- /dev/null
> +++ b/lib/errseq.c
> @@ -0,0 +1,199 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * An errseq_t is a way of recording errors in one place, and allowing any
> + * number of "subscribers" to tell whether it has changed since an arbitrary
> + * time of their choosing.
> + *
> + * It's implemented as an unsigned 32-bit value. The low order bits are
> + * designated to hold an error code (between 0 and -MAX_ERRNO). The upper 
> bits
> + * are used as a counter. This is done with atomics instead of locking so 
> that
> + * these functions can be called from any context.
> + *
> + * The general idea is for consumers to sample an errseq_t value at a
> + * particular point in time. Later, that value can be used to tell whether 
> any
> + * new errors have occurred since that time.
> + *
> + * Note that there is a risk of collisions, if new errors are being recorded
> + * frequently, since we have so few bits to use as a counter.
> + *
> + * To mitigate this, one bit is used as a flag to tell whether the value has
> + * been sampled since a new value was recorded. That allows us to avoid 
> bumping
> + * the counter if no one has sampled it since the last time an error was
> + * recorded.
> + *
> + * A new errseq_t should always be zeroed out.  An errseq_t value of all 
> zeroes
> + * is the special (but common) case where there has never been an error. An 
> all
> + * zero value thus serves as the "epoch" if one wishes to know whether there
> + * has ever been an error set since it was first initialized.
> + */
> +
> +/* The low bits are 

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread NeilBrown
On Wed, Apr 05 2017, Jan Kara wrote:

> On Wed 05-04-17 11:43:32, NeilBrown wrote:
>> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>> 
>> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> >> > > > Because if above is acceptable we could make reported i_version to 
>> >> > > > be a sum
>> >> > > > of "superblock crash counter" and "inode i_version". We increment
>> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
>> >> > > > shutdown.
>> >> > > > That way after a crash we are guaranteed each inode will report new
>> >> > > > i_version (the sum would probably have to look like "superblock 
>> >> > > > crash
>> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
>> >> > > > possible
>> >> > > > i_version numbers we gave away but did not write to disk but 
>> >> > > > still...).
>> >> > > > Thoughts?
>> >> > 
>> >> > How hard is this for filesystems to support?  Do they need an on-disk
>> >> > format change to keep track of the crash counter?  Maybe not, maybe the
>> >> > high bits of the i_version counters are all they need.
>> >> > 
>> >> 
>> >> Yeah, I imagine we'd need a on-disk change for this unless there's
>> >> something already present that we could use in place of a crash counter.
>> >
>> > We could consider using the current time instead.  So, put the current
>> > time (or time of last boot, or this inode's ctime, or something) in the
>> > high bits of the change attribute, and keep the low bits as a counter.
>> 
>> This is a very different proposal.
>> I don't think Jan was suggesting that the i_version be split into two
>> bit fields, one the change-counter and one the crash-counter.
>> Rather, the crash-counter was multiplied by a large-number and added to
>> the change-counter with the expectation that while not ever
>> change-counter landed on disk, at least 1 in every large-number would.
>> So after each crash we effectively add large-number to the
>> change-counter, and can be sure that number hasn't been used already.
>
> Yes, that was my thinking.
>
>> To store the crash-counter in each inode (which does appeal) you would
>> need to be able to remove it before adding the new crash counter, and
>> that requires bit-fields.  Maybe there are enough bits.
>
> Furthermore you'd have a potential problem that you need to change
> i_version on disk just because you are reading after a crash and such
> changes tend to be problematic (think of read-only mounts and stuff like
> that).
>  
>> If you want to ensure read-only files can remain cached over a crash,
>> then you would have to mark a file in some way on stable storage
>> *before* allowing any change.
>> e.g. you could use the lsb.  Odd i_versions might have been changed
>> recently and crash-count*large-number needs to be added.
>> Even i_versions have not been changed recently and nothing need be
>> added.
>> 
>> If you want to change a file with an even i_version, you subtract
>>   crash-count*large-number
>> from the i_version, then set lsb.  This is written to stable storage before
>> the change.
>> 
>> If a file has not been changed for a while, you can add
>>   crash-count*large-number
>> and clear lsb.
>> 
>> The lsb of the i_version would be for internal use only.  It would not
>> be visible outside the filesystem.
>> 
>> It feels a bit clunky, but I think it would work and is the best
>> combination of Jan's idea and your requirement.
>> The biggest cost would be switching to 'odd' before any changes, and the
>> unknown is when does it make sense to switch to 'even'.
>
> Well, there is also a problem that you would need to somehow remember with
> which 'crash count' the i_version has been previously reported as that is
> not stored on disk with my scheme. So I don't think we can easily use your
> scheme.

I don't think there is a problem here; maybe I didn't explain it
properly or something.

I'm assuming there is a crash-count that is stored once per filesystem.
This might be a disk-format change, 

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread NeilBrown
On Tue, Apr 04 2017, J. Bruce Fields wrote:

> On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > Because if above is acceptable we could make reported i_version to be 
>> > > > a sum
>> > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > "superblock crash counter" whenever we detect unclean filesystem 
>> > > > shutdown.
>> > > > That way after a crash we are guaranteed each inode will report new
>> > > > i_version (the sum would probably have to look like "superblock crash
>> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> > > > i_version numbers we gave away but did not write to disk but still...).
>> > > > Thoughts?
>> > 
>> > How hard is this for filesystems to support?  Do they need an on-disk
>> > format change to keep track of the crash counter?  Maybe not, maybe the
>> > high bits of the i_version counters are all they need.
>> > 
>> 
>> Yeah, I imagine we'd need a on-disk change for this unless there's
>> something already present that we could use in place of a crash counter.
>
> We could consider using the current time instead.  So, put the current
> time (or time of last boot, or this inode's ctime, or something) in the
> high bits of the change attribute, and keep the low bits as a counter.

This is a very different proposal.
I don't think Jan was suggesting that the i_version be split into two
bit fields, one the change-counter and one the crash-counter.
Rather, the crash-counter was multiplied by a large-number and added to
the change-counter with the expectation that while not ever
change-counter landed on disk, at least 1 in every large-number would.
So after each crash we effectively add large-number to the
change-counter, and can be sure that number hasn't been used already.

To store the crash-counter in each inode (which does appeal) you would
need to be able to remove it before adding the new crash counter, and
that requires bit-fields.  Maybe there are enough bits.

If you want to ensure read-only files can remain cached over a crash,
then you would have to mark a file in some way on stable storage
*before* allowing any change.
e.g. you could use the lsb.  Odd i_versions might have been changed
recently and crash-count*large-number needs to be added.
Even i_versions have not been changed recently and nothing need be
added.

If you want to change a file with an even i_version, you subtract
  crash-count*large-number
from the i_version, then set lsb.  This is written to stable storage before
the change.

If a file has not been changed for a while, you can add
  crash-count*large-number
and clear lsb.

The lsb of the i_version would be for internal use only.  It would not
be visible outside the filesystem.

It feels a bit clunky, but I think it would work and is the best
combination of Jan's idea and your requirement.
The biggest cost would be switching to 'odd' before any changes, and the
unknown is when does it make sense to switch to 'even'.

NeilBrown




Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread NeilBrown
On Tue, Apr 04 2017, Dave Chinner wrote:

> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
>> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
>> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
>> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > > Because if above is acceptable we could make reported i_version to 
>> > > > > be a sum
>> > > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > > "superblock crash counter" whenever we detect unclean filesystem 
>> > > > > shutdown.
>> > > > > That way after a crash we are guaranteed each inode will report new
>> > > > > i_version (the sum would probably have to look like "superblock crash
>> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
>> > > > > possible
>> > > > > i_version numbers we gave away but did not write to disk but 
>> > > > > still...).
>> > > > > Thoughts?
>> > > 
>> > > How hard is this for filesystems to support?  Do they need an on-disk
>> > > format change to keep track of the crash counter?
>> > 
>> > Yes. We'll need version counter in the superblock, and we'll need to
>> > know what the increment semantics are. 
>> > 
>> > The big question is how do we know there was a crash? The only thing
>> > a journalling filesystem knows at mount time is whether it is clean
>> > or requires recovery. Filesystems can require recovery for many
>> > reasons that don't involve a crash (e.g. root fs is never unmounted
>> > cleanly, so always requires recovery). Further, some filesystems may
>> > not even know there was a crash at mount time because their
>> > architecture always leaves a consistent filesystem on disk (e.g. COW
>> > filesystems)
>> 
>> What filesystems can or cannot easily do obviously differs. Ext4 has a
>> recovery flag set in superblock on RW mount/remount and cleared on
>> umount/RO remount.
>
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem sucessfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro. This uncovered a bug in grub in

Filesystems could use register_reboot_notifier() to get a notification
that even systemd cannot stuff-up.  It could check for dirty data and, if
there is none (which there shouldn't be if a sync happened), it does a
single write to disk to update the superblock (or a single write to each
disk... or something).
md does this, because getting the root device to be marked read-only is
even harder than getting the root filesystem to be remounted read-only.

NeilBrown




Re: [RFC PATCH v1 29/30] fs: track whether the i_version has been queried with an i_state flag

2017-03-03 Thread NeilBrown
On Wed, Dec 21 2016, Jeff Layton wrote:

> @@ -2072,7 +2093,12 @@ inode_cmp_iversion(const struct inode *inode, const u64 old)
>  static inline bool
>  inode_iversion_need_inc(struct inode *inode)
>  {
> - return true;
> + bool ret;
> +
> + spin_lock(&inode->i_lock);
> + ret = inode->i_state & I_VERS_BUMP;
> + spin_unlock(&inode->i_lock);
> + return ret;
>  }
>  

I know this code gets removed, so this isn't really important.
But why do you take the spinlock here?  What are you racing against?

Thanks,
NeilBrown




Re: [RFC PATCH v1 30/30] fs: convert i_version counter over to an atomic64_t

2017-03-03 Thread NeilBrown
On Wed, Dec 21 2016, Jeff Layton wrote:

>  
> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED  (1ULL<<63)
> +

I would prefer that the least significant bit were used, rather than the
most significant.

Partly, this is because the "queried" state is less significant than
that "number has changed" state.
But most, this would mean we wouldn't need inode_cmp_iversion() at all.
We could just use "<" or ">=" or whatever.
The number returned by inode_get_iversion() would always be even (or
maybe odd) and wrapping (after the end of time) would "just work".

Thanks,
NeilBrown




Re: [RFC PATCH v1 11/30] fs: new API for handling i_version

2017-03-03 Thread NeilBrown
On Wed, Dec 21 2016, Jeff Layton wrote:

> We already have inode_inc_iversion. Add inode_set_iversion,
> inode_get_iversion, inode_cmp_iversion and inode_iversion_need_inc.

This list of added interfaces is incomplete.
And some of these interfaces could really use some justification up
front.

You later add a "force" parameter to inode_inc_iversion.
Why not do that up front here?

> +
> +/**
> + * inode_get_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison.
> + */
> +static inline u64
> +inode_get_iversion(const struct inode *inode)
> +{
> + return inode_get_iversion_raw(inode);
> +}

I don't understand why this can use the _raw version rather than the
_read version.
Surely you need to know about any changes after this read.

Thanks,
NeilBrown




Re: [PATCH] exportfs: be careful to only return expected errors.

2016-10-06 Thread NeilBrown
On Thu, Aug 04 2016, NeilBrown wrote:

>
>
> When nfsd calls fh_to_dentry, it expect ESTALE or ENOMEM as errors.
> In particular it can be tempting to return ENOENT, but this is not
> handled well by nfsd.
>
> Rather than requiring strict adherence to error code rules from filesystems,
> treat all unexpected error codes the same as ESTALE.  This is safest.
>
> Signed-off-by: NeilBrown <ne...@suse.com>
> ---
>
> I didn't add a dprintk for unexpected error messages, partly
> because dprintk isn't usable in exportfs.  I could have used pr_debug()
> but I really didn't see much value.
>
> This has been tested together with the btrfs change, and it restores
> correct functionality.
>
> Thanks,
> NeilBrown


Hi Bruce,
 I notice that this patch isn't in -next, but the btrfs change which
 motivated it is.
 That means, unless there is some other work around somewhere, btrfs
 might start having problems with nfs export.

 Can we get this patch into 4.9-rc??

 Or has someone fixed it a different way?

Thanks,
NeilBrown


>
>  fs/exportfs/expfs.c  | 10 ++
>  include/linux/exportfs.h | 13 +++--
>  2 files changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> index 207ba8d627ca..a4b531be9168 100644
> --- a/fs/exportfs/expfs.c
> +++ b/fs/exportfs/expfs.c
> @@ -428,10 +428,10 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
>   if (!nop || !nop->fh_to_dentry)
>   return ERR_PTR(-ESTALE);
>   result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
> - if (!result)
> - result = ERR_PTR(-ESTALE);
> - if (IS_ERR(result))
> - return result;
> + if (PTR_ERR(result) == -ENOMEM)
> + return ERR_CAST(result);
> + if (IS_ERR_OR_NULL(result))
> + return ERR_PTR(-ESTALE);
>  
>   if (d_is_dir(result)) {
>   /*
> @@ -541,6 +541,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
>  
>   err_result:
>   dput(result);
> + if (err != -ENOMEM)
> + err = -ESTALE;
>   return ERR_PTR(err);
>  }
>  EXPORT_SYMBOL_GPL(exportfs_decode_fh);
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index b03c0625fa6e..5ab958cdc50b 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -157,12 +157,13 @@ struct fid {
>   *@fh_to_dentry is given a  super_block (@sb) and a file handle
>   *fragment (@fh, @fh_len). It should return a  dentry which refers
>   *to the same file that the file handle fragment refers to.  If it cannot,
> - *it should return a %NULL pointer if the file was found but no acceptable
> - *dentries were available, or an %ERR_PTR error code indicating why it
> - *couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
> - *returned including, if necessary, a new dentry created with d_alloc_root.
> - *The caller can then find any other extant dentries by following the
> - *d_alias links.
> + *it should return a %NULL pointer if the file cannot be found, or an
> + *%ERR_PTR error code of %ENOMEM if a memory allocation failure occurred.
> +*Any other error code is treated like %NULL, and will cause an %ESTALE error
> + *for callers of exportfs_decode_fh().
> +*Any suitable dentry can be returned including, if necessary, a new dentry
> + *created with d_alloc_root.  The caller can then find any other extant
> + *dentries by following the d_alias links.
>   *
>   * fh_to_parent:
>   *Same as @fh_to_dentry, except that it returns a pointer to the parent
> -- 
> 2.9.2




Re: [PATCH] exportfs: be careful to only return expected errors.

2016-08-04 Thread NeilBrown
On Thu, Aug 04 2016, Christoph Hellwig wrote:

> On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote:
>> 
>> 
>> When nfsd calls fh_to_dentry, it expect ESTALE or ENOMEM as errors.
>> In particular it can be tempting to return ENOENT, but this is not
>> handled well by nfsd.
>> 
>> Rather than requiring strict adherence to error code rules from filesystems,
>> treat all unexpected error codes the same as ESTALE.  This is safest.
>> 
>> Signed-off-by: NeilBrown <ne...@suse.com>
>> ---
>> 
>> I didn't add a dprintk for unexpected error messages, partly
>> because dprintk isn't usable in exportfs.  I could have used pr_debug()
>> but I really didn't see much value.
>> 
>> This has been tested together with the btrfs change, and it restores
>> correct functionality.
>
> I don't really like all this magic which is partially historic.  I think
> we should instead allow the fs to return any error from the export
> operations, and forbid returning NULL entirely.  Then the actual caller
> (nfsd) can sort out which errors it wants to send over the wire.

I'm certainly open to that possibility.
But is the "actual caller":
  nfsd_set_fh_dentry(), or
  fh_verify() or
  the various callers of fh_verify() which might have different rules
  about which error codes are acceptable?

I could probably make an argument for having fh_verify() be careful
about error codes, but as exportfs_decode_fh() is a more public
interface, I think it is more important that it have well defined error
options.

Are there *any* errors that could sensibly be returned from
exportfs_decode_fh() other than
  -ESTALE (there is no such file), or
  -ENOMEM (there probably is a file, but I cannot allocate a dentry for
   it) or
  -EACCES (there is such a file, but it isn't "acceptable")

???

If there aren't, why should we let them through?

NeilBrown




[PATCH] exportfs: be careful to only return expected errors.

2016-08-03 Thread NeilBrown


When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors.
In particular it can be tempting to return ENOENT, but this is not
handled well by nfsd.

Rather than requiring strict adherence to error codes by filesystems,
treat all unexpected error codes the same as ESTALE.  This is safest.

Signed-off-by: NeilBrown <ne...@suse.com>
---

I didn't add a dprintk for unexpected error messages, partly
because dprintk isn't usable in exportfs.  I could have used pr_debug()
but I really didn't see much value.

This has been tested together with the btrfs change, and it restores
correct functionality.

Thanks,
NeilBrown

 fs/exportfs/expfs.c  | 10 ++
 include/linux/exportfs.h | 13 +++--
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 207ba8d627ca..a4b531be9168 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -428,10 +428,10 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
if (!nop || !nop->fh_to_dentry)
return ERR_PTR(-ESTALE);
result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
-   if (!result)
-   result = ERR_PTR(-ESTALE);
-   if (IS_ERR(result))
-   return result;
+   if (PTR_ERR(result) == -ENOMEM)
+   return ERR_CAST(result);
+   if (IS_ERR_OR_NULL(result))
+   return ERR_PTR(-ESTALE);
 
if (d_is_dir(result)) {
/*
@@ -541,6 +541,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 
  err_result:
dput(result);
+   if (err != -ENOMEM)
+   err = -ESTALE;
return ERR_PTR(err);
 }
 EXPORT_SYMBOL_GPL(exportfs_decode_fh);
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index b03c0625fa6e..5ab958cdc50b 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -157,12 +157,13 @@ struct fid {
  *@fh_to_dentry is given a  super_block (@sb) and a file handle
  *fragment (@fh, @fh_len). It should return a  dentry which refers
  *to the same file that the file handle fragment refers to.  If it cannot,
- *it should return a %NULL pointer if the file was found but no acceptable
- * were available, or an %ERR_PTR error code indicating why it
- *couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
- *returned including, if necessary, a new dentry created with d_alloc_root.
- *The caller can then find any other extant dentries by following the
- *d_alias links.
+ *it should return a %NULL pointer if the file cannot be found, or an
+ *%ERR_PTR error code of %ENOMEM if a memory allocation failure occurred.
+ *Any other error code is treated like %NULL, and will cause an %ESTALE error
+ *for callers of exportfs_decode_fh().
+ *Any suitable dentry can be returned including, if necessary, a new dentry
+ *created with d_alloc_root.  The caller can then find any other extant
+ *dentries by following the d_alias links.
  *
  * fh_to_parent:
  *Same as @fh_to_dentry, except that it returns a pointer to the parent
-- 
2.9.2





Re: [PATCH 1/2] Btrfs: be more precise on errors when getting an inode from disk

2016-07-21 Thread NeilBrown
On Fri, Jul 22 2016, J. Bruce Fields wrote:

> On Fri, Jul 22, 2016 at 11:08:17AM +1000, NeilBrown wrote:
>> On Fri, Jun 10 2016, fdman...@kernel.org wrote:
>> 
>> > From: Filipe Manana <fdman...@suse.com>
>> >
>> > When we attempt to read an inode from disk, we end up always returning an
>> > -ESTALE error to the caller regardless of the actual failure reason, which
>> > can be an out of memory problem (when allocating a path), some error found
>> > when reading from the fs/subvolume btree (like a genuine IO error) or the
>> > inode does not exist. So let's start returning the real error code to the
>> > callers so that they don't treat all -ESTALE errors as meaning that the
>> > inode does not exist (such as during orphan cleanup). This will also be
>> > needed for a subsequent patch in the same series dealing with a special
>> > fsync case.
>> >
>> > Signed-off-by: Filipe Manana <fdman...@suse.com>
>> 
>> SNIP
>> 
>> > @@ -5594,7 +5602,8 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
>> >} else {
>> >unlock_new_inode(inode);
>> >iput(inode);
>> > -  inode = ERR_PTR(-ESTALE);
>> > +  ASSERT(ret < 0);
>> > +  inode = ERR_PTR(ret < 0 ? ret : -ESTALE);
>> >}
>> 
>> Just a heads-up.  This change breaks NFS :-(
>> 
>> The change in error code percolates up the call chain:
>> 
>>  nfs4_putfh->fh_verify->nfsd_set_fh_dentry->exportfs_decode_fh
>> ->btrfs_fh_to_dentry->btrfs_get_dentry->btrfs_iget
>> 
>> and nfsd returns NFS4ERR_NOENT to the client instead of NFS4ERR_STALE,
>> and the client doesn't handle that quite the same way.
>> 
>> This doesn't mean that the change is wrong, but it could mean we need to
>> fix something else in the path to sanitize the error code.
>> 
>> nfsd_set_fh_dentry already has
>> 
>>  error = nfserr_stale;
>>  if (PTR_ERR(exp) == -ENOENT)
>>  return error;
>> 
>>  if (IS_ERR(exp))
>>  return nfserrno(PTR_ERR(exp));
>> 
>> for a different error case, so duplicating that would work, but I doubt
>> it is best.  At the very least we should check for valid errors, not
>> specific invalid ones.
>> 
>> Bruce: do you have an opinion where we should make sure that PUTFH (and
>> various other requests) returns a valid error code?
>
> Uh, I guess not.  Maybe exportfs_decode_fh?
>
> Though my kneejerk reaction is to be cranky and wonder why btrfs
> suddenly needs a different convention for decode_fh().
>

I can certainly agree with that perspective, though it would be
appropriate in that case to make sure we document the requirements for
fh_to_dentry (the current spelling for 'decode_fh').  So I went looking
for documentation and found:

 * fh_to_dentry:
 *@fh_to_dentry is given a  super_block (@sb) and a file handle
 *fragment (@fh, @fh_len). It should return a  dentry which refers
 *to the same file that the file handle fragment refers to.  If it cannot,
 *it should return a %NULL pointer if the file was found but no acceptable
 * were available, or an %ERR_PTR error code indicating why it
 *couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
 *returned including, if necessary, a new dentry created with d_alloc_root.
 *The caller can then find any other extant dentries by following the
 *d_alias links.
 *

So the new btrfs code is actually conformant!!
That documentation dates back to 2002, when I wrote it.
And it looks like ENOENT wasn't handled correctly then :-(

I suspect anything that isn't ENOMEM should be converted to ESTALE.
ENOMEM causes the client to be asked to retry the request later.

Does this look reasonable to you?
(Adding Christoph as he has contributed a lot to exportfs)

If there is agreement I'll test and post a proper patch.

Thanks,
NeilBrown


diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 207ba8d627ca..3527b58cd5bc 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -428,10 +428,10 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
if (!nop || !nop->fh_to_dentry)
return ERR_PTR(-ESTALE);
result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
-   if (!result)
-   result = ERR_PTR(-ESTALE);
-   if (IS_ERR(result))
-   return result;
+   if (PTR_ERR(result) == -ENOMEM)
+   return ERR_CAST(result);
+   if (IS_ERR_OR_NULL(result))
+   return ERR_PTR(-ESTALE);
 
if (d_is_dir(result)) {
/*
@@ -541,6 +541,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 
  err_result:
dput(result);
+   if (err != -ENOMEM)
+   err = -ESTALE;
return ERR_PTR(err);
 }
 EXPORT_SYMBOL_GPL(exportfs_decode_fh);




Re: [PATCH 1/2] Btrfs: be more precise on errors when getting an inode from disk

2016-07-21 Thread NeilBrown
On Fri, Jun 10 2016, fdman...@kernel.org wrote:

> From: Filipe Manana <fdman...@suse.com>
>
> When we attempt to read an inode from disk, we end up always returning an
> -ESTALE error to the caller regardless of the actual failure reason, which
> can be an out of memory problem (when allocating a path), some error found
> when reading from the fs/subvolume btree (like a genuine IO error) or the
> inode does not exist. So let's start returning the real error code to the
> callers so that they don't treat all -ESTALE errors as meaning that the
> inode does not exist (such as during orphan cleanup). This will also be
> needed for a subsequent patch in the same series dealing with a special
> fsync case.
>
> Signed-off-by: Filipe Manana <fdman...@suse.com>

SNIP

> @@ -5594,7 +5602,8 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
>   } else {
>   unlock_new_inode(inode);
>   iput(inode);
> - inode = ERR_PTR(-ESTALE);
> + ASSERT(ret < 0);
> + inode = ERR_PTR(ret < 0 ? ret : -ESTALE);
>   }

Just a heads-up.  This change breaks NFS :-(

The change in error code percolates up the call chain:

 nfs4_putfh->fh_verify->nfsd_set_fh_dentry->exportfs_decode_fh
->btrfs_fh_to_dentry->btrfs_get_dentry->btrfs_iget

and nfsd returns NFS4ERR_NOENT to the client instead of NFS4ERR_STALE,
and the client doesn't handle that quite the same way.

This doesn't mean that the change is wrong, but it could mean we need to
fix something else in the path to sanitize the error code.

nfsd_set_fh_dentry already has

error = nfserr_stale;
if (PTR_ERR(exp) == -ENOENT)
return error;

if (IS_ERR(exp))
return nfserrno(PTR_ERR(exp));

for a different error case, so duplicating that would work, but I doubt
it is best.  At the very least we should check for valid errors, not
specific invalid ones.

Bruce: do you have an opinion where we should make sure that PUTFH (and
various other requests) returns a valid error code?

Thanks,
NeilBrown





Re: [PATCH 0/2] scop GFP_NOFS api

2016-05-05 Thread NeilBrown
On Wed, May 04 2016, Dave Chinner wrote:

> FWIW, I don't think making evict() non-blocking is going to be worth
> the effort here. Making memory reclaim wait on a priority ordered
> queue while asynchronous reclaim threads run reclaim as efficiently
> as possible and wakes waiters as it frees the memory the waiters
> require is a model that has been proven to work in the past, and
> this appears to me to be the model you are advocating for. I agree
> that direct reclaim needs to die and be replaced with something
> entirely more predictable, controllable and less prone to deadlock
> contexts - you just need to convince the mm developers that it will
> perform and scale better than what we have now.
>
> In the mean time, having a slightly more fine grained GFP_NOFS
> equivalent context will allow us to avoid the worst of the current
> GFP_NOFS problems with very little extra code.

You have painted two pictures here.  The first is an ideal which does
look a lot like the sort of outcome I was aiming for, but is more than a
small step away.
The second is a band-aid which would take us in exactly the wrong
direction.  It makes an interface which people apparently find hard to
use (or easy to misuse) - the setting of __GFP_FS - and makes it more
complex.  Certainly it would be more powerful, but I think it would also
be more misused.

So I ask myself:  can we take some small steps towards 'A' and thereby
enable at least the functionality enabled by 'B'?

A core design principle for me is to enable filesystems to take control
of their own destiny.   They should have the information available to
make the decisions they need to make, and the opportunity to carry them
out.

All the places where direct reclaim currently calls into filesystems
carry the 'gfp' flags so the file system can decide what to do, with one
exception: evict_inode.  So my first proposal would be to rectify that.

 - redefine .nr_cached_objects and .free_cached_objects so that, if they
   are defined, they are responsible for s_dentry_lru and s_inode_lru.
   e.g. super_cache_count *either* calls ->nr_cached_objects *or* makes
   two calls to list_lru_shrink_count.  This would require exporting
   prune_dcache_sb and prune_icache_sb but otherwise should be a fairly
   straight forward change.
   If nr_cached_objects were defined, super_cache_scan would no longer
   abort without __GFP_FS - that test would be left to the filesystem.

 - Now any filesystem that wants to can stash its super_block pointer
   in current->journal_info while doing memory allocations, and abort
   any reclaim attempts (release_page, shrinker, nr_cached_objects) if
   and only if current->journal_info == "my superblock".  This can be
   done without the core mm code knowing any more than it already does.

 - A more sophisticated filesystem might import much of the code for
   prune_icache_sb() - either by copy/paste or by exporting some vfs
   internals - and then store an inode pointer in current->journal_info
   and only abort reclaim which touches that inode.

 - if a filesystem happens to know that it will never block in any of
   these reclaim calls, it can always allow prune_dcache_sb to run, and
   never needs to use GFP_NOFS.  I think NFS might be close to being
   able to do this as it flushes everything on last-close.  But that is
   something that NFS developers can care about (or not) quite
   independently from mm people.

 - Maybe some fs developer will try to enable free_cached_objects to do
   as much work as possible for every inode, but never deadlock.  It
   could do its own fs-specfic deadlock detection, or could queue work
   to a work queue and wait a limited time for it.  Or something.
   If some filesystem developer comes up with something that works
   really well, developers of other filesystems might copy it - or not
   as they choose.

Maybe ->journal_info isn't perfect for this.  It is currently only safe
for reclaim code to compare it against a known value.  It is not safe to
dereference it to see if it points to a known value.  That could possibly be
cleaned up, or another task_struct field could be provided for
filesystems to track their state.  Or do you find a task_struct field
unacceptable and there is some reason and that an explicitly passed cookie
is superior?

My key point is that we shouldn't try to plumb some new abstraction
through the MM code so there is a new pattern for all filesystems to
follow.  Rather the mm/vfs should get out of the filesystems' way as much
as possible and let them innovate independently.

Thanks for your time,
NeilBrown




Re: [PATCH 0/2] scop GFP_NOFS api

2016-05-03 Thread NeilBrown
On Wed, May 04 2016, Michal Hocko wrote:

> Hi,
>
> On Sun 01-05-16 07:55:31, NeilBrown wrote:
> [...]
>> One particular problem with your process-context idea is that it isn't
>> inherited across threads.
>> Steve Whitehouse's example in gfs shows how allocation dependencies can
>> even cross into user space.
>
> Hmm, I am still not sure I understand that example completely but making
> a dependency between direct reclaim and userspace can hardly work.

No it can't.  Specifically: if direct reclaim blocks on user-space that
must be a problem.
I think the point of this example is that some filesystem things can
block on user-space in ways that are very hard to encode with flags
as they are multi-level indirect.
So the conclusion (my conclusion) is that direct reclaim mustn't block.

When I was working on deadlock avoidance in loop-back NFS I went down
the path of adding GFP flags and extended the PF_FSTRANS flag and got it
working (I think) but no-one liked it.  It was way too intrusive.

Somehow I landed on the idea of making nfs_release_page non-blocking
and everything suddenly became much much simpler.  Problems evaporated.

NFS has a distinct advantage here.  The "Close-to-open" cache semantic
means that all dirty pages must be flushed on last close.  So when
->evict_inode is finally called there is nothing much to do - just free
everything up.  So I could fix NFS without worrying about (or even
noticing) evict_inode.


> Especially when the direct reclaim might be sitting on top of hard to
> guess pile of locks. So unless I've missed anything what Steve has
> described is a clear NOFS context.
>
>> A more localized one that I have seen is that NFSv4 sometimes needs to
>> start up a state-management thread (particularly if the server
>> restarted).
>> It uses kthread_run(), which doesn't actually create the thread but asks
>> kthreadd to do it.  If NFS writeout is waiting for state management it
>> would need to make sure that kthreadd runs in allocation context to
>> avoid deadlock.
>> I feel that I've forgotten some important detail here and this might
>> have been fixed somehow, but the point still stands that the allocation
>> context can cross from thread to thread and can effectively become
>> anything and everything.
>
> Not sure I understand your point here but relying on kthread_run
> from GFP_NOFS context has always been deadlock prone with or without
> scope GFP_NOFS semantic so I am not really sure I see your point
> here. Similarly relying on a work item which doesn't have a dedicated
> WQ_MEM_RECLAIM WQ is deadlock prone.  You simply shouldn't do that.

The point is really that saying "You shouldn't do that" isn't much good
when "that" is exactly what the fs developer wants to do and it seems to
work and never triggers a warning.

You can create lots of rules about what is or is not allowed, or
you can make everything that is not explicitly forbidden (ideally at
compile time but possibly at runtime with might_sleep or lockdep),
permitted.

I prefer the latter.

>
>> It is OK to wait for memory to be freed.  It is not OK to wait for any
>> particular piece of memory to be freed because you don't always know who
>> is waiting for you, or who you really are waiting on to free that
>> memory.
>> 
>> Whenever trying to free memory I think you need to do best-effort
>> without blocking.
>
> I agree with that. Or at least you have to wait on something that is
> _guaranteed_ to make a forward progress. I am not really that sure this
> is easy to achieve with the current code base.

I accept that it isn't "easy".  But I claim that it isn't particularly
difficult either.

NeilBrown




Re: [PATCH 0/2] scop GFP_NOFS api

2016-04-30 Thread NeilBrown
On Sat, Apr 30 2016, Dave Chinner wrote:

> On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
>> On Tue, Apr 26 2016, Michal Hocko wrote:
>> 
>> > Hi,
>> > we have discussed this topic at LSF/MM this year. There was a general
>> > interest in the scope GFP_NOFS allocation context among some FS
>> > developers. For those who are not aware of the discussion or the issue
>> > I am trying to sort out (or at least start in that direction) please
>> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> > which basically copies what we have for the scope GFP_NOIO allocation
>> > context. I haven't converted any of the FS myself because that is way
>> > beyond my area of expertise but I would be happy to help with further
>> > changes on the MM front as well as in some more generic code paths.
>> >
>> > Dave had an idea on how to further improve the reclaim context to be
>> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> > and FS specific cookie set in the FS allocation context and consumed
>> > by the FS reclaim context to allow doing some provably save actions
>> > that would be skipped due to GFP_NOFS normally.  I like this idea and
>> > I believe we can go that direction regardless of the approach taken here.
>> > Many filesystems simply need to cleanup their NOFS usage first before
>> > diving into more complex changes.
>> 
>> This strikes me as over-engineering to work around an unnecessarily
>> burdensome interface but without details it is hard to be certain.
>> 
>> Exactly what things happen in "FS reclaim context" which may, or may
>> not, be safe depending on the specific FS allocation context?  Do they
>> need to happen at all?
>> 
>> My research suggests that for most filesystems the only thing that
>> happens in reclaim context that is at all troublesome is the final
>> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
>> inode to storage.  Some time ago we moved most dirty-page writeout out
>> of the reclaim context and into kswapd.  I think this was an excellent
>> advance in simplicity.
>
> No, we didn't move dirty page writeout to kswapd - we moved it back
> to the background writeback threads where it can be done
> efficiently.  kswapd should almost never do single page writeback
> because of how inefficient it is from an IO perspective, even though
> it can. i.e. if we are doing any significant amount of dirty page
> writeback from memory reclaim (direct, kswapd or otherwise) then
> we've screwed something up.
>
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
>
> When lots of GFP_NOFS allocation is being done, this already
> happens. The shrinkers that can't run due to context accumulate the
> work on the shrinker structure, and when the shrinker can next run
> (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
> reclaim contexts.
>
> IOWs, we already move shrinker work from direct reclaim to kswapd
> when appropriate.
>
>> The exceptions include:
>>  - nfs and any filesystem using fscache can block for up to 1 second
>>in ->releasepage().  They used to block waiting for some IO, but that
>>caused deadlocks and wasn't really needed.  I left the timeout because
>>it seemed likely that some throttling would help.  I suspect that a
>>careful analysis will show that there is sufficient throttling
>>elsewhere.
>> 
>>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>>for IO so it can free some quotainfo things. 
>
> No it's not. evict() can block on IO - waiting for data or inode
> writeback to complete, or even for filesystems to run transactions
> on the inode. Hence the superblock shrinker can and does block in
> inode cache reclaim.

That is why I said "nearly" :-)

>
> Indeed, blocking the superblock shrinker in reclaim is a key part of
> balancing inode cache pressure in XFS. If the shrinker starts
> hitting dirty inodes, it blocks on cleaning them, thereby slowing
> the rate of allocation to that which inodes can be cleaned and
> reclaimed. There are also background threads that walk ahead freeing
> clean inodes, but we have to throttle direct reclaim in this manner
> otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. if we don't balance this, inode allocation triggers
> the OOM killer because reclaim 

Re: [PATCH 0/2] scop GFP_NOFS api

2016-04-30 Thread NeilBrown
On Fri, Apr 29 2016, Michal Hocko wrote:

>
> One think I have learned is that shrinkers can be really complex and
> getting rid of GFP_NOFS will be really hard so I would really like to
> start the easiest way possible and remove the direct usage and replace
> it by scope one which would at least _explain_ why it is needed. I think
> this is a reasonable _first_ step and a large step ahead because we have
> a good chance to get rid of a large number of those which were used
> "just because I wasn't sure and this should be safe, right?". I wouldn't
> be surprised if we end up with a very small number of both scope and
> direct usage in the end.

Yes, shrinkers can be complex.  About two of them are.  We could fix
lots and lots of call sites, or fix two shrinkers.
OK, that's a bit unfair as fixing one of the shrinkers involves changing
many ->evict_inode() functions.  But that would be a very focused
change.

I think your proposal is little more than re-arranging deck chairs on
the titanic.  Yes, it might give everybody a better view of the iceberg
but the iceberg is still there and in reality we can already see it.

The main iceberg is evict_inode.  It appears in both examples given so
far: xfs and gfs.  There are other little icebergs but they won't last
long after evict_inode is dealt with.

One particular problem with your process-context idea is that it isn't
inherited across threads.
Steve Whitehouse's example in gfs shows how allocation dependencies can
even cross into user space.

A more localized one that I have seen is that NFSv4 sometimes needs to
start up a state-management thread (particularly if the server
restarted).
It uses kthread_run(), which doesn't actually create the thread but asks
kthreadd to do it.  If NFS writeout is waiting for state management it
would need to make sure that kthreadd runs in allocation context to
avoid deadlock.
I feel that I've forgotten some important detail here and this might
have been fixed somehow, but the point still stands that the allocation
context can cross from thread to thread and can effectively become
anything and everything.

It is OK to wait for memory to be freed.  It is not OK to wait for any
particular piece of memory to be freed because you don't always know who
is waiting for you, or who you really are waiting on to free that
memory.

Whenever trying to free memory I think you need to do best-effort
without blocking.

>
> I would also like to revisit generic inode/dentry shrinker and see
> whether it could be more __GFP_FS friendly. As you say many FS might
> even not depend on some FS internal locks so pushing GFP_FS check down
> the layers might make a lot of sense and allow to clean some [id]cache
> even for __GFP_FS context.

I think the only part of prune_dcache_sb() that might need care is
iput() which boils down to evict().  The final unlink for NFS
silly-rename might happen in there too (in d_iput).
Shrinking the dcache seems rather late to be performing that unlink
though, so I've probably missed some key detail.

If we find a way for evict(), when called from the shrinker, to be
non-blocking, and generally require all shrinkers to be non-blocking,
then many of these allocation problems evaporate.

Thanks,
NeilBrown





Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-30 Thread NeilBrown
On Fri, Apr 29 2016, Steven Whitehouse wrote:

> Hi,
>
> On 29/04/16 06:35, NeilBrown wrote:
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
> evict() is an issue, but moving it into kswapd would be a potential 
> problem for GFS2. We already have a memory allocation issue when 
> recovery is taking place and memory is short. The code path is as follows:
>
>   1. Inode is scheduled for eviction (which requires deallocation)
>   2. The glock is required in order to perform the deallocation, which 
> implies getting a DLM lock
>   3. Another node in the cluster fails, so needs recovery
>   4. When the DLM lock is requested, it gets blocked until recovery is 
> complete (for the failed node)
>   5. Recovery is performed using a userland fencing utility
>   6. Fencing requires memory and then blocks on the eviction
>   7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on 
> DLM lock, DLM lock waiting on fencing)

You even have user-space in the loop there - impressive!  You can't
really pass GFP_NOFS to a user-space thread, can you :-?

>
> It doesn't happen often, but we've been looking at the best place to 
> break that cycle, and one of the things we've been wondering is whether 
> we could avoid deallocation evictions from memory related contexts, or 
> at least make it async somehow.

I think "async" is definitely the answer and I think
evict()/evict_inode() is the best place to focus attention.

I can see now (thanks) that just moving the evict() call to kswapd isn't
really a solution as it will just serve to block kswapd and so lots of
other freeing of memory won't happen.

I'm now imagining giving ->evict_inode() a "don't sleep" flag and
allowing it to return -EAGAIN.  In that case evict would queue the inode
to kswapd (or maybe another thread) for periodic retry.

The flag would only get set when prune_icache_sb() calls dispose_list()
to call evict().  Other uses (e.g. unmount, iput) would still be
synchronous.

How difficult would it be to change gfs's evict_inode() to optionally
never block?

For this to work we would need to add a way for
deactivate_locked_super() to wait for all the async evictions to
complete.  Currently prune_icache_sb() is called under s_umount.  If we
moved part of the eviction out of that lock some other synchronization
would be needed.  Possibly a per-superblock list of "inodes being
evicted" would suffice.

Thanks,
NeilBrown


>
> The issue is that it is not possible to know in advance whether an 
> eviction will result in merely writing things back to disk (because the 
> inode is being dropped from cache, but still resides on disk) which is 
> easy to do, or whether it requires a full deallocation (n_link==0) which 
> may require significant resources and time,
>
> Steve.




Re: [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread NeilBrown
On Tue, Apr 26 2016, Michal Hocko wrote:

> Hi,
> we have discussed this topic at LSF/MM this year. There was a general
> interest in the scope GFP_NOFS allocation context among some FS
> developers. For those who are not aware of the discussion or the issue
> I am trying to sort out (or at least start in that direction) please
> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> which basically copies what we have for the scope GFP_NOIO allocation
> context. I haven't converted any of the FS myself because that is way
> beyond my area of expertise but I would be happy to help with further
> changes on the MM front as well as in some more generic code paths.
>
> Dave had an idea on how to further improve the reclaim context to be
> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> and FS specific cookie set in the FS allocation context and consumed
> by the FS reclaim context to allow doing some provably save actions
> that would be skipped due to GFP_NOFS normally.  I like this idea and
> I believe we can go that direction regardless of the approach taken here.
> Many filesystems simply need to cleanup their NOFS usage first before
> diving into more complex changes.

This strikes me as over-engineering to work around an unnecessarily
burdensome interface but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

The exceptions include:
 - nfs and any filesystem using fscache can block for up to 1 second
   in ->releasepage().  They used to block waiting for some IO, but that
   caused deadlocks and wasn't really needed.  I left the timeout because
   it seemed likely that some throttling would help.  I suspect that a
   careful analysis will show that there is sufficient throttling
   elsewhere.

 - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
   for IO so it can free some quotainfo things.  If it could be changed
   to just schedule the IO without waiting for it then I think this
   would be safe to be called in any FS allocation context.  It already
   uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
   if the lock is held.

I think you/we would end up with a much simpler system if instead of
focussing on the places where GFP_NOFS is used, we focus on places where
__GFP_FS is tested, and try to remove them.  If we get rid of enough of
them the remainder could just use __GFP_IO.

> The patch 2 is a debugging aid which warns about explicit allocation
> requests from the scope context. This is should help to reduce the
> direct usage of the NOFS flags to bare minimum in favor of the scope
> API. It is not aimed to be merged upstream. I would hope Andrew took it
> into mmotm tree to give it linux-next exposure and allow developers to
> do further cleanups.  There is a new kernel command line parameter which
> has to be used for the debugging to be enabled.
>
> I think the GFP_NOIO should be seeing the same clean up.

I think you are suggesting that use of GFP_NOIO should (largely) be
deprecated in favour of memalloc_noio_save().  I think I agree.
Could we go a step further and deprecate GFP_ATOMIC in favour of some
in_atomic() test?  Maybe that is going too far.

Thanks,
NeilBrown

>
> Any feedback is highly appreciated.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [PATCH v2 3/3] block: avoid to call .bi_end_io() recursively

2016-04-26 Thread NeilBrown
On Wed, Apr 27 2016, Ming Lei wrote:

> There were reports about heavy stack use by recursive calling
> .bi_end_io()([1][2][3]). For example, more than 16K stack is
> consumed in a single bio complete path[3], and in [2] stack
> overflow can be triggered if 20 nested dm-crypt is used.
>
> Also patches[1] [2] [3] were posted for addressing the issue,
> but never be merged. And the idea in these patches is basically
> similar, all serializes the recursive calling of .bi_end_io() by
> percpu list.
>
> This patch still takes the same idea, but uses bio_list to
> implement it, which turns out more simple and the code becomes
> more readable meantime.
>
> One corner case which wasn't covered before is that
> bi_endio() may be scheduled to run in process context(such
> as btrfs), and this patch just bypasses the optimizing for
> that case because one new context should have enough stack space,
> and this approach isn't capable of optimizing it too because
> there isn't easy way to get a per-task linked list head.
>
> xfstests(-g auto) is run with this patch and no regression is
> found on ext4, xfs and btrfs.
>
> [1] http://marc.info/?t=12142850204=1=2
> [2] http://marc.info/?l=dm-devel=139595190620008=2
> [3] http://marc.info/?t=14597464411=1=2
>
> Cc: Shaun Tancheff <shaun.tanch...@seagate.com>
> Cc: Christoph Hellwig <h...@infradead.org>
> Cc: Mikulas Patocka <mpato...@redhat.com>
> Cc: Alan Cox <a...@linux.intel.com>
> Cc: Neil Brown <ne...@suse.de>
> Cc: Liu Bo <bo.li@oracle.com>
> Signed-off-by: Ming Lei <ming@canonical.com>
> ---
>  block/bio.c | 57 +++--
>  1 file changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 807d25e..6b4ca7b 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -68,6 +68,8 @@ static DEFINE_MUTEX(bio_slab_lock);
>  static struct bio_slab *bio_slabs;
>  static unsigned int bio_slab_nr, bio_slab_max;
>  
> +static DEFINE_PER_CPU(struct bio_list *, bio_end_list) = { NULL };
> +
>  static struct kmem_cache *bio_find_or_create_slab(unsigned int extra_size)
>  {
>   unsigned int sz = sizeof(struct bio) + extra_size;
> @@ -1737,6 +1739,58 @@ static inline bool bio_remaining_done(struct bio *bio)
>   return false;
>  }
>  
> +static void __bio_endio(struct bio *bio)
> +{
> + if (bio->bi_end_io)
> + bio->bi_end_io(bio);
> +}
> +
> +/* disable local irq when manipulating the percpu bio_list */
> +static void unwind_bio_endio(struct bio *bio)
> +{
> + struct bio_list *bl;
> + unsigned long flags;
> +
> + /*
> +  * We can't optimize if bi_endio() is scheduled to run from
> +  * process context because there isn't easy way to get a
> +  * per-task bio list head or allocate a per-task variable.
> +  */
> + if (!in_interrupt()) {
> + /*
> +  * It has to be a top calling when it is run from
> +  * process context.
> +  */
> + WARN_ON(this_cpu_read(bio_end_list));
> + __bio_endio(bio);
> + return;
> + }
> +
> + local_irq_save(flags);
> + bl = __this_cpu_read(bio_end_list);
> + if (!bl) {
> + struct bio_list bl_in_stack;
> +
> + bl = &bl_in_stack;
> + bio_list_init(bl);
> + __this_cpu_write(bio_end_list, bl);

The patch seems to make sense, but this bit bothers me.
You are expecting bl_in_stack to still be usable after this block of
code completes.  While it probably is, I don't think it is a good idea
to depend on it.
If you move the "struct bio_list bl_in_stack" to the top of the function
I would be a lot happier.

Or you could change the code to:

   if (bl) {
   bio_list_add(bl, bio);
   } else {
   struct bio_list bl_in_stack;
   ... use bl_in_stack,
   while loop
   set bio_end_list to NULL
   }

and the code flow would all be much clearer.

Thanks,
NeilBrown

> + } else {
> + bio_list_add(bl, bio);
> + goto out;
> + }
> +
> + while (bio) {
> + local_irq_restore(flags);
> + __bio_endio(bio);
> + local_irq_save(flags);
> +
> + bio = bio_list_pop(bl);
> + }
> + __this_cpu_write(bio_end_list, NULL);
> + out:
> + local_irq_restore(flags);
> +}
> +
>  /**
>   * bio_endio - end I/O on a bio
>   * @bio: bio
> @@ -1765,8 +1819,7 @@ again:
>   goto again;
>   }
>  
> - if (bio->bi_end_io)
> - bio->bi_end_io(bio);
> + unwind_bio_endio(bio);
>  }
>  EXPORT_SYMBOL(bio_endio);
>  
> -- 
> 1.9.1
>




Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe

2016-04-05 Thread NeilBrown
On Wed, Apr 06 2016, Shaohua Li wrote:

> On Tue, Apr 05, 2016 at 03:36:57PM +0200, Lars Ellenberg wrote:
>> blk_check_plugged() will return a pointer
>> to an object linked on current->plug->cb_list.
>> 
>> That list may "at any time" be implicitly cleared by
>> blk_flush_plug_list()
>>  flush_plug_callbacks()
>> either as a result of blk_finish_plug(),
>> or implicitly by schedule() [and maybe other implicit mechanisms?]
>> 
>> If there is no protection against an implicit unplug
>> between the call to blk_check_plug() and using its return value,
>> that implicit unplug may have already happened,
>> even before the plug is actually initialized or populated,
>> and we may be using a pointer to already free()d data.
>
> This isn't correct. flush plug is never called in preemption, which is 
> designed
> only called when the task is going to sleep. See sched_submit_work. Am I
> missing anything?

Ahh yes, thanks.

Only two places call blk_schedule_flush_plug().
One is io_schedule_timeout() which must be called explicitly.
There other is, as you say, sched_submit_work().  It starts:

static inline void sched_submit_work(struct task_struct *tsk)
{
if (!tsk->state || tsk_is_pi_blocked(tsk))
return;

so if the task is runnable, then as
include/linux/sched.h:#define TASK_RUNNING  0

it will never call blk_schedule_flush_plug().

So I don't think you are missing anything, we were.

Lars:  have your concerns been relieved or do you still have reason to
think there is a problem?

Thanks,
NeilBrown




Re: [PATCH] [RFC] fix potential access after free: return value of blk_check_plugged() must be used schedule() safe

2016-04-05 Thread NeilBrown
On Tue, Apr 05 2016, Lars Ellenberg wrote:

> blk_check_plugged() will return a pointer
> to an object linked on current->plug->cb_list.
>
> That list may "at any time" be implicitly cleared by
> blk_flush_plug_list()
>  flush_plug_callbacks()
> either as a result of blk_finish_plug(),
> or implicitly by schedule() [and maybe other implicit mechanisms?]

I think the only risk here is preemption, so
  preempt_disable() / preempt_enable()
or as you say a spinlock, is sufficient protection.
I would suggest preempt_{dis,en}able for the raid5 code.
Maybe for raid1/raid10 too just for consistency.

>
> If there is no protection against an implicit unplug
> between the call to blk_check_plug() and using its return value,
> that implicit unplug may have already happened,
> even before the plug is actually initialized or populated,
> and we may be using a pointer to already free()d data.
>
> I suggest that both raid1 and raid10 can easily be fixed
> by moving the call to blk_check_plugged() inside the spinlock.
>
> For md/raid5 and btrfs/raid56,
> I'm unsure how (if) this needs to be fixed.
>
> The other current in-tree users of blk_check_plugged()
> are mm_check_plugged(), and mddev_check_plugged().
>
> mm_check_plugged() is already used safely inside a spinlock.
>
> with mddev_check_plugged() I'm unsure, at least on a preempt kernel.

I think this is only an issue on a preempt kernel, and in that case: yes
- mddev_check_plugged() needs protection.  Maybe preempt enable/disable
could be done in blk_check_plugged() so those calls which don't
dereference the pointer don't need further protection.
Or maybe blk_check_plugged should have WARN_ON_ONCE(!in_atomic());

>
> Did I overlook any magic that protects against such implicit unplug?

Just the fortunate lack of preemption probably.

>
> Also, why pretend that a custom plug struct (such as raid1_plug_cb)
> may have its member "struct blk_plug_cb cb" at an arbitrary offset?
> As it is, raid1_check_plugged() below is actually just a cast.

Fair point.  I generally prefer container_of to casts because it is more
obviously correct, and type checked.
However as blk_check_plugged performs the allocation, the blk_plug_cb
must be at the start of the containing structure, so the complex tests
for handling NULL are just noise.
I'd be happy for that to be changed.

Thanks,
NeilBrown



>
> Signed-off-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> ---
>  drivers/md/raid1.c  | 19 +--
>  drivers/md/raid10.c | 21 +
>  drivers/md/raid5.c  |  5 +
>  fs/btrfs/raid56.c   |  5 +
>  4 files changed, 36 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 39fb21e..55dc960 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1044,6 +1044,18 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool 
> from_schedule)
>   kfree(plug);
>  }
>  
> +static struct raid1_plug_cb *raid1_check_plugged(struct mddev *mddev)
> +{
> + /* return (struct raid1_plug_cb*)blk_check_plugged(...); */
> + struct blk_plug_cb *cb;
> + struct raid1_plug_cb *plug = NULL;
> +
> + cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
> + if (cb)
> + plug = container_of(cb, struct raid1_plug_cb, cb);
> + return plug;
> +}
> +
>  static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  {
>   struct r1conf *conf = mddev->private;
> @@ -1060,7 +1072,6 @@ static void raid1_make_request(struct mddev *mddev, 
> struct bio * bio)
> & (REQ_DISCARD | REQ_SECURE));
>   const unsigned long do_same = (bio->bi_rw & REQ_WRITE_SAME);
>   struct md_rdev *blocked_rdev;
> - struct blk_plug_cb *cb;
>   struct raid1_plug_cb *plug = NULL;
>   int first_clone;
>   int sectors_handled;
> @@ -1382,12 +1393,8 @@ read_again:
>  
>   atomic_inc(&r1_bio->remaining);
>  
> - cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
> - if (cb)
> - plug = container_of(cb, struct raid1_plug_cb, cb);
> - else
> - plug = NULL;
> + spin_lock_irqsave(&conf->device_lock, flags);
> + plug = raid1_check_plugged(mddev);
>   if (plug) {
>   bio_list_add(&plug->pending, mbio);
>   plug->pending_cnt++;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index e3fd725..d7d4397 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1052,6 +1052,18 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool 
> from_schedule

Re: [PATCH v7 01/20] btrfs: dedup: Introduce dedup framework and its header

2016-03-13 Thread NeilBrown
On Sun, Mar 13 2016, Qu Wenruo wrote:

> Qu Wenruo wrote on 2016/03/12 16:16 +0800:
>>
>>
>> On 03/11/2016 07:43 PM, David Sterba wrote:
>>> On Thu, Mar 10, 2016 at 08:57:12AM +0800, Qu Wenruo wrote:
>>>>> The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
>>>>> It would be nice if we could be consistent and all use the same
>>>>> abbreviation.
>>>>
>>>> Yes, current kernel VFS level offline dedup uses the name "dedupe".
>>>> But on the other hand, ZFS uses the name "dedup" for their online dedup.
>>>>
>>>> And personally speaking, I'd like some difference to distinguish inline
>>>> dedup and offline dedup.
>>>> In that case, the extra "e" seems somewhat useful.
>>>> With "e", it's intended for offline use. Without "e", it's intended for
>>>> online use.
>>>
>>> Such difference is very subtle and I think we should stick to just one
>>> spelling, which shall be 'dedupe'.
>>
>> OK, I'll change them to 'dedupe' in next bug fix version.
>>
>> Thanks,
>> Qu
>>
>>
> BTW, I am always interested in, why de-duplication can be shorted as 
> 'dedupe'.

The "u" in "duplicate" is pronounced as a long vowel sound, almost like
   d-you-plicate.

Normal pronunciation rules for English indicate that "dup" should be
pronounced with a short vowel sound, like "cup".  So "dup" sounds wrong.

To make a vowel long you can add an 'e' at the end of a word.
So:
   tub or cub  have a short "u"
   tube or cube  have a long "u".

by analogy, "dupe" has a long "u" and so sounds like the first syllable
of "duplicate".

NeilBrown


>
> I didn't see any 'e' in the whole word "DUPlication".
> Or it's an abbreviation of "DUPlicatE" instead of "DUPlication"?
>
> Thanks,
> Qu
>
>
>




Re: [PATCH v7 01/20] btrfs: dedup: Introduce dedup framework and its header

2016-03-09 Thread NeilBrown
On Thu, Feb 18 2016, Qu Wenruo wrote:

> +
> +/*
> + * Dedup storage backend
> + * On disk is persist storage but overhead is large
> + * In memory is fast but will lose all its hash on umount
> + */
> +#define BTRFS_DEDUP_BACKEND_INMEMORY 0
> +#define BTRFS_DEDUP_BACKEND_ONDISK   1
> +#define BTRFS_DEDUP_BACKEND_LAST 2

Hi,

This may seem petty, but I'm here to complain about the names. :-)

Firstly, "2" is *not* the LAST backend.  The LAST backend is clearly
"ONDISK" which is "1".
"2" is the number of backends, or the count of them.
So
> +#define BTRFS_DEDUP_BACKEND_LAST 1

would be OK, as would

> +#define BTRFS_DEDUP_BACKEND_COUNT 2

but what you have is wrong.

The place where you use this define:

+   if (backend >= BTRFS_DEDUP_BACKEND_LAST)
+   return -EINVAL;

is correct, but it looks wrong.  It looks like it is saying that it is
invalid to use the LAST backend!

Secondly, you use "dup" as an abbreviation of "duplicate".
The ioctl FIDEDUPERANGE and the tool duperemove both use "dupe".
It would be nice if we could be consistent and all use the same
abbreviation.

Thanks,
NeilBrown




Re: [PATCH] block: add a bi_error field to struct bio

2015-07-21 Thread NeilBrown
On Mon, 20 Jul 2015 15:29:37 +0200 Christoph Hellwig h...@lst.de wrote:

> Currently we have two different ways to signal an I/O error on a BIO:
> 
>  (1) by clearing the BIO_UPTODATE flag
>  (2) by returning a Linux errno value to the bi_end_io callback
> 
> The first one has the drawback of only communicating a single possible
> error (-EIO), and the second one has the drawback of not being persistent
> when bios are queued up, and are not passed along from child to parent
> bio in the ever more popular chaining scenario.  Having both mechanisms
> available has the additional drawback of utterly confusing driver authors
> and introducing bugs where various I/O submitters only deal with one of
> them, and the others have to add boilerplate code to deal with both kinds
> of error returns.
> 
> So add a new bi_error field to store an errno value directly in struct
> bio and remove the existing mechanisms to clean all this up.
> 
> Signed-off-by: Christoph Hellwig h...@lst.de
> ---

Reviewed-by: NeilBrown ne...@suse.com (umem and md/raid).

i.e. these files.
  drivers/block/umem.c|  4 +--
  drivers/md/faulty.c |  4 +--
  drivers/md/linear.c |  2 +-
  drivers/md/md.c | 18 +--
  drivers/md/multipath.c  | 12 +++
  drivers/md/raid0.c  |  2 +-
  drivers/md/raid1.c  | 53 ---
  drivers/md/raid10.c | 55 +++-
  drivers/md/raid5.c  | 52 +++


Thanks,
NeilBrown


[PATCH v2] NILFS2: support NFSv2 export

2015-05-11 Thread NeilBrown

The fh_len passed to ->fh_to_* is not guaranteed to be the same as
that returned by encode_fh - it may be larger.

With NFSv2, the filehandle is fixed length, so it may appear longer
than expected and be zero-padded.

So we must test that fh_len is at least some value, not exactly equal
to it.

Signed-off-by: NeilBrown ne...@suse.de

diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 22180836ec22..37dd6b05b1b5 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -496,8 +496,7 @@ static struct dentry *nilfs_fh_to_dentry(struct super_block 
*sb, struct fid *fh,
 {
struct nilfs_fid *fid = (struct nilfs_fid *)fh;
 
-   if ((fh_len != NILFS_FID_SIZE_NON_CONNECTABLE &&
-fh_len != NILFS_FID_SIZE_CONNECTABLE) ||
+   if (fh_len < NILFS_FID_SIZE_NON_CONNECTABLE ||
    (fh_type != FILEID_NILFS_WITH_PARENT &&
 fh_type != FILEID_NILFS_WITHOUT_PARENT))
return NULL;
@@ -510,7 +509,7 @@ static struct dentry *nilfs_fh_to_parent(struct super_block 
*sb, struct fid *fh,
 {
struct nilfs_fid *fid = (struct nilfs_fid *)fh;
 
-   if (fh_len != NILFS_FID_SIZE_CONNECTABLE ||
+   if (fh_len < NILFS_FID_SIZE_CONNECTABLE ||
fh_type != FILEID_NILFS_WITH_PARENT)
return NULL;
 




Re: [PATCH 3/3] btrfs: set FS_SUPPORTS_SEEK_HOLE flag.

2015-04-26 Thread NeilBrown
On Mon, 20 Apr 2015 02:48:55 -0700 Christoph Hellwig h...@infradead.org
wrote:

> On Mon, Apr 20, 2015 at 10:46:49AM +0100, David Howells wrote:
> > NeilBrown ne...@suse.de wrote:
> > 
> > > Missing patch 2 of the 3-patch series?
> > 
> > Yes. :-)
> > 
> > Do ext4 and xfs support this, do you know?
> 
> Yes.  As do f2fs, ocfs2, gfs2, ceph and NFSv4.2

Are you sure about NFSv4.2?

I see that it *can* report holes, but is there any guarantee that if you
create a new file and write only the 5th block, then READ_PLUS will reliably
report that the first 4 blocks are holes??

Because if it doesn't guarantee that, then NFSv4.2 doesn't fit with the
others where SEEK_HOLE reliably reports holes.
On the other hand if NFSv4.2 *does* guarantee that then the current READ_PLUS
server patches are broken because they just use vfs_llseek and assume they
can trust what it says.

It would be really nice if SEEK_{DATA,HOLE} either reported holes reliably or
returned ENXIO, but I guess there was a good reason not to do that.

Thanks,
NeilBrown




Re: [PATCH 2/3] fscache/cachefiles: optionally use SEEK_DATA instead of ->bmap.

2015-04-21 Thread NeilBrown
On Mon, 20 Apr 2015 02:45:39 -0700 Christoph Hellwig h...@infradead.org
wrote:

> On Mon, Apr 20, 2015 at 04:27:00PM +1000, NeilBrown wrote:
> > A worthwhile goal, but I certainly wouldn't consider pursuing it until
> > what I have submitted so far has been accepted - let's not reject good
> > while waiting for perfect.
> 
> It's still broken.  You add a conditional flag for the almost right
> (almost because the flag in the filesystem type needs to go)

Why does it have to go?  I suspect you have a reason, but I can't read your
mind.

> while
> leaving the broken option the default.

You say it is broken, and yet people are using it and are having a degree of
success.

Surely the appropriate process is:
 - introduce a better option
 - examine each relevant filesystem and transition over to use the new option.
 - remove the not so good option.

I'm still at step 1.

> So what you propose here is not
> good, it's at best just as bad as the old version because you don't
> remove broken code but add a lot more clutter at the same time.

What I propose is measurably better because it works with BTRFS now, and
there seems to be a reasonable path towards making it generally better if
someone cares enough to examine each filesystem.

So I still claim you are pushing back against good because you want
perfect.

NeilBrown




Re: [PATCH 3/3] btrfs: set FS_SUPPORTS_SEEK_HOLE flag.

2015-04-20 Thread NeilBrown
On Mon, 20 Apr 2015 09:47:42 +0100 David Howells dhowe...@redhat.com wrote:

> NeilBrown ne...@suse.de wrote:
> 
> > +   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
> > + FS_SUPPORTS_SEEK_HOLE,
> 
> I must be missing something:
> 
> warthog>git merge linus/master
> Already up-to-date.
> warthog>stg id
> 09d51602cf84a1264946711dd4ea0dddbac599a1
> warthog>grep -r FS_SUPPORTS_SEEK_HOLE include/
> warthog1>git grep FS_SUPPORTS_SEEK_HOLE
> warthog1>
> 

Missing patch 2 of the 3-patch series?

NeilBrown




Re: [PATCH 2/3] fscache/cachefiles: optionally use SEEK_DATA instead of ->bmap.

2015-04-20 Thread NeilBrown
On Sun, 19 Apr 2015 23:08:18 -0700 Christoph Hellwig h...@infradead.org
wrote:

> Please just use SEEK_HOLE/DATA support unconditionally.  ->bmap is a
> horrible hack that is completely unsafe.
> 

A worthwhile goal, but I certainly wouldn't consider pursuing it until what I
have submitted so far has been accepted - let's not reject good while
waiting for perfect.

NeilBrown




Re: [PATCH/RFC] fscache/cachefiles versus btrfs

2015-04-19 Thread NeilBrown
On Fri, 10 Apr 2015 11:24:31 +1000 NeilBrown ne...@suse.de wrote:

> On Thu, 09 Apr 2015 10:23:08 +0100 David Howells dhowe...@redhat.com wrote:
> 
> > NeilBrown ne...@suse.de wrote:
> > 
> > > Is there a better way?  Could a better way be created?  Maybe
> > > SEEK_DATA_RELIABLE ??
> > 
> > fiemap() maybe?
> > 
> > > Also, if you do try to use fscache on btrfs with 3.19, then nothing gets
> > > cached (as expected) and with a heavy load you can lose a race and get an
> > > assertion failure in fscache_enqueue_operation
> > 
> > Do you have the patches here applied?
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes
> 
> No I don't.  I had looked through them and while they did seem to be
> addressing similar sorts of races, nothing seems like an obvious match.
> I haven't been able to reproduce the BUG_ON myself.  I only have a report
> of it repeatedly affecting someone else:
> 
>   https://bugzilla.opensuse.org/show_bug.cgi?id=908706
> 
> I'll probably have to be happy with fixing usage on btrfs, and hope the other
> bug is fixed already or doesn't become a problem.

I managed to reproduce the bug, and when I applied your patches I cannot any
more.  So it looks like you've fixed it - thanks.

That just leaves the bmap issue.  I'll post a patch which causes lseek to be
used when the fs says that is OK.

Thanks,
NeilBrown




[PATCH 3/3] btrfs: set FS_SUPPORTS_SEEK_HOLE flag.

2015-04-19 Thread NeilBrown
This allows fscache to cache files in a btrfs filesystem.

Signed-off-by: NeilBrown ne...@suse.de
---
 fs/btrfs/super.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 05fef198ff94..d3c5d2b40f8e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1904,7 +1904,8 @@ static struct file_system_type btrfs_fs_type = {
.name   = btrfs,
.mount  = btrfs_mount,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_SUPPORTS_SEEK_HOLE,
 };
 MODULE_ALIAS_FS(btrfs);
 




[PATCH 0/3] Allow fscache to work on BTRFS

2015-04-19 Thread NeilBrown
The following three patches allow fscache to cache files in a BTRFS
filesystem.

The first is a minor cleanup to cachefiles.
The second is the main change - it teaches cachefiles to use
lseek(SEEK_DATA) to find allocated blocks in a file, rather than bmap.
The third patch simply enables this for btrfs.

Thanks,
NeilBrown


---

NeilBrown (3):
  cachefiles: perform test on s_blocksize when opening cache file.
  fscache/cachefiles: optionally use SEEK_DATA instead of ->bmap.
  btrfs: set FS_SUPPORTS_SEEK_HOLE flag.


 fs/btrfs/super.c  |3 +
 fs/cachefiles/namei.c |   13 -
 fs/cachefiles/rdwr.c  |  125 ++---
 include/linux/fs.h|1 
 4 files changed, 88 insertions(+), 54 deletions(-)

--
Signature



[PATCH 2/3] fscache/cachefiles: optionally use SEEK_DATA instead of ->bmap.

2015-04-19 Thread NeilBrown
cachefiles currently uses 'bmap' to determine if a given block
in a file has been cached, or not.
Not all filesystems support bmap, particularly BTRFS.

SEEK_DATA can be used to determine if a block in a file has
been allocated, but not all filesystems support this reliably.
On filesystems without explicit report, SEEK_DATA will report anything
below i_size to be valid  data.

So:
 - add a file_system_type flag which confirms that SEEK_DATA and
   SEEK_HOLE will reliably report holes,
 - change cachefiles to use vfs_llseek if FS_SUPPORTS_SEEK_HOLE is
   set, and only use ->bmap if it isn't.

Subsequent patch will set flag for btrfs.  Other filesystems could
usefully have FS_SUPPORTS_SEEK_HOLE set, but I'll leave that to the
relevant maintainers to decide.

Signed-off-by: NeilBrown ne...@suse.de
---
 fs/cachefiles/namei.c |   15 --
 fs/cachefiles/rdwr.c  |  119 +++--
 include/linux/fs.h|1 
 3 files changed, 86 insertions(+), 49 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 5404afcdee98..5d5e56c645ec 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -643,14 +643,17 @@ lookup_again:
/* open a file interface onto a data file */
if (object->type != FSCACHE_COOKIE_TYPE_INDEX) {
if (d_is_reg(object->dentry)) {
-   const struct address_space_operations *aops;
 
ret = -EPERM;
-   aops = object->dentry->d_inode->i_mapping->a_ops;
-   if (!aops->bmap)
-   goto check_error;
-   if (object->dentry->d_sb->s_blocksize > PAGE_SIZE)
-   goto check_error;
+   if (!(object->dentry->d_sb->s_type->fs_flags
+  & FS_SUPPORTS_SEEK_HOLE)) {
+   const struct address_space_operations *aops;
+   aops = object->dentry->d_inode->i_mapping->a_ops;
+   if (!aops->bmap)
+   goto check_error;
+   if (object->dentry->d_sb->s_blocksize > PAGE_SIZE)
+   goto check_error;
+   }
 
object->backer = object->dentry;
} else {
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 9bd44ff48cc0..7c7cbfae7b19 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -394,9 +394,8 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval 
*op,
struct cachefiles_object *object;
struct cachefiles_cache *cache;
struct inode *inode;
-   sector_t block0, block;
-   unsigned shift;
int ret;
+   bool have_data;
 
object = container_of(op-op.object,
  struct cachefiles_object, fscache);
@@ -410,31 +409,47 @@ int cachefiles_read_or_alloc_page(struct 
fscache_retrieval *op,
 
inode = object->backer->d_inode;
ASSERT(S_ISREG(inode->i_mode));
-   ASSERT(inode->i_mapping->a_ops->bmap);
ASSERT(inode->i_mapping->a_ops->readpages);
 
-   /* calculate the shift required to use bmap */
-   shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
-
op->op.flags &= FSCACHE_OP_KEEP_FLAGS;
op->op.flags |= FSCACHE_OP_ASYNC;
op->op.processor = cachefiles_read_copier;
 
-   /* we assume the absence or presence of the first block is a good
-* enough indication for the page as a whole
-* - TODO: don't use bmap() for this as it is _not_ actually good
-*   enough for this as it doesn't indicate errors, but it's all we've
-*   got for the moment
-*/
-   block0 = page->index;
-   block0 <<= shift;
-
-   block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
-   _debug("%llx -> %llx",
-  (unsigned long long) block0,
-  (unsigned long long) block);
-
-   if (block) {
+   if (inode->i_sb->s_type->fs_flags & FS_SUPPORTS_SEEK_HOLE) {
+   /* Use llseek */
+   struct path path;
+   struct file *file;
+   loff_t addr;
+   path.mnt = cache->mnt;
+   path.dentry = object->backer;
+   file = dentry_open(&path, O_RDONLY, cache->cache_cred);
+   if (IS_ERR(file))
+   goto enobufs;
+   addr = page->index;
+   addr <<= PAGE_SHIFT;
+   have_data = (addr == vfs_llseek(file, addr, SEEK_DATA));
+   filp_close(file, NULL);
+   } else {
+   /* we assume the absence or presence of the first block is a good
+* enough indication for the page as a whole
+* - TODO: don't use bmap() for this as it is _not_ actually good
+*   enough for this as it doesn't indicate errors, but it's all we've
+*   got

[PATCH 1/3] cachefiles: perform test on s_blocksize when opening cache file.

2015-04-19 Thread NeilBrown
cachefiles requires that s_blocksize in the cache is not greater than
PAGE_SIZE, and performs the check every time a block is accessed.

Move the test to the place where the file is opened, where other
file-validity tests are performed.

Signed-off-by: NeilBrown ne...@suse.de
---
 fs/cachefiles/namei.c |2 ++
 fs/cachefiles/rdwr.c  |6 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 1e51714eb33e..5404afcdee98 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -649,6 +649,8 @@ lookup_again:
aops = object->dentry->d_inode->i_mapping->a_ops;
if (!aops->bmap)
goto check_error;
+   if (object->dentry->d_sb->s_blocksize > PAGE_SIZE)
+   goto check_error;
 
object->backer = object->dentry;
} else {
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index c6cd8d7a4eef..9bd44ff48cc0 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -414,9 +414,6 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
ASSERT(inode->i_mapping->a_ops->readpages);
 
/* calculate the shift required to use bmap */
-   if (inode->i_sb->s_blocksize > PAGE_SIZE)
-   goto enobufs;
-
shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
 
op->op.flags &= FSCACHE_OP_KEEP_FLAGS;
@@ -711,9 +708,6 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
ASSERT(inode->i_mapping->a_ops->readpages);
 
/* calculate the shift required to use bmap */
-   if (inode->i_sb->s_blocksize > PAGE_SIZE)
-   goto all_enobufs;
-
shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
 
pagevec_init(&pagevec, 0);


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH/RFC] fscache/cachefiles versus btrfs

2015-04-09 Thread NeilBrown

hi,
 fscache cannot currently be used with btrfs as the backing store for the
 cache (managed by cachefilesd).
 This is because cachefiles needs the ->bmap address_space_operation, and
 btrfs doesn't provide it.

 cachefiles only uses this to find out if a particular page is a 'hole' or
 not.  For btrfs, this can be done with 'SEEK_DATA'.

 Unfortunately it doesn't seem to be possible to query a filesystem or a file
 to see if SEEK_DATA is reliable or not, so we cannot simply use SEEK_DATA
 when reliable, else ->bmap if available.
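The reliability question aside, the SEEK_DATA probe itself is easy to exercise from user space. A minimal sketch (helper name hypothetical; assumes a filesystem that actually reports holes):

```c
#define _GNU_SOURCE	/* for SEEK_DATA */
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if the region starting at 'offset' is backed by data, 0 if
 * it lies in a hole (or SEEK_DATA is unsupported).  This mirrors what
 * cachefiles would do in-kernel with vfs_llseek(file, addr, SEEK_DATA):
 * the page has data iff seeking to the next data lands on 'offset'. */
static int region_has_data(int fd, off_t offset)
{
	off_t next = lseek(fd, offset, SEEK_DATA);
	if (next == (off_t)-1)
		return 0;	/* ENXIO: only holes from here to EOF */
	return next == offset;
}
```

Note this is exactly the ambiguity described above: a filesystem that implements SEEK_DATA naively reports every in-file offset as data, so the probe degrades to "always cached".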

 The following patch makes fscache work for me on btrfs.  It explicitly checks
 for BTRFS_SUPER_MAGIC.  Not really a nice solution, but all I could think of.

 Is there a better way?  Could a better way be created?  Maybe
 SEEK_DATA_RELIABLE ??

 Comments, suggestions welcome.


Also, if you do try to use fscache on btrfs with 3.19, then nothing gets
cached (as expected) and with a heavy load you can lose a race and get an
assertion failure in fscache_enqueue_operation

ASSERT(fscache_object_is_available(op->object));

It looks like the object is being killed before it is available...

[  859.700765] kernel BUG at ../fs/fscache/operation.c:38!
...
[  859.703124] Call Trace:
[  859.703193]  [a0448044] fscache_run_op.isra.4+0x34/0x80 [fscache]
[  859.703260]  [a0448160] fscache_start_operations+0xa0/0xf0 
[fscache]
[  859.703388]  [a0446cd8] fscache_kill_object+0x98/0xc0 [fscache]
[  859.703455]  [a04475c1] fscache_object_work_func+0x151/0x210 
[fscache]
[  859.703578]  [81078b07] process_one_work+0x147/0x3c0
[  859.703642]  [8107929c] worker_thread+0x20c/0x470

I haven't figured out the cause of that yet.


Thanks,
NeilBrown




diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 1e51714eb33e..1389d8483d5d 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -20,6 +20,7 @@
 #include <linux/namei.h>
 #include <linux/security.h>
 #include <linux/slab.h>
+#include <linux/magic.h>
 #include "internal.h"
 
 #define CACHEFILES_KEYBUF_SIZE 512
@@ -647,7 +648,8 @@ lookup_again:
 
ret = -EPERM;
aops = object->dentry->d_inode->i_mapping->a_ops;
-   if (!aops->bmap)
+   if (!aops->bmap &&
+   object->dentry->d_sb->s_magic != BTRFS_SUPER_MAGIC)
goto check_error;
 
object->backer = object->dentry;
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index c6cd8d7a4eef..49fb330c0ab8 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -410,11 +410,11 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
 
inode = object->backer->d_inode;
ASSERT(S_ISREG(inode->i_mode));
-   ASSERT(inode->i_mapping->a_ops->bmap);
ASSERT(inode->i_mapping->a_ops->readpages);
 
/* calculate the shift required to use bmap */
-   if (inode->i_sb->s_blocksize > PAGE_SIZE)
+   if (inode->i_mapping->a_ops->bmap &&
+   inode->i_sb->s_blocksize > PAGE_SIZE)
goto enobufs;
 
shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
@@ -423,20 +423,36 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
op->op.flags |= FSCACHE_OP_ASYNC;
op->op.processor = cachefiles_read_copier;
 
-   /* we assume the absence or presence of the first block is a good
-* enough indication for the page as a whole
-* - TODO: don't use bmap() for this as it is _not_ actually good
-*   enough for this as it doesn't indicate errors, but it's all we've
-*   got for the moment
-*/
-   block0 = page->index;
-   block0 <<= shift;
-
-   block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
-   _debug("%llx -> %llx",
-  (unsigned long long) block0,
-  (unsigned long long) block);
+   if (inode->i_mapping->a_ops->bmap) {
+   /* we assume the absence or presence of the first block is a good
+* enough indication for the page as a whole
+* - TODO: don't use bmap() for this as it is _not_ actually good
+*   enough for this as it doesn't indicate errors, but it's all we've
+*   got for the moment
+*/
+   block0 = page->index;
+   block0 <<= shift;
 
+   block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
+   _debug("%llx -> %llx",
+  (unsigned long long) block0,
+  (unsigned long long) block);
+   } else {
+   /* Use llseek */
+   struct path path;
+   struct file *file;
+   path.mnt = cache->mnt;
+   path.dentry = object->backer;
+   file = dentry_open(&path, O_RDONLY, cache->cache_cred);
+   if (IS_ERR(file))
+   goto enobufs;
+   block = vfs_llseek(file

Re: __might_sleep() warnings on v3.19-rc6

2015-02-01 Thread NeilBrown

(high-jacking the thread a bit... I don't have the patch that I want to reply
to still in my mail box:  the subject still matches...)

I just got a might-sleep warning in my own testing.
This was introduced by 

commit e22b886a8a43b147e1994a9f970f678fc0df2033
Author: Peter Zijlstra pet...@infradead.org
Date:   Wed Sep 24 10:18:48 2014 +0200

sched/wait: Add might_sleep() checks


In particular:

@@ -318,6 +320,7 @@ do {
  */
 #define wait_event_cmd(wq, condition, cmd1, cmd2)  \
 do {   \
+   might_sleep();  \
if (condition)  \
break;  \
__wait_event_cmd(wq, condition, cmd1, cmd2);\


Where I call this in raid5_quiesce(), 'cmd1' releases a lock and enables
interrupts and  cmd2 takes the lock and disables interrupts.

So it is perfectly OK to sleep at the point where schedule is called, but not
at the point where wait_event_cmd is called.

I can't use wait_event_lock_irq_cmd() as there are actually several spinlocks
I need to manipulate.

So I'm hoping that this part of the patch (at least) can be reverted.

Otherwise I guess I'll need to use __wait_event_cmd().

Thanks,
NeilBrown


pgpQc81MRD5Ju.pgp
Description: OpenPGP digital signature


Re: __might_sleep() warnings on v3.19-rc6

2015-02-01 Thread NeilBrown
On Sun, 1 Feb 2015 21:08:12 -0800 Linus Torvalds
torva...@linux-foundation.org wrote:

 On Sun, Feb 1, 2015 at 3:03 PM, NeilBrown ne...@suse.de wrote:
 
  I guess I could
__set_current_state(TASK_RUNNING);
  somewhere to defeat the warning, and add a comment explaining why.
 
  Would that be a good thing?
 
 Use sched_annotate_sleep() instead, but yes, add a comment about why it's 
 ok.
 
 Linus
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

OK - following patch is queued to appear in a pull request tomorrow  (I hope).

Thanks,
NeilBrown

From: NeilBrown ne...@suse.de
Date: Mon, 2 Feb 2015 17:08:03 +1100
Subject: [PATCH] md/bitmap: fix a might_sleep() warning.

commit 8eb23b9f35aae413140d3fda766a98092c21e9b0
sched: Debug nested sleeps

causes false-positive warnings in RAID5 code.

This annotation removes them and adds a comment
explaining why there is no real problem.

Reported-by: Fengguang Wu fengguang...@intel.com
Signed-off-by: NeilBrown ne...@suse.de

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index da3604e73e8a..1695ee5f3ffc 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -72,6 +72,19 @@ __acquires(bitmap->lock)
/* this page has not been allocated yet */
 
spin_unlock_irq(&bitmap->lock);
+   /* It is possible that this is being called inside a
+* prepare_to_wait/finish_wait loop from raid5c:make_request().
+* In general it is not permitted to sleep in that context as it
+* can cause the loop to spin freely.
+* That doesn't apply here as we can only reach this point
+* once with any loop.
+* When this function completes, either bp[page].map or
+* bp[page].hijacked.  In either case, this function will
+* abort before getting to this point again.  So there is
+* no risk of a free-spin, and so it is safe to assert
+* that sleeping here is allowed.
+*/
+   sched_annotate_sleep();
mappage = kzalloc(PAGE_SIZE, GFP_NOIO);
spin_lock_irq(&bitmap->lock);
 


pgpVKI1XPTGam.pgp
Description: OpenPGP digital signature


Re: ext4 vs btrfs performance on SSD array

2014-09-02 Thread NeilBrown
On Mon, 1 Sep 2014 18:22:22 -0700 Christoph Hellwig h...@infradead.org wrote:

 On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
  Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs, ext4
  and XFS are doing 128k IOs because that's the default block
  device readahead size.  'blockdev --setra 1024 /dev/sdd' before
  mounting the filesystem will probably fix it.
 
 Btw, it's really getting time to make Linux storage fs work out of the
 box.  There's way too many things that are stupid by default and we
 require everyone to fix up manually:
 
  - the ridiculously low max_sectors default
  - the very small max readahead size
  - replacing cfq with deadline (or noop)
  - the too small RAID5 stripe cache size
 
 and probably a few I forgot about.  It's time to make things perform
 well out of the box..

Do we still need maximums at all?
There was a time when the queue limit in the block device (or bdi) was an
important part of the write throttle strategy.  Without a queue limit, all of
memory could be consumed by pages in write-back, all queued for some device.
This wasn't healthy.

But since then the write throttling has been completely re-written.  I'm not
certain (and should check) but I suspect it doesn't depend on submit_bio
blocking when the queue is full any more.

So can we just remove the limit on max_sectors and the RAID5 stripe cache
size?  I'm certainly keen to remove the latter and just use a mempool if the
limit isn't needed.
I have seen reports that a very large raid5 stripe cache size can cause
a reduction in performance.  I don't know why but I suspect it is a bug that
should be found and fixed.

Do we need max_sectors ??

NeilBrown


signature.asc
Description: PGP signature


Re: [RFC] lib: raid: New RAID library supporting up to six parities

2014-01-06 Thread NeilBrown
On Mon, 6 Jan 2014 10:45:23 +0100 Andrea Mazzoleni amadva...@gmail.com
wrote:

 Hi Neil,
 
 Thanks for your feedback. In the meantime I went further in developing and
 I've just sent version 2 of the patch, that contains a preliminary btrfs
 modification to use the new interface.
 
 Please use this one for any kind of review because it contains a modification
 of the interface to match better the btrfs use.
 I'll now try to do something similar to the async_tx layer and to improve the
 code documentation as you recommended.

Thanks.

 
 Anyway, a good entry point to understand the code is to start from the
 include/linux/raid/raid.h file. It contains the functions that external
 modules should call with a complete description of them.
 
 There is raid_par() used to compute parity, and raid_rec() to recover damaged
 blocks. These two functions replace all the old xor_blocks and raid6 calls.
 
 And there is the raid_sort() you mention. It's a helper function that can be
 used to ensure that the block indexes are passed to the raid interface in
 proper order. In existing code I saw that the indexes are often sorted before
 calling raid6, with something like:
 
   if (faila  failb) {
   int tmp = failb;
   failb = faila;
   faila = tmp;
   }
 
 To do the same with up to six failures, it's now required some kind of sort
 function.

I'm not totally convinced by this, but then I haven't played with the code
so maybe I'm wrong.
I don't see the above as sorting faila and failb, but rather determining
which one is first.  Once you know which one is first, the remainder follow
in order.
So I would probably just make sure we always process the blocks in the right
order.  Then sorting would be irrelevant.
But as I say, I haven't fiddled with the code, so maybe that would end up
being more complex.
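Either way, generalizing the quoted two-element swap to up to six failed indices is small; a sketch (names hypothetical, assuming an int array of failed-disk indices):

```c
/* Insertion sort of up to six failed-disk indices, ascending: the
 * n == 2 case degenerates to the faila/failb swap quoted above. */
static void sort_fail_index(int *fail, int n)
{
	for (int i = 1; i < n; i++) {
		int v = fail[i];
		int j = i - 1;
		while (j >= 0 && fail[j] > v) {
			fail[j + 1] = fail[j];
			j--;
		}
		fail[j + 1] = v;
	}
}
```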

Thanks,
NeilBrown



signature.asc
Description: PGP signature


Re: Triple parity and beyond

2013-11-22 Thread NeilBrown
On Fri, 22 Nov 2013 10:07:09 -0600 Stan Hoeppner s...@hardwarefreak.com
wrote:

 On 11/22/2013 2:13 AM, Stan Hoeppner wrote:
  Hi David,
  
  On 11/21/2013 3:07 AM, David Brown wrote:
 ...
  I don't see that there needs to be any changes to the existing md code
  to make raid15 work - it is merely a raid 5 made from a set of raid1
  pairs.  
  
  The sole purpose of the parity layer of the proposed RAID 15 is to
  replace sectors lost due to UREs during rebuild.  AFAIK the current RAID
  5 and RAID 1 drivers have no code to support each other in this manner.
 
 Minor self correction here-- obviously this isn't the 'sole' purpose of
 the parity layer.  It also allows us to recover from losing an entire
 mirror, which is a big upshot of the proposed RAID 15.  Thinking this
 through a little further, more code modification would be needed for
 this scenario.
 
 In the event of a double drive failure in one mirror, the RAID 1 code
 will need to be modified in such a way as to allow the RAID 5 code to
 rebuild the first replacement disk, because the RAID 1 device is still
 in a failed state.  Once this rebuild is complete, the RAID 1 code will
 need to switch the state to degraded, and then do its standard rebuild
 routine for the 2nd replacement drive.
 
 Or, with some (likely major) hacking it should be possible to rebuild
 both drives simultaneously for no loss of throughput or additional
 elapsed time on the RAID 5 rebuild. 

Nah, that would be minor hacking.  Just recreate the RAID1 in a state that is
not-insync, but with automatic-resync disabled.
Then as continuous writes arrive, move the recovery_cp variable forward
towards the end of the array.  When it reaches the end we can safely mark the
whole array as 'in-sync' and forget about disabling auto-resync.

NeilBrown



 In the 20TB drive case, this would
 shave 18 hours off the total rebuild operation elapsed time.  With
 current 4TB drives it would still save 6.5 hours.  Losing both drives in
 one mirror set of a striped array is rare, but given the rebuild time
 saved it may be worth investigating during any development of this RAID
 15 idea.
 



signature.asc
Description: PGP signature


Re: Triple parity and beyond

2013-11-22 Thread NeilBrown
On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner s...@hardwarefreak.com
wrote:

 On 11/21/2013 1:05 AM, John Williams wrote:
  On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner s...@hardwarefreak.com 
  wrote:
  On 11/20/2013 8:46 PM, John Williams wrote:
  For myself or any machines I managed for work that do not need high
  IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
  similar schemes with arrays of 16 - 32 drives.
 
  You must see a week long rebuild as acceptable...
  
  It would not be a problem if it did take that long, since I would have
  extra parity units as backup in case of a failure during a rebuild.
  
  But of course it would not take that long. Take, for example, a 24 x
  3TB triple-parity array (21+3) that has had two drive failures
  (perhaps the rebuild started with one failure, but there was soon
  another failure). I would expect the rebuild to take about a day.
 
 You're looking at today.  We're discussing tomorrow's needs.  Today's
 6TB 3.5 drives have sustained average throughput of ~175MB/s.
 Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
 previously, at that rate a straight disk-disk copy of a 20TB drive takes
 18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
 rebuilding a failed drive in a 3P array of say 8 of these disks will
 likely take at least 3 times as long, 2 days 6 hours minimum, probably
 more.  This may be perfectly acceptable to some, but probably not to all.

Could you explain your logic here?  Why do you think rebuilding parity
will take 3 times as long as rebuilding a copy?  Can you measure that sort of
difference today?
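For reference, the copy-time figure quoted above is just capacity over sustained rate; the question at issue is only the multiplier parity adds on top of it. A sketch (decimal TB and MB assumed):

```c
/* Hours for a straight disk-to-disk copy at a sustained rate:
 * 20 TB at 300 MB/s comes to ~18.5 h, the figure quoted above. */
static double copy_hours(double capacity_tb, double rate_mb_s)
{
	return capacity_tb * 1e12 / (rate_mb_s * 1e6) / 3600.0;
}
```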

Presumably when we have 20TB drives we will also have more cores and quite
possibly dedicated co-processors which will make the CPU load less
significant.

NeilBrown


signature.asc
Description: PGP signature


Re: Triple parity and beyond

2013-11-22 Thread NeilBrown
On Fri, 22 Nov 2013 21:46:50 -0600 Stan Hoeppner s...@hardwarefreak.com
wrote:

 On 11/22/2013 5:07 PM, NeilBrown wrote:
  On Thu, 21 Nov 2013 16:57:48 -0600 Stan Hoeppner s...@hardwarefreak.com
  wrote:
  
  On 11/21/2013 1:05 AM, John Williams wrote:
  On Wed, Nov 20, 2013 at 10:52 PM, Stan Hoeppner s...@hardwarefreak.com 
  wrote:
  On 11/20/2013 8:46 PM, John Williams wrote:
  For myself or any machines I managed for work that do not need high
  IOPS, I would definitely choose triple- or quad-parity over RAID 51 or
  similar schemes with arrays of 16 - 32 drives.
 
  You must see a week long rebuild as acceptable...
 
  It would not be a problem if it did take that long, since I would have
  extra parity units as backup in case of a failure during a rebuild.
 
  But of course it would not take that long. Take, for example, a 24 x
  3TB triple-parity array (21+3) that has had two drive failures
  (perhaps the rebuild started with one failure, but there was soon
  another failure). I would expect the rebuild to take about a day.
 
  You're looking at today.  We're discussing tomorrow's needs.  Today's
  6TB 3.5 drives have sustained average throughput of ~175MB/s.
  Tomorrow's 20TB drives will be lucky to do 300MB/s.  As I said
  previously, at that rate a straight disk-disk copy of a 20TB drive takes
  18.6 hours.  This is what you get with RAID1/10/51.  In the real world,
  rebuilding a failed drive in a 3P array of say 8 of these disks will
  likely take at least 3 times as long, 2 days 6 hours minimum, probably
  more.  This may be perfectly acceptable to some, but probably not to all.
  
  Could you explain your logic here?  Why do you think rebuilding parity
  will take 3 times as long as rebuilding a copy?  Can you measure that sort 
  of
  difference today?
 
 I've not performed head-to-head timed rebuild tests of mirror vs parity
 RAIDs.  I'm making the elapsed guess for parity RAIDs based on posts
 here over the past ~3 years, in which many users reported 16-24+ hour
 rebuild times for their fairly wide (12-16 1-2TB drive) RAID6 arrays.

I guess with that many drives you could hit PCI bus throughput limits.

A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices.  So
you would really need top-end hardware to keep all of 16 drives busy in a
recovery.
So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
a 20 drive RAID10.

 
 This is likely due to their chosen rebuild priority and concurrent user
 load during rebuild.  Since this seems to be the norm, instead of giving
 100% to the rebuild, I thought it prudent to take this into account,
 instead of the theoretical minimum rebuild time.
 
  Presumably when we have 20TB drives we will also have more cores and quite
  possibly dedicated co-processors which will make the CPU load less
  significant.
 
 But (when) will we have the code to fully take advantage of these?  It's
 nearly 2014 and we still don't have a working threaded write model for
 levels 5/6/10, though maybe soon.  Multi-core mainstream x86 CPUs have
 been around for 8 years now, SMP and ccNUMA systems even longer.  So the
 need has been there for a while.

I think we might have that multi-threading now - not sure exactly what is
enabled by default though.

I think it requires more than need - it requires demand.  i.e. people
repeatedly expressing the need.  We certainly have had that for a while, but
not a very long while


 
 I'm strictly making an observation (possibly not fully accurate) here.
 I am not casting stones.  I'm not a programmer and am thus unable to
 contribute code, only ideas and troubleshooting assistance for fellow
 users.  Ergo I have no right/standing to complain about the rate of
 feature progress.  I know that everyone hacking md is making the most of
 the time they have available.  So again, not a complaint, just an
 observation.

Understood - and thanks for your observation.

NeilBrown



signature.asc
Description: PGP signature


Re: Triple parity and beyond

2013-11-22 Thread NeilBrown
On Fri, 22 Nov 2013 21:34:41 -0800 John Williams jwilliams4...@gmail.com
wrote:

 On Fri, Nov 22, 2013 at 9:04 PM, NeilBrown ne...@suse.de wrote:
 
  I guess with that many drives you could hit PCI bus throughput limits.
 
  A 16-lane PCIe 4.0 could just about give 100MB/s to each of 16 devices.  So
  you would really need top-end hardware to keep all of 16 drives busy in a
  recovery.
  So yes: rebuilding a drive in a 16-drive RAID6+ would be slower than in e.g.
  a 20 drive RAID10.
 
 Not really. A single 8x PCIe 2.0 card has 8 x 500MB/s = 4000MB/s of
 potential bandwidth. That would be 250MB/s per drive for 16 drives.
 
 But quite a few people running software RAID with many drives have
 multiple PCIe cards. For example, in one machine I have three IBM
 M1015 cards (which I got for $75/ea) that are 8x PCIe 2.0. That comes
 to 3 x 500MB/s x 8 = 12GB/s of IO bandwidth.
 
 Also, your math is wrong. PCIe 3.0 is 985 MB/s per lane. If we assume
 PCIe 4.0 would double that, we would have 1970MB/s per lane. So one
 lane of the hypothetical PCIe 4.0 would have enough IO bandwidth to
 give about 120MB/s to each of 16 drives. A single 8x PCIe 4.0 card
 would have 8 times that capability which is more than 15GB/s.

It wasn't my math, it was my reading :-(
16-lane PCIe 4.0 is 31 GB/sec so 2GB/sec per drive.  I was reading the
1-lane number...

 
 Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
 
 Bottom line is that IO bandwidth is not a problem for a system with
 prudently chosen hardware.
 
 More likely is that you would be CPU limited (rather than bus limited)
 in a high-parity rebuild where more than one drive failed. But even
 that is not likely to be too bad, since Andrea's single-threaded
 recovery code can recover two drives at nearly 1GB/s on one of my
 machines. I think the code could probably be threaded to achieve a
 multiple of that running on multiple cores.

Indeed.  It seems likely that with modern hardware, the  linear write speed
would be the limiting factor for spinning-rust drives.
For SSDs the limit might end up being somewhere else ...
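The bus-bandwidth side of this exchange reduces to a single expression; a sketch of the per-drive share (decimal MB assumed):

```c
/* Per-drive share of host-bus bandwidth: lanes x per-lane rate over the
 * drive count.  8 lanes of PCIe 2.0 (500 MB/s each) across 16 drives
 * gives 250 MB/s per drive, as computed in the thread. */
static double mb_per_s_per_drive(int lanes, double lane_mb_s, int drives)
{
	return lanes * lane_mb_s / drives;
}
```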

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: raid10 make_request failure during iozone benchmark upon btrfs

2012-07-02 Thread NeilBrown
On Tue, 03 Jul 2012 03:13:33 +0100 Kerin Millar kerfra...@gmail.com wrote:

 Hi,
 
 On 03/07/2012 02:39, NeilBrown wrote:
 
 [snip]
 
   Could you please double check that you are running a kernel with
  
   commit aba336bd1d46d6b0404b06f6915ed76150739057
Author: NeilBrown ne...@suse.de
   Date:   Thu May 31 15:39:11 2012 +1000
  
 md: raid1/raid10: fix problem with merge_bvec_fn
  
   in it?
  
   I am indeed. I searched the list beforehand and noticed the patch in
   question. Not sure which -rc it landed in but I checked my source tree
   and it's definitely in there.
  
   Cheers,
  
   --Kerin
  
   Thanks.
   Looking at it again I see that it is definitely a different bug, that patch
   wouldn't affect it.
  
   But I cannot see what could possibly be causing the problem.
   You have a 256K chunk size, so requests should be limited to 512 sectors
   aligned at a 512-sector boundary.
   However all the requests that a causing errors are 512 sectors long, but
   aligned on a 256-sector boundary (which is not also 512-sector).  This is
   wrong.
 
 I see.
 
  
   It could be that btrfs is submitting bad requests, but I think it always 
 uses
   bio_add_page, and bio_add_page appears to do the right thing.
   It could be that dm-linear is causing problems, but it seems to correctly ask
   the underlying device for alignment, and reports that alignment to
   bio_add_page.
   It could be that md/raid10 is the problem but I cannot find any fault in
   raid10_mergeable_bvec - performs much the same tests that the
   raid10 make_request function does.
  
   So it is a mystery.
  
   Is this failure repeatable?
 
 Yes, it's reproducible with 100% consistency. Furthermore, I tried to
 use the btrfs volume as a store for the package manager, so as to try
 with a 'realistic' workload. Many of these errors were triggered
 immediately upon invoking the package manager. In case it matters, the
 package manager is portage (in Gentoo Linux) and the directory structure
 entails a shallow directory depth with a large number of distributed
 small files. I haven't been able to reproduce with xfs, ext4 or reiserfs.
 
  
   If so, could you please insert
   WARN_ON_ONCE(1);
   in drivers/md/raid10.c where it prints out the message: just after the
   bad_map: label.
  
   Also, in raid10_mergeable_bvec, insert
   WARN_ON_ONCE(max < 0);
   just before
  if (max < 0)
  /* bio_add cannot handle a negative return */
  max = 0;
  
   and then see if either of those generate a warning, and post the full stack
   trace  if they do.
 
 OK. I ran iozone again on a fresh filesystem, mounted with the default
 options. Here's the trace that appears, just before the first
 make_request_bug message:
 
 WARNING: at drivers/md/raid10.c:1094 make_request+0xda5/0xe20()
 Hardware name: ProLiant MicroServer
 Modules linked in: btrfs zlib_deflate lzo_compress kvm_amd kvm sp5100_tco 
 i2c_piix4
 Pid: 1031, comm: btrfs-submit-1 Not tainted 3.5.0-rc5 #3
 Call Trace:
 [81031987] ? warn_slowpath_common+0x67/0xa0
 [81442b45] ? make_request+0xda5/0xe20
 [81460b34] ? __split_and_process_bio+0x2d4/0x600
 [81063429] ? set_next_entity+0x29/0x60
 [810652c3] ? pick_next_task_fair+0x63/0x140
 [81450b7f] ? md_make_request+0xbf/0x1e0
 [8123d12f] ? generic_make_request+0xaf/0xe0
 [8123d1c3] ? submit_bio+0x63/0xe0
 [81040abd] ? try_to_del_timer_sync+0x7d/0x120
 [a016839a] ? run_scheduled_bios+0x23a/0x520 [btrfs]
 [a0170e40] ? worker_loop+0x120/0x520 [btrfs]
 [a0170d20] ? btrfs_queue_worker+0x2e0/0x2e0 [btrfs]
 [810520c5] ? kthread+0x85/0xa0
 [815441f4] ? kernel_thread_helper+0x4/0x10
 [81052040] ? kthread_freezable_should_stop+0x60/0x60
 [815441f0] ? gs_change+0xb/0xb
 
 Cheers,
 
 --Kerin

Thanks.  Looks like it is a btrfs bug - so a big hello to linux-btrfs :-)

The symptom is that iozone on btrfs on md/raid10 can result in

[  919.893454] md/raid10:md0: make_request bug: can't convert block across 
chunks or bigger than 256k 6653500160 256
[  919.893465] btrfs: bdev /dev/mapper/vg0-test errs: wr 1, rd 0, flush 0, 
corrupt 0, gen 0


i.e. RAID10 has a 256K chunk size, but is getting 256K requests which overlap
two chunks - the last half of one chunk and the first half of the next.
That isn't allowed and raid10_mergeable_bvec, called by bio_add_page, should
prevent it.
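The alignment rule being violated can be stated in a few lines; a sketch of the check (names hypothetical):

```c
#include <stdbool.h>

/* A request must not cross a chunk boundary: its first and last sectors
 * must fall in the same chunk.  With 256 KiB chunks (512 sectors), a
 * 512-sector request aligned on only a 256-sector boundary fails this
 * test -- the exact shape of the requests in the errors above. */
static bool fits_in_chunk(unsigned long long sector, unsigned int sectors,
			  unsigned int chunk_sectors)
{
	return sector / chunk_sectors == (sector + sectors - 1) / chunk_sectors;
}
```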

However btrfs_map_bio() sets ->bi_sector to a new value without verifying
that the resulting bio is still acceptable - which it isn't.

The core problem is that you cannot build a bio for one location, then use it
freely at another location.
md/raid1 handles this by checking each addition to a bio against all the
possible locations that it might read/write it.  Maybe btrfs could do the
same.
Alternately we could work with Kent Overstreet (of bcache fame) to remove the
restriction that the fs must make

Re: Mis-Design of Btrfs?

2011-07-15 Thread NeilBrown
On Thu, 14 Jul 2011 21:58:46 -0700 (PDT) da...@lang.hm wrote:

 On Thu, 14 Jul 2011, Chris Mason wrote:
 
  Excerpts from Ric Wheeler's message of 2011-07-14 02:57:54 -0400:
  On 07/14/2011 07:38 AM, NeilBrown wrote:
  On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com 
  wrote:
 
  I'm certainly open to suggestions and collaboration.  Do you have in 
  mind any
  particular way to make the interface richer??
 
  NeilBrown
  Hi Neil,
 
  I know that Chris has a very specific set of use cases for btrfs and 
  think that
  Alasdair and others have started to look at what is doable.
 
  The obvious use case is the following:
 
  If a file system uses checksumming or other data corruption detection 
  bits, it
  can detect that it got bad data on a write. If that data was protected 
  by RAID,
  it would like to ask the block layer to try to read from another mirror 
  (for
  raid1) or try to validate/rebuild from parity.
 
  Today, I think that a retry will basically just give us back a random 
  chance of
  getting data from a different mirror or the same one that we got data 
  from on
  the first go.
 
  Chris, Alasdair, was that a good summary of one concern?
 
  Thanks!
 
  Ric
  I imagine a new field in 'struct bio' which was normally zero but could be
  some small integer.  It is only meaningful for read.
  When 0 it means "get this data any way you like".
  When non-zero it means "get this data using method N", where the different
  methods are up to the device.
 
  For a mirrored RAID, method N means read from device N-1.
  For stripe/parity RAID, method 1 means use other data blocks and parity
  blocks to reconstruct data.
 
  The default for non RAID devices is to return EINVAL for any N > 0.
  A remapping device (dm-linear, dm-stripe etc) would just pass the number
  down.  I'm not sure how RAID1 over RAID5 would handle it... that might 
  need
  some thought.
 
  So if btrfs reads a block and the checksum looks wrong, it reads again 
  with
  a larger N.  It continues incrementing N and retrying until it gets a 
  block
  that it likes or it gets EINVAL.  There should probably be an error code
  (EAGAIN?) which means "I cannot work with that number, but try the next
  one".
 
  It would be trivial for me to implement this for RAID1 and RAID10, and
  relatively easy for RAID5.
  I'd need to give a bit of thought to RAID6 as there are possibly multiple
  ways to reconstruct from different combinations of parity and data.  I'm 
  not
  sure if there would be much point in doing that though.
 
  It might make sense for a device to be able to report what the maximum
  'N' supported is... that might make stacked raid easier to manage...
 
  NeilBrown
 
 
  I think that the above makes sense. Not sure what your 0 definition is, 
  but I
  would assume that for non-raided devices (i.e., a single s-ata disk), 0 
  would
  be an indication that there is nothing more that can be tried. The data 
  you got
  is all you are ever going to get :)
 
  For multiple mirrors, you might want to have a scheme where you would be 
  able to
  cycle through the mirrors. You could retry, cycling through the various 
  mirrors
  until you have tried and returned them all at which point you would get a
  "no more" error back or some such thing.
 
  Hi everyone,
 
  The mirror number idea is basically what btrfs does today, and I think
  it fits in with Neil's idea to have different policies for different
  blocks.
 
  Basically what btrfs does:
 
  read_block(block_num, mirror = 0)
  if (no io error and not csum error)
  horray()
 
  num_mirrors = get_num_copies(block number)
  for (i = 1; i < num_mirrors; i++) {
  read_block(block_num, mirror = i);
  }
 
  In a stacked configuration, the get_num_copies function can be smarter,
  basically adding up all the copies of the lower levels and finding a way
  to combine them.  We could just send down a fake bio that is responsible
  for adding up the storage combinations into a bitmask or whatever works.
 
   We could also just keep retrying until the lower layers reported no more
   mirrors were available.
 
  In btrfs at least, we don't set the data blocks up to date until the crc
  has passed, so replacing bogus blocks is easy.  Metadata is a little
  more complex, but that's not really related to this topic.
 
  mirror number 0 just means no mirror preference/pick the fastest
  mirror to the btrfs block mapping code.
 
  Btrfs has the concept of different raid levels for different logical
  block numbers, so you get_num_copies might return one answer for block A
  and a different answer for block B.
 
  Either way, we could easily make use of a bio field here if it were
  exported out.
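The stacking idea above (a fake bio that adds up the storage combinations of the lower layers) could be sketched like this. This is only an illustration; struct layer and num_copies() are invented names, not real kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

/* Each layer in a stack multiplies the copies it provides locally by
 * the number of ways its lower layer can serve the same block.
 * Invented for illustration; not the real block-layer interface. */
struct layer {
	struct layer *lower;    /* next layer down, NULL for a raw disk */
	int own_copies;         /* independent copies this layer provides */
};

static int num_copies(const struct layer *l)
{
	if (!l->lower)
		return l->own_copies;   /* raw device: typically 1 */
	/* every local copy can itself be read own_copies ways below */
	return l->own_copies * num_copies(l->lower);
}
```

So a two-way mirror stacked on another two-way mirror would report four ways to read a block, which is the "adding up the storage combinations" Chris describes.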
 
 you don't want to just pass the value down as that will cause problems 
 with layering (especially if the lower layer supports more values than a 
 higher layer)
 
 I would suggest that each layer take the value it's given, do an integer 
 division by the number of values
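One way to read the (truncated) suggestion above is that each layer keeps the remainder of the value modulo its own method count and passes the integer quotient down, so one flat number can address a strategy at every level of a stack. A hedged sketch, with invented names:

```c
#include <assert.h>
#include <stddef.h>

/* Purely illustrative: struct stacked and split_method() are made up
 * to show the division scheme, not a real kernel interface. */
struct stacked {
	int own_methods;        /* strategies this layer offers locally */
	struct stacked *lower;  /* NULL for the bottom device */
};

/* Split a flat method number: return this layer's local strategy
 * index and store the value to hand down in *pass_down. */
static int split_method(const struct stacked *l, int n, int *pass_down)
{
	*pass_down = n / l->own_methods;
	return n % l->own_methods;
}
```

With a two-method layer over a three-method layer, value 5 would select local method 1 and forward 2 to the layer below, avoiding the layering problem of passing the raw value straight down.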

Re: Mis-Design of Btrfs?

2011-07-14 Thread NeilBrown
On Thu, 14 Jul 2011 07:02:22 +0100 Ric Wheeler rwhee...@redhat.com wrote:

  I'm certainly open to suggestions and collaboration.  Do you have in mind
  any particular way to make the interface richer??
 
  NeilBrown
 
 Hi Neil,
 
 I know that Chris has a very specific set of use cases for btrfs and think
 that Alasdair and others have started to look at what is doable.
 
 The obvious use case is the following:
 
 If a file system uses checksumming or other data corruption detection bits,
 it can detect that it got bad data on a write. If that data was protected by
 RAID, it would like to ask the block layer to try to read from another mirror
 (for raid1) or try to validate/rebuild from parity.
 
 Today, I think that a retry will basically just give us back a random chance
 of getting data from a different mirror or the same one that we got data from
 on the first go.
 
 Chris, Alasdair, was that a good summary of one concern?
 
 Thanks!
 
 Ric

I imagine a new field in 'struct bio' which was normally zero but could be
some small integer.  It is only meaningful for read.
When 0 it means get this data any way you like.
When non-zero it means get this data using method N, where the different
methods are up to the device.

For a mirrored RAID, method N means read from device N-1.
For stripe/parity RAID, method 1 means use other data blocks and parity
blocks to reconstruct data.

The default for non-RAID devices is to return EINVAL for any N > 0.
A remapping device (dm-linear, dm-stripe etc) would just pass the number
down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
some thought.

So if btrfs reads a block and the checksum looks wrong, it reads again with
a larger N.  It continues incrementing N and retrying until it gets a block
that it likes or it gets EINVAL.  There should probably be an error code
(EAGAIN?) which means I cannot work with that number, but try the next one.
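The retry loop just described could look something like this from the filesystem's side. toy_read() and toy_checksum_ok() are stand-ins invented for the sketch, not real device or kernel functions; only the error-code contract (0 = device's choice, -EINVAL = no such strategy, -EAGAIN = skip this one) comes from the proposal above:

```c
#include <assert.h>
#include <errno.h>

/* Toy device: methods 0 and 1 return a corrupt block, method 2 is
 * unavailable, method 3 returns the good copy, anything above 3
 * does not exist.  Invented purely to exercise the loop. */
static int toy_read(unsigned long long block, char *buf, int method)
{
	(void)block;
	if (method > 3)
		return -EINVAL;
	if (method == 2)
		return -EAGAIN;
	buf[0] = (method == 3) ? 'G' : 'B';
	return 0;
}

static int toy_checksum_ok(const char *buf)
{
	return buf[0] == 'G';
}

/* Keep incrementing the method number until a copy passes the
 * checksum or the device says there are no more strategies. */
static int read_with_retries(unsigned long long block, char *buf)
{
	int n, err;

	for (n = 0; ; n++) {
		err = toy_read(block, buf, n);
		if (err == -EINVAL)
			return -EIO;    /* every strategy exhausted */
		if (err == -EAGAIN)
			continue;       /* strategy unusable, try the next */
		if (err == 0 && toy_checksum_ok(buf))
			return 0;       /* found a copy that checks out */
	}
}
```

Note how -EAGAIN lets the loop step past a strategy the device cannot satisfy without giving up on the later ones.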

It would be trivial for me to implement this for RAID1 and RAID10, and
relatively easy for RAID5.
I'd need to give a bit of thought to RAID6 as there are possibly multiple
ways to reconstruct from different combinations of parity and data.  I'm not
sure if there would be much point in doing that though.

It might make sense for a device to be able to report what the maximum
'N' supported is... that might make stacked raid easier to manage...

NeilBrown

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mis-Design of Btrfs?

2011-07-14 Thread NeilBrown
On Thu, 14 Jul 2011 11:37:41 +0200 Jan Schmidt list.bt...@jan-o-sch.net
wrote:

 Hi Neil,
 
 On 14.07.2011 08:38, NeilBrown wrote:
  I imagine a new field in 'struct bio' which was normally zero but could be
  some small integer.  It is only meaningful for read.
   When 0 it means get this data any way you like.
  When non-zero it means get this data using method N, where the different
  methods are up to the device.
  
  For a mirrored RAID, method N means read from device N-1.
  For stripe/parity RAID, method 1 means use other data blocks and parity
  blocks to reconstruct data.
  
   The default for non-RAID devices is to return EINVAL for any N > 0.
  A remapping device (dm-linear, dm-stripe etc) would just pass the number
  down.  I'm not sure how RAID1 over RAID5 would handle it... that might need
  some thought.
  
  So if btrfs reads a block and the checksum looks wrong, it reads again with
  a larger N.  It continues incrementing N and retrying until it gets a block
  that it likes or it gets EINVAL.  There should probably be an error code
   (EAGAIN?) which means I cannot work with that number, but try the next one.
 
 I like this idea. It comes pretty close to what btrfs is currently doing
 (what was the reason for this thread being started, wasn't it?), only
 not within struct bio.
 
  The way you describe the extra parameter is input only. I think it would
  be a nice add-on if we knew which N was used when 0 was passed for the
  request. At least for the RAID1 case, I'd like to see that when I submit
 a bio with method (or whatever we call it) 0, it comes back with the
 method field set to the value that reflects the method the device
 actually used. Obviously, when passing non-zero values, the bio would
 have to come back with that value unmodified.
 
 That way, we'll get more control on the retry algorithms and are free to
 decide not to use the failed method again, if we like.
 
 Setting method on the return path might be valuable not only for
  RAID1, but I haven't thought that through.
 
 -Jan

That sounds like it would be reasonable...

It would be important not to read too much into the number though.  i.e.
don't think of it as RAID1 but keep a much more abstract idea, else you
could run into trouble.

A (near) future addition to md is keeping track of 'bad blocks' so we can
fail more gracefully as devices start to fail.
So a read request to a RAID1 might not be served by just one device, but
might be served by one device for some parts, and another device for other
parts, so as to avoid one or more 'bad blocks'.

I think I could still provide a reasonably consistent mapping from 'arbitrary
number' to 'set of devices', but it may not be what you expect.  And the
number '1' may well correspond to different devices for different bi_sector
offsets.

i.e. the abstraction must allow the low level implementation substantial
freedom to choosing how to implement each request.

But yes - I don't see why we couldn't report which strategy was used with the
implication that using that same strategy at the same offset with the same
size would be expected to produce the same result.

NeilBrown
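The round trip agreed on here (submit with method 0, get back the method the device actually used; explicit nonzero requests come back unmodified) could be sketched as follows. struct toy_bio and toy_submit() are invented for illustration; this is not the real struct bio:

```c
#include <assert.h>

/* Invented stand-in for a bio carrying the proposed field:
 * read_method is an input (0 = device's choice, N > 0 = use
 * strategy N) and, on completion, reports the strategy used. */
struct toy_bio {
	unsigned long long sector;
	int read_method;   /* in: requested method; out: method used */
};

/* Pretend device that, given a free choice, serves the read via
 * strategy 2 and reports that; explicit requests are honoured
 * as-is and left unmodified. */
static int toy_submit(struct toy_bio *bio)
{
	if (bio->read_method == 0)
		bio->read_method = 2;   /* report the strategy we picked */
	return 0;
}
```

As Neil cautions above, the caller should treat the reported number as opaque: all it promises is that resubmitting with that same number, offset and size is expected to produce the same result.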


Re: Mis-Design of Btrfs?

2011-07-13 Thread NeilBrown
On Wed, 29 Jun 2011 10:29:53 +0100 Ric Wheeler rwhee...@redhat.com wrote:

 On 06/27/2011 07:46 AM, NeilBrown wrote:
  On Thu, 23 Jun 2011 12:53:37 +0200 Nico Schottelius
  nico-lkml-20110...@schottelius.org  wrote:
 
  Good morning devs,
 
  I'm wondering whether the raid- and volume-management-builtin of btrfs is
  actually a sane idea or not.
  Currently we do have md/device-mapper support for raid
  already, btrfs lacks raid5 support and re-implements stuff that
  has already been done.
 
  I'm aware of the fact that it is very useful to know on which devices
  we are in a filesystem. But I'm wondering, whether it wouldn't be
  smarter to generalise the information exposure through the VFS layer
  instead of replicating functionality:
 
   Physical:  USB-HD   SSD   USB-Flash     | Exposes information to
   Raid:      Raid1, Raid5, Raid10, etc.   | higher levels
   Crypto:    Luks                         |
   LVM:       Groups/Volumes               |
   FS:        xfs/jfs/reiser/ext3          v
 
   Thus a filesystem like ext3 could be aware that it is running
   on a USB HD, enable -o sync by default, or have the filesystem
   rewrite blocks when running on crypto or optimise for an SSD, ...
   I would certainly agree that exposing information to higher levels is a good
   idea.  To some extent we do.  But it isn't always as easy as it might sound.
   Choosing exactly what information to expose is the challenge.  If you lack
   sufficient foresight you might expose something which turns out to be very
   specific to just one device, so all those upper levels which make use of the
   information find they are really special-casing one specific device, which
   isn't a good idea.
 
 
  However it doesn't follow that RAID5 should not be implemented in BTRFS.
  The levels that you have drawn are just one perspective.  While that has
  value, it may not be universal.
  I could easily argue that the LVM layer is a mistake and that filesystems
  should provide that functionality directly.
  I could almost argue the same for crypto.
  RAID1 can make a lot of sense to be tightly integrated with the FS.
   RAID5 ... I'm less convinced, but then I have a vested interest there so
   that isn't an objective assessment.
 
  Part of the way Linux works is that s/he who writes the code gets to make
  the design decisions.   The BTRFS developers might create something truly
  awesome, or might end up having to support a RAID feature that they
  subsequently think is a bad idea.  But it really is their decision to make.
 
  NeilBrown
 
 
  One more thing to add here is that I think that we still have a chance to
  increase the sharing between btrfs and the MD stack if we can get those
  changes made. No one likes to duplicate code, but we will need a richer
  interface between the block and file system layer to help close that gap.
 
 Ric
 

I'm certainly open to suggestions and collaboration.  Do you have in mind any
particular way to make the interface richer??

NeilBrown