Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-05 Thread NeilBrown
On Wed, Apr 04 2018, Andreas Grünbacher wrote:

> Herbert Xu <herb...@gondor.apana.org.au> schrieb am Mi. 4. Apr. 2018 um
> 17:51:
>
>> On Wed, Apr 04, 2018 at 11:46:28AM -0400, Bob Peterson wrote:
>> >
>> > The patches look good. The big question is whether to add them to this
>> > merge window while it's still open. Opinions?
>>
>> We're still hashing out the rhashtable interface so I don't think now is
>> the time to rush things.
>
>
> Fair enough. No matter how rhashtable_walk_peek changes, we'll still need
> these two patches to fix the glock dump though.

Those two patches look fine to me and don't depend on changes to
rhashtable, so it is up to you when they go upstream.

However, I think the code can be substantially simplified, particularly
once we make rhashtable a little cleverer.
So this is what I'll probably be doing for a similar situation in
lustre.

Having examined seqfile closely, it is apparent that if ->start never
changes *ppos, and if ->next always increments it (except maybe on error)
then:

1/ ->next is only ever given a 'pos' that was returned by the previous
   call to ->start or ->next.  So it should *always* return the next
   object, after the one previously returned by ->start or ->next.  It
   never needs to 'seek'. The 'traverse()' function in seq_file.c does
   any seeking needed.  ->next needs to increase 'pos' and when walking
   a dynamic list, it is easiest if it just increments it.

2/ ->start is only called with a pos of:
   - 0: in this case it should rewind to the start.
   - the last pos passed to ->start or ->next: in this case it should
     return the same thing that was returned last time.  If it no
     longer exists, then the following one should be returned.
   - one more than the last pos passed to ->start or ->next: in this
     case it should return the object after the last one returned.
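
A minimal sketch of a ->start/->next pair following these rules (the
ex_* helpers and the 'last_pos' field are hypothetical, purely for
illustration):

static void *ex_seq_start(struct seq_file *seq, loff_t *ppos)
{
	struct ex_iter *it = seq->private;

	if (*ppos == 0)
		ex_rewind(it);			/* case 1: rewind */
	else if (*ppos == it->last_pos)
		/* case 2: same object again (or the following if gone) */
		return ex_current(it);

	/* case 1 continues here; case 3: object after the last one */
	it->last_pos = *ppos;
	return ex_next_obj(it);
}

static void *ex_seq_next(struct seq_file *seq, void *v, loff_t *ppos)
{
	struct ex_iter *it = seq->private;

	it->last_pos = ++*ppos;		/* always increment pos */
	return ex_next_obj(it);		/* never needs to seek */
}

Note that ->start never changes *ppos and ->next always increments it,
leaving any seeking to traverse() in seq_file.c.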

The proposed enhancement to rhashtable_walk* is to add a
rhashtable_walk_prev() which returns the previously returned object,
if it is still in the table, or NULL. It also enhances
rhashtable_walk_start() so that if the previously returned object is
still in the table, it is preserved as the current cursor.
This means that if you take some action to ensure that the
previously returned object remains in the table until the next ->start,
then you can reliably walk the table with no duplicates or omissions
(unless a concurrent rehash causes duplicates).
If you don't keep the object in the table and it gets removed, then
the 'skip' counter is used to find your place, and you might get
duplicates or omissions.
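
A hedged sketch of what ->start could then look like
(rhashtable_walk_prev() and the cursor-preserving ->start are the
proposal above, not the current API; the iterator struct is
hypothetical):

static void *ex_start(struct seq_file *seq, loff_t *ppos)
{
	struct ex_iter *it = seq->private;
	void *obj;

	rhashtable_walk_start(&it->hti);
	if (*ppos != 0 && *ppos == it->last_pos) {
		/* the previously returned object, if still in the table */
		obj = rhashtable_walk_prev(&it->hti);
		if (obj)
			return obj;
	}
	it->last_pos = *ppos;
	/* object gone or pos advanced: continue via the skip counter */
	return rhashtable_walk_next(&it->hti);
}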

NeilBrown




Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-02 Thread NeilBrown
On Fri, Mar 30 2018, Herbert Xu wrote:

> On Thu, Mar 29, 2018 at 06:52:34PM +0200, Andreas Gruenbacher wrote:
>>
>> Should rhashtable_walk_peek be kept around even if there are no more
>> users? I have my doubts.
>
> Absolutely.  All netlink dumps using rhashtable_walk_next are buggy
> and need to switch over to rhashtable_walk_peek.  As otherwise
> the object that triggers the out-of-space condition will be skipped
> upon resumption.

Do we really need a rhashtable_walk_peek() interface?
I imagine that a seqfile ->start function can do:

  if (*ppos == 0 && last_pos != 0) {
	  rhashtable_walk_exit(&iter);
	  rhashtable_walk_enter(&ht, &iter);
	  last_pos = 0;
  }
  rhashtable_walk_start(&iter);
  if (*ppos == last_pos && iter.p)
	  return iter.p;
  last_pos = *ppos;
  return rhashtable_walk_next(&iter);


and the ->next function just does

  last_pos = *ppos;
  *ppos += 1;
  do p = rhashtable_walk_next(&iter); while (IS_ERR(p));
  return p;

It might be OK to have a function call instead of expecting people to
use iter.p directly.

static inline void *rhashtable_walk_prev(struct rhashtable_iter *iter)
{
	return iter->p;
}

Thoughts?

Thanks,
NeilBrown




Re: [Cluster-devel] [PATCH] gfs2: Stop using rhashtable_walk_peek

2018-03-29 Thread NeilBrown
On Wed, Mar 28 2018, Andreas Gruenbacher wrote:

> Function rhashtable_walk_peek is problematic because there is no
> guarantee that the glock previously returned still exists; when that key
> is deleted, rhashtable_walk_peek can end up returning a different key,
> which would cause an inconsistent glock dump.  So instead of using
> rhashtable_walk_peek, keep track of the current glock in the seq file
> iterator functions.
>
> Signed-off-by: Andreas Gruenbacher <agrue...@redhat.com>
> ---
>  fs/gfs2/glock.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 82fb5583445c..f1fc353875d3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -55,6 +55,7 @@ struct gfs2_glock_iter {
>   struct gfs2_sbd *sdp;   /* incore superblock   */
>   struct rhashtable_iter hti; /* rhashtable iterator */
>   struct gfs2_glock *gl;  /* current glock struct*/
> + bool gl_held;
>   loff_t last_pos;/* last position   */
>  };
>  
> @@ -1923,9 +1924,11 @@ void gfs2_glock_exit(void)
>  
>  static void gfs2_glock_iter_next(struct gfs2_glock_iter *gi, loff_t n)
>  {
> - if (n == 0)
> - gi->gl = rhashtable_walk_peek(&gi->hti);
> - else {
> + if (n != 0 || !gi->gl) {
> + if (gi->gl_held) {
> + gfs2_glock_queue_put(gi->gl);
> + gi->gl_held = false;
> + }
>   gi->gl = rhashtable_walk_next(&gi->hti);
>   n--;
>   }

Thanks for this patch!
The above looks a bit fragile to me.
gfs2_glock_iter_next() (and hence gfs2_glock_seq_start()) will sometimes
exit with gl_held true, and sometimes with it false.
gfs2_glock_seq_stop() assumes that it is false.
Normally gfs2_glock_seq_next() will be called between these
two and will clear gl_held, but I don't think there is a hard guarantee
of that.
Maybe we should always 'put' gi->gl in iter_next if gl_held??

Thanks,
NeilBrown



> @@ -1988,7 +1991,10 @@ static void gfs2_glock_seq_stop(struct seq_file *seq, void *iter_ptr)
>  {
>   struct gfs2_glock_iter *gi = seq->private;
>  
> - gi->gl = NULL;
> + if (gi->gl) {
> + lockref_get(&gi->gl->gl_lockref);
> + gi->gl_held = true;
> + }
>   rhashtable_walk_stop(&gi->hti);
>  }
>  
> @@ -2061,6 +2067,7 @@ static int __gfs2_glocks_open(struct inode *inode, struct file *file,
>*/
>   gi->last_pos = -1;
>   gi->gl = NULL;
> + gi->gl_held = false;
>   rhashtable_walk_enter(&gl_hash_table, &gi->hti);
>   }
>   return ret;
> @@ -2076,7 +2083,8 @@ static int gfs2_glocks_release(struct inode *inode, struct file *file)
>   struct seq_file *seq = file->private_data;
>   struct gfs2_glock_iter *gi = seq->private;
>  
> - gi->gl = NULL;
> + if (gi->gl_held)
> + gfs2_glock_put(gi->gl);
>   rhashtable_walk_exit(&gi->hti);
>   return seq_release_private(inode, file);
>  }
> -- 
> 2.14.3




Re: [Cluster-devel] [PATCH v4 13/27] lib: add errseq_t type and infrastructure for handling it

2017-05-10 Thread NeilBrown
On Tue, May 09 2017, Jeff Layton wrote:

> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
>
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
>
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
>
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
>
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
>
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

I like that this is a separate lib/*.c - nicely structured too.

Reviewed-by: NeilBrown <ne...@suse.com>

Thanks,
NeilBrown
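
For reference, a quick usage sketch of the API the quoted patch below
declares (wb_err and both functions are invented for illustration;
errseq_set/errseq_sample/errseq_check come from the patch itself):

	static errseq_t wb_err;	/* shared error state; zero means no error yet */

	static void writer_saw_error(int err)
	{
		/* record the error and mark it unseen; no-op when err == 0 */
		errseq_set(&wb_err, err);
	}

	static int reader(void)
	{
		/* remember this point in time */
		errseq_t since = errseq_sample(&wb_err);
		/* ... later: nonzero if a new error was set since 'since' */
		return errseq_check(&wb_err, since);
	}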


> ---
>  include/linux/errseq.h |  19 +++++++++++++++++++
>  lib/Makefile   |   2 +-
>  lib/errseq.c   | 199 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 219 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/errseq.h
>  create mode 100644 lib/errseq.c
>
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> new file mode 100644
> index ..0d2555f310cd
> --- /dev/null
> +++ b/include/linux/errseq.h
> @@ -0,0 +1,19 @@
> +#ifndef _LINUX_ERRSEQ_H
> +#define _LINUX_ERRSEQ_H
> +
> +/* See lib/errseq.c for more info */
> +
> +typedef u32  errseq_t;
> +
> +void __errseq_set(errseq_t *eseq, int err);
> +static inline void errseq_set(errseq_t *eseq, int err)
> +{
> + /* Optimize for the common case of no error */
> + if (unlikely(err))
> + __errseq_set(eseq, err);
> +}
> +
> +errseq_t errseq_sample(errseq_t *eseq);
> +int errseq_check(errseq_t *eseq, errseq_t since);
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..2423afef40f7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o 
> random32.o \
>gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
>bsearch.o find_bit.o llist.o memweight.o kfifo.o \
>percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
> -  once.o refcount.o
> +  once.o refcount.o errseq.o
>  obj-y += string_helpers.o
>  obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
>  obj-y += hexdump.o
> diff --git a/lib/errseq.c b/lib/errseq.c
> new file mode 100644
> index ..0f8b4ed460f0
> --- /dev/null
> +++ b/lib/errseq.c
> @@ -0,0 +1,199 @@
> +#include <linux/err.h>
> +#include <linux/bug.h>
> +#include <linux/atomic.h>
> +#include <linux/errseq.h>
> +
> +/*
> + * An errseq_t is a way of recording errors in one place, and allowing any
> + * number of "subscribers" to tell whether it has changed since an arbitrary
> + * time of their choosing.
> + *
> + * It's implemented as an unsigned 32-bit value. The low order bits are
> + * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
> + * are used as a counter. This is done with atomics instead of locking so that
> + * these functions can be called from any context.
> + *
> + * The general idea is for consumers to sample an errseq_t value at a
> + * particular point in time. Later, that value can be used to tell whether any
> + * new errors have occurred since that time.
> + *
> + * Note that there is a risk of collisions, if new errors are being recorded
> + * frequently, since we have so few bits to use as a counter.
> + *
> + * To mitigate this, one bit is used as a flag to tell whether the value has
> + * been sampled since a new value was recorded. That allows us to avoid bumping
> + * the counter if no one has sampled it since the last time an error was
> + * recorded.
> + *
> + * A new errseq_t should always be zeroed out.  An errseq_t value of all zeroes
> + * is the special (but common) case where there has never been an error. An all
> + * zero value thus serves as the "epoch" if one wishes to know whether there
> + * has ever been an error set since it was first initialized.
> + */
> +
> +/* The low bits are 

Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-30 Thread NeilBrown
On Sat, Apr 30 2016, Dave Chinner wrote:

> On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
>> On Tue, Apr 26 2016, Michal Hocko wrote:
>> 
>> > Hi,
>> > we have discussed this topic at LSF/MM this year. There was a general
>> > interest in the scope GFP_NOFS allocation context among some FS
>> > developers. For those who are not aware of the discussion or the issue
>> > I am trying to sort out (or at least start in that direction) please
>> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> > which basically copies what we have for the scope GFP_NOIO allocation
>> > context. I haven't converted any of the FS myself because that is way
>> > beyond my area of expertise but I would be happy to help with further
>> > changes on the MM front as well as in some more generic code paths.
>> >
>> > Dave had an idea on how to further improve the reclaim context to be
>> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> > and FS specific cookie set in the FS allocation context and consumed
>> > by the FS reclaim context to allow doing some provably save actions
>> > that would be skipped due to GFP_NOFS normally.  I like this idea and
>> > I believe we can go that direction regardless of the approach taken here.
>> > Many filesystems simply need to cleanup their NOFS usage first before
>> > diving into a more complex changes.
>> 
>> This strikes me as over-engineering to work around an unnecessarily
>> burdensome interface but without details it is hard to be certain.
>> 
>> Exactly what things happen in "FS reclaim context" which may, or may
>> not, be safe depending on the specific FS allocation context?  Do they
>> need to happen at all?
>> 
>> My research suggests that for most filesystems the only thing that
>> happens in reclaim context that is at all troublesome is the final
>> 'evict()' on an inode.  This needs to flush out dirty pages and sync the
>> inode to storage.  Some time ago we moved most dirty-page writeout out
>> of the reclaim context and into kswapd.  I think this was an excellent
>> advance in simplicity.
>
> No, we didn't move dirty page writeout to kswapd - we moved it back
> to the background writeback threads where it can be done
> efficiently.  kswapd should almost never do single page writeback
> because of how inefficient it is from an IO perspective, even though
> it can. i.e. if we are doing any significant amount of dirty page
> writeback from memory reclaim (direct, kswapd or otherwise) then
> we've screwed something up.
>
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
>
> When lots of GFP_NOFS allocation is being done, this already
> happens. The shrinkers that can't run due to context accumulate the
> work on the shrinker structure, and when the shrinker can next run
> (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
> reclaim contexts.
>
> IOWs, we already move shrinker work from direct reclaim to kswapd
> when appropriate.
>
>> The exceptions include:
>>  - nfs and any filesystem using fscache can block for up to 1 second
>>in ->releasepage().  They used to block waiting for some IO, but that
>>caused deadlocks and wasn't really needed.  I left the timeout because
>>it seemed likely that some throttling would help.  I suspect that a
>>careful analysis will show that there is sufficient throttling
>>elsewhere.
>> 
>>  - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>>    for IO so it can free some quotainfo things. 
>
> No it's not. evict() can block on IO - waiting for data or inode
> writeback to complete, or even for filesystems to run transactions
> on the inode. Hence the superblock shrinker can and does block in
> inode cache reclaim.

That is why I said "nearly" :-)

>
> Indeed, blocking the superblock shrinker in reclaim is a key part of
> balancing inode cache pressure in XFS. If the shrinker starts
> hitting dirty inodes, it blocks on cleaning them, thereby slowing
> the rate of allocation to that which inodes can be cleaned and
> reclaimed. There are also background threads that walk ahead freeing
> clean inodes, but we have to throttle direct reclaim in this manner
> otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. if we don't balance this, inode allocation triggers
> the OOM killer because reclaim 

Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

2016-04-29 Thread NeilBrown
On Tue, Apr 26 2016, Michal Hocko wrote:

> Hi,
> we have discussed this topic at LSF/MM this year. There was a general
> interest in the scope GFP_NOFS allocation context among some FS
> developers. For those who are not aware of the discussion or the issue
> I am trying to sort out (or at least start in that direction) please
> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> which basically copies what we have for the scope GFP_NOIO allocation
> context. I haven't converted any of the FS myself because that is way
> beyond my area of expertise but I would be happy to help with further
> changes on the MM front as well as in some more generic code paths.
>
> Dave had an idea on how to further improve the reclaim context to be
> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> and FS specific cookie set in the FS allocation context and consumed
> by the FS reclaim context to allow doing some provably save actions
> that would be skipped due to GFP_NOFS normally.  I like this idea and
> I believe we can go that direction regardless of the approach taken here.
> Many filesystems simply need to cleanup their NOFS usage first before
> diving into a more complex changes.

This strikes me as over-engineering to work around an unnecessarily
burdensome interface but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

The exceptions include:
 - nfs and any filesystem using fscache can block for up to 1 second
   in ->releasepage().  They used to block waiting for some IO, but that
   caused deadlocks and wasn't really needed.  I left the timeout because
   it seemed likely that some throttling would help.  I suspect that a
   careful analysis will show that there is sufficient throttling
   elsewhere.

 - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
   for IO so it can free some quotainfo things.  If it could be changed
   to just schedule the IO without waiting for it then I think this
   would be safe to be called in any FS allocation context.  It already
   uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
   if the lock is held.

I think you/we would end up with a much simpler system if instead of
focussing on the places where GFP_NOFS is used, we focus on places where
__GFP_FS is tested, and try to remove them.  If we get rid of enough of
them the remainder could just use __GFP_IO.
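
For concreteness, the kind of test being talked about, sketched on the
common ->releasepage pattern (an illustration, not any particular
filesystem's code):

	static int ex_releasepage(struct page *page, gfp_t gfp_mask)
	{
		/* the allocation context forbids recursing into the FS */
		if (!(gfp_mask & __GFP_FS))
			return 0;
		/* ... otherwise filesystem work is safe here ... */
		return 1;
	}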

> The patch 2 is a debugging aid which warns about explicit allocation
> requests from the scope context. This should help to reduce the
> direct usage of the NOFS flags to bare minimum in favor of the scope
> API. It is not aimed to be merged upstream. I would hope Andrew took it
> into mmotm tree to give it linux-next exposure and allow developers to
> do further cleanups.  There is a new kernel command line parameter which
> has to be used for the debugging to be enabled.
>
> I think the GFP_NOIO should be seeing the same clean up.

I think you are suggesting that use of GFP_NOIO should (largely) be
deprecated in favour of memalloc_noio_save().  I think I agree.
Could we go a step further and deprecate GFP_ATOMIC in favour of some
in_atomic() test?  Maybe that is going too far.
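
To make the scope style concrete, a sketch using the memalloc_nofs_*
names as they later landed in mainline (treat the exact names and
headers as an assumption in the context of this thread):

	/* sketch only; needs <linux/sched/mm.h> and <linux/slab.h> */
	static void *ex_fs_alloc(size_t size)
	{
		unsigned int flags = memalloc_nofs_save(); /* enter NOFS scope */
		void *buf;

		/* inside the scope, a plain GFP_KERNEL allocation behaves
		 * as GFP_NOFS, so reclaim cannot re-enter the filesystem */
		buf = kmalloc(size, GFP_KERNEL);

		memalloc_nofs_restore(flags);	/* leave the scope */
		return buf;
	}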

Thanks,
NeilBrown

>
> Any feedback is highly appreciated.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [Cluster-devel] [RFC PATCH 0/2] dirreadahead system call

2014-07-30 Thread NeilBrown
On Fri, 25 Jul 2014 12:37:29 -0500 Abhi Das a...@redhat.com wrote:

 This system call takes 3 arguments:
 fd  - file descriptor of the directory being readahead
 *offset - offset in dir from which to resume. This is updated
   as we move along in the directory
 count   - The max number of entries to readahead
 
 The syscall is supposed to read up to 'count' entries starting at
 '*offset' and cache the inodes corresponding to those entries. It
 returns a negative error code or a positive number indicating
 the number of inodes it has issued readaheads for. It also
 updates the '*offset' value so that repeated calls to dirreadahead
 can resume at the right location. Returns 0 when there are no more
 entries left.

Hi Abhi,

 I like the idea of enhanced read-ahead on a directory.
 It isn't clear to me why you have included these particular fields in the
 interface though.

 - why have an 'offset'?  Why not just use the current offset of the
   directory 'fd'?
 - Why have a count?  How would a program choose what count to give?

 Maybe you imagine using 'getdents' first to get a list of names, then
 selectively calling 'dirreadahead'  on the offsets of the names you are
 interested in?  That would be racy as names can be added and removed, which
 might change offsets.  So maybe you have another reason?

 I would like to suggest an alternate interface (I love playing the API
 game).

 1/ Add a flag to 'fstatat'  AT_EXPECT_MORE.
If the pathname does not contain a '/', then the 'dirfd' is marked
to indicate that stat information for all names returned by getdents will
be wanted.  The filesystem can choose to optimise that however it sees
fit.

 2/ Add a flag to 'fstatat'  AT_NONBLOCK.
This tells the filesystem that you want this information, so if it can
return it immediately it should, and if not it should start pulling it
into cache.  Possibly this should be two flags: AT_NONBLOCK just avoids
any IO, and AT_ASYNC instigates IO even if NONBLOCK is set.

 Then an ls -l could use AT_EXPECT_MORE and then just stat each name.
 An ls -l *.c, might avoid AT_EXPECT_MORE, but would use AT_NONBLOCK
 against all names, then try again with all the names that returned
 EWOULDBLOCK the first time.
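
 To make the API game concrete, a sketch of an 'ls -l' style walk using
 these flags (AT_EXPECT_MORE and AT_NONBLOCK are the proposal above and
 were never merged; the numeric values are invented for illustration):

	#include <fcntl.h>
	#include <sys/stat.h>

	#define AT_EXPECT_MORE	0x10000		/* hypothetical value */
	#define AT_NONBLOCK	0x20000		/* hypothetical value */

	static int list_dir_hint(int dirfd)
	{
		struct stat st;

		/* hint: stat info for every getdents() name will be wanted */
		if (fstatat(dirfd, ".", &st, AT_EXPECT_MORE) != 0)
			return -1;

		/* ... getdents(), then fstatat(dirfd, name, &st, 0) per
		 * entry; names failing with EWOULDBLOCK under AT_NONBLOCK
		 * would be retried once readahead has pulled them in ... */
		return 0;
	}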


 I would really like to see the 'xstat' syscall too, but there is no point
 having both xstat and fxstat.  Follow the model of fstatat and provide
 just fxstatat which can do both.
 With fxstatat, AT_EXPECT_MORE would tell the dirfd exactly which attributes
 would be wanted so it can fetch only that which is desired.

 I'm not very keen on the xgetdents idea of including name information and
 stat information into the one syscall - I would prefer getdents and xstat be
 kept separate.  Of course if a genuine performance cost of the separation can
 be demonstrated, I could well change my mind.

 It does, however, have the advantage that the kernel doesn't need to worry
 about how long read-ahead data needs to be kept, and the application doesn't
 need to worry about how soon to retry an fstatat which failed with
 EWOULDBLOCK.

Thanks for raising this issue again.  I hope it gets fixed one day...

NeilBrown

