Re: [Cluster-devel] [PATCH 0/2 v2] positional readdir cookies

2015-12-03 Thread Bob Peterson
- Original Message -
> These two patches implement positional readdir cookies. The first one simply
> changes how leaf blocks are split so that this method can work. The second
> one does the meat of the work.
> 
> As I mention in the second patch, this adds a new parameter to the dirent
> structure that is never saved to disk. It simply makes use of that memory to
> store the computed location-based cookie. Avoiding this has a noticeable
> performance impact. However, I'm open to any ideas on how to make this look
> less strange (or any other ways of getting space to store these values that
> don't involve allocating it, which is what caused the performance hit).
> 
> This set is the same as my last set, but the ar_loccookie flag in gfs2_args
> is now a single bit instead of a whole unsigned int.
> 
> Benjamin Marzinski (2):
>   gfs2: keep offset when splitting dir leaf blocks
>   gfs2: change gfs2 readdir cookie
> 
>  fs/gfs2/dir.c                    | 160 ++-
>  fs/gfs2/incore.h                 |   3 +
>  fs/gfs2/ops_fstype.c             |   3 +
>  fs/gfs2/super.c                  |  12 +++
>  include/uapi/linux/gfs2_ondisk.h |   9 ++-
>  5 files changed, 148 insertions(+), 39 deletions(-)
> 
> --
> 1.8.3.1
> 
> 
Hi,

Thanks. These patches are now applied to the for-next branch of the linux-gfs2 
tree:

https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=47378d02fb9931f54c5812ec1ac2fb94e5d403a7
https://git.kernel.org/cgit/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?h=for-next&id=83bdc7b7b08430fad8ced9cbe498d06591521c9c
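
For anyone following along: the "new parameter to the dirent structure that is
never saved to disk" mentioned in the cover letter boils down to overlaying an
in-memory cookie on the dirent's existing on-disk padding, roughly like this
(an illustration only, not necessarily the exact layout that was applied; the
union and field names are assumptions):

/*
 * Rough illustration only. The idea is to overlay an in-memory cookie
 * field on the existing on-disk padding of the directory entry, so no
 * per-dirent allocation is needed and nothing new is written to disk.
 */
struct gfs2_dirent {
        struct gfs2_inum de_inum;
        __be32 de_hash;
        __be16 de_rec_len;
        __be16 de_name_len;
        __be16 de_type;
        union {
                __u8 __pad[14];          /* existing on-disk padding */
                struct {
                        __u32 de_cookie; /* computed in memory, never
                                          * written back to disk */
                        __u8 pad3[10];
                };
        };
};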

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-03 Thread Steven Whitehouse

Hi,

On 02/12/15 17:41, Bob Peterson wrote:

- Original Message -

- Original Message -
(snip)

Please take a look at this
again and figure out what the problematic cycle of events is, and then
work out how to avoid that happening in the first place. There is no
point in replacing one problem with another one, particularly one which
would likely be very tricky to debug,

Steve.

The problematic cycle of events is well known:
gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
but if it's the very last put, it calls into dlm, which can block,
and that's where we get into trouble.

The livelock goes like this:

1. A fence operation needs memory, so it blocks on memory allocation.
2. Memory allocation blocks on slab shrinker.
3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode.
5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's
glock.
6. gfs2_glock_put calls dlm unlock to unlock the glock.
7. dlm blocks on a pending fence operation. Goto 1.

So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody
needs to do the final put and tell dlm to unlock the inode's glock, which is
why I've been trying to queue it to the delayed work queue.

If I can't do that, we're left with few alternatives. Perhaps a new function
for the quota daemon: walk the lru list and call dlm to unlock any glocks that
have a special bit set. But that just seems ugly.

I've thought of some other alternatives, but they seem a lot uglier and
harder to manage. I'll give it some more thought.

Regards,

Bob Peterson
Red Hat File Systems


Hi Steve,

Another possibility is to add a new inode workqueue which does the work of
gfs2_clear_inode(). Its only function would be to clear the inode, so there
should not be any competing work, as there is in the case of a glock. The
gfs2_clear_inode() function would then queue the work and return to its
caller, and therefore it wouldn't block. The queued work might block on DLM,
but in theory that shouldn't matter, since the shrinker will complete, which
will free up the fence, which will free up dlm, which will free up everything.
What do you think?
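
Roughly, the shape of it would be something like this (untested sketch; the
workqueue, the i_clear_work field, and the gfs2_do_clear_inode() helper are
illustrative names, not existing GFS2 code):

#include <linux/workqueue.h>

static struct workqueue_struct *gfs2_clear_workqueue;

static void gfs2_clear_work_fn(struct work_struct *work)
{
        struct gfs2_inode *ip = container_of(work, struct gfs2_inode,
                                             i_clear_work);

        /*
         * The real teardown, including the final gfs2_glock_put(), runs
         * here. It may still block on DLM, but we are no longer in the
         * shrinker's call chain, so reclaim (and the fence) can finish.
         */
        gfs2_do_clear_inode(ip);
}

static void gfs2_clear_inode(struct inode *inode)
{
        struct gfs2_inode *ip = GFS2_I(inode);

        /* Hand the teardown off and return without blocking. The inode
         * must of course stay valid until the queued work has run. */
        INIT_WORK(&ip->i_clear_work, gfs2_clear_work_fn);
        queue_work(gfs2_clear_workqueue, &ip->i_clear_work);
}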

Regards,

Bob Peterson
Red Hat File Systems


Well, that is a possibility. There is another dependency cycle there, however,
which is the one relating to the inode lifetime, and we need to be careful
that we don't break that while fixing the memory-allocation-at-fence-time
issue. I know it isn't easy, but we need to be very careful to avoid
introducing any new races, as they are likely to be very tricky to debug in
this area.


One thing we could do is to be more careful about when we take this delayed
path. We could, for example, add a check to see whether we are running in the
context of the shrinker, and only delay the eviction in that specific case. I
was also wondering whether we could move the decision further up the chain and
avoid evicting inodes with a zero link count under memory pressure in the
first place. The issue of needing to allocate memory in order to evict inodes
is likely to affect other filesystems too, so handling it could be useful as a
generic feature.
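
For the first option, a minimal sketch of what the check might look like
(using PF_MEMALLOC/PF_KSWAPD to stand in for "running from the shrinker" is an
assumption, not existing GFS2 code):

#include <linux/sched.h>

/*
 * Sketch only: approximate "we were called from memory reclaim" by
 * checking whether the current task is kswapd or is doing direct
 * reclaim, and only take the deferred eviction path in that case.
 */
static bool gfs2_evicting_under_reclaim(void)
{
        return current->flags & (PF_MEMALLOC | PF_KSWAPD);
}

/*
 * gfs2_clear_inode() would then keep its current synchronous behaviour
 * except under reclaim:
 *
 *      if (gfs2_evicting_under_reclaim())
 *              defer the final glock put       (as discussed above)
 *      else
 *              drop it synchronously as today
 */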


I'd prefer to avoid adding a new inode workqueue; however, there is no reason
why we could not add this as a feature of an existing workqueue, by adding a
new inode or glock flag as appropriate (which will not expand the size of the
inode) to deal with this case.
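
For example (sketch only; GLF_DEFER_FINAL_PUT is a hypothetical flag bit, and
the glock work function would need matching handling to notice it):

/*
 * Sketch: rather than dropping the last glock reference directly from
 * gfs2_clear_inode(), set a (hypothetical) flag and kick the existing
 * glock workqueue. The work function, seeing the flag, would do the
 * final gfs2_glock_put() -- and hence the DLM unlock -- from workqueue
 * context instead.
 */
static void gfs2_queue_final_put(struct gfs2_glock *gl)
{
        set_bit(GLF_DEFER_FINAL_PUT, &gl->gl_flags);    /* hypothetical */
        queue_delayed_work(glock_workqueue, &gl->gl_work, 0);
}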


Steve.