Re: [Cluster-devel] About locking granularity of gfs2

2018-04-24 Thread Guoqing Jiang



On 04/24/2018 08:54 PM, Steven Whitehouse wrote:

Hi,


On 24/04/18 04:54, Guoqing Jiang wrote:

Hi Steve,

Thanks for your reply.

On 04/24/2018 11:03 AM, Steven Whitehouse wrote:

Hi,


On 24/04/18 03:52, Guoqing Jiang wrote:

Hi,

Since gfs2 can "allow parallel allocation from different nodes 
simultaneously
as the locking granularity is one lock per resource group" per 
section 3.2 of [1].


Could it possible to make the locking granularity also applies to 
R/W IO? Then,
with the help of "sunit" and "swidth", we basically can lock a 
stripe, so all nodes
can write to different stripes in parallel, so the basic IO unit is 
one stripe.
Since I don't know gfs2 well,  I am wondering it is possible to do 
it or it doesn't
make sense at all for the idea due to some reasons. Any thoughts 
would be

appreciated, thanks.

I am asking the question because if people want to add cluster support
for md/raid5, then it is better to get help from the filesystem level
to ensure that only one node can access a stripe at a time; otherwise
we have to lock a stripe in the md

layer, which could cause performance issues.

[1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-253-260.pdf

Regards,
Guoqing



It is not just performance, it would be correctness too, since there 
is no guarantee that two nodes are not writing to the same stripe at 
the same time.


Yes, no fs can guarantee it. I am wondering, if GFS2 is used as a local
filesystem running on top of raid5, is it possible for gfs2 to write to
two places simultaneously when the two places belong

to one stripe?


Yes



Given that possibility, I guess it is not recommended to run gfs2 on raid5.

The locking granularity is generally per-inode, and per-rgrp in the
case of rgrps, but the rgrp glock covers only the header/bitmap, since
the allocated blocks are in general subject to the per-inode glocks.


Please correct me if I am wrong: does it mean there are two types of
locking granularity, per-rgrp for allocation within an rgrp,

and per-inode for R/W IO? Thanks.
It depends on what operation is being undertaken. The per-inode glock 
covers all the blocks related to the inode, but during allocation and 
deallocation, the responsibility for the allocated and deallocated 
blocks passes between the rgrp and the inode to which they relate. So 
the situation is more complicated than when no allocation/deallocation 
is involved.


Thanks a lot for your explanation.

Regards,
Guoqing



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 19:17:12, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> > > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > > > Also only for debugging.
> > > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > > > I can prepare a patch.
> > > > 
> > > > Cool!
> > > > 
> > > > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > > > it would be really great to have them documented and that is where the
> > > > scope api is really handy. Just add the scope and document what the
> > > > recursion issue is. This will help people reading the code as well. Ideally
> > > > there shouldn't be any explicit GFP_NOFS in the code.
> > > 
> > > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > > always uses GFP_KERNEL for kmalloc/vmalloc?
> > 
> > Exactly! And in a dream world those memalloc_nofs_save calls act as
> > documentation of the reclaim recursion ;)
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> BTW, should memalloc_nofs_save and memalloc_noio_save be merged into just 
> one scope that prevents both I/O and FS recursion?

Why should FS usage stop IO altogether?

> memalloc_nofs_save allows submitting bios to the I/O stack, and bios 
> created under memalloc_nofs_save could be sent to the loop device, which 
> then calls back into the filesystem...

Don't those use NOIO context?
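
For reference, a minimal sketch of how the two scopes differ (the locks
and functions here are hypothetical, only for illustration): as far as I
can tell the NOFS scope only masks __GFP_FS, while the NOIO scope masks
both __GFP_IO and __GFP_FS.

#include <linux/sched/mm.h>	/* memalloc_nofs_save(), memalloc_noio_save() */
#include <linux/mutex.h>

/* Hypothetical locks, only for illustration. */
static DEFINE_MUTEX(fs_reclaim_lock);
static DEFINE_MUTEX(io_reclaim_lock);

static void example_fs_section(void)
{
	unsigned int flags;

	mutex_lock(&fs_reclaim_lock);	/* also taken from a shrinker */
	flags = memalloc_nofs_save();	/* allocations may still do I/O */
	/* GFP_KERNEL allocations here behave as GFP_NOFS */
	memalloc_nofs_restore(flags);
	mutex_unlock(&fs_reclaim_lock);
}

static void example_io_section(void)
{
	unsigned int flags;

	mutex_lock(&io_reclaim_lock);	/* also taken on the I/O completion path */
	flags = memalloc_noio_save();	/* allocations must not do I/O at all */
	/* GFP_KERNEL allocations here behave as GFP_NOIO */
	memalloc_noio_restore(flags);
	mutex_unlock(&io_reclaim_lock);
}
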
-- 
Michal Hocko
SUSE Labs



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Mikulas Patocka


On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > > Also only for debugging.
> > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > > I can prepare a patch.
> > > 
> > > Cool!
> > > 
> > > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > > it would be really great to have them documented and that is where the
> > > scope api is really handy. Just add the scope and document what the
> > > recursion issue is. This will help people reading the code as well. Ideally
> > > there shouldn't be any explicit GFP_NOFS in the code.
> > 
> > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > always uses GFP_KERNEL for kmalloc/vmalloc?
> 
> Exactly! And in a dream world those memalloc_nofs_save calls act as
> documentation of the reclaim recursion ;)
> -- 
> Michal Hocko
> SUSE Labs

BTW, should memalloc_nofs_save and memalloc_noio_save be merged into just 
one scope that prevents both I/O and FS recursion?

memalloc_nofs_save allows submitting bios to the I/O stack, and bios 
created under memalloc_nofs_save could be sent to the loop device, which 
then calls back into the filesystem...

Mikulas



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Michal Hocko
On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > Also only for debugging.
> > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > I can prepare a patch.
> > 
> > Cool!
> > 
> > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > it would be really great to have them documented and that is where the
> > scope api is really handy. Just add the scope and document what the
> > recursion issue is. This will help people reading the code as well. Ideally
> > there shouldn't be any explicit GFP_NOFS in the code.
> 
> So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> always uses GFP_KERNEL for kmalloc/vmalloc?

Exactly! And in a dream world those memalloc_nofs_save calls act as
documentation of the reclaim recursion ;)
-- 
Michal Hocko
SUSE Labs



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Richard Weinberger
Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > Also only for debugging.
> > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > I can prepare a patch.
> 
> Cool!
> 
> Anyway, if UBIFS has some reclaim recursion critical sections in general
> it would be really great to have them documented and that is where the
> scope api is really handy. Just add the scope and document what the
> recursion issue is. This will help people reading the code as well. Ideally
> there shouldn't be any explicit GFP_NOFS in the code.

So in a perfect world a filesystem calls memalloc_nofs_save/restore and
always uses GFP_KERNEL for kmalloc/vmalloc?

Thanks,
//richard




Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Richard Weinberger
Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko:
> Hi,
> it seems that we still have a few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c

UBI is also not a big deal. We use it here like in UBIFS for debugging
when self-checks are enabled.

Thanks,
//richard


Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Richard Weinberger
[resending without html ...]

Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko:
> Hi,
> it seems that we still have a few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c

UBI is not a big deal. We use it here like in UBIFS for debugging
when self-checks are enabled.

> fs/ext4/xattr.c
> fs/gfs2/dir.c
> fs/gfs2/quota.c
> fs/nfs/blocklayout/extent_tree.c
> fs/ubifs/debug.c
> fs/ubifs/lprops.c
> fs/ubifs/lpt_commit.c
> fs/ubifs/orphan.c

All users in UBIFS are in debugging code and some error reporting.
No fast paths.
I think we can switch to preallocation + locking without much hassle.
I can prepare a patch.
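
A rough sketch of the preallocation + locking idea (the names are made
up, this is not the actual UBIFS code): allocate the debug buffer once
with a normal GFP_KERNEL vmalloc and serialize its users with a mutex,
so the self-check path never allocates under NOFS constraints.

#include <linux/vmalloc.h>
#include <linux/mutex.h>
#include <linux/string.h>
#include <linux/errno.h>

static void *dbg_buf;
static size_t dbg_buf_len;
static DEFINE_MUTEX(dbg_buf_lock);

static int dbg_buf_init(size_t len)
{
	dbg_buf = vmalloc(len);		/* init time, plain GFP_KERNEL is fine */
	if (!dbg_buf)
		return -ENOMEM;
	dbg_buf_len = len;
	return 0;
}

static void dbg_check_node(const void *node, size_t len)
{
	if (len > dbg_buf_len)
		return;
	mutex_lock(&dbg_buf_lock);
	memcpy(dbg_buf, node, len);	/* work on the preallocated copy */
	/* ... run the self-checks on dbg_buf ... */
	mutex_unlock(&dbg_buf_lock);
}

static void dbg_buf_exit(void)
{
	vfree(dbg_buf);
}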

Thanks,
//richard



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Richard Weinberger
Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko:
> fs/ubifs/debug.c

This one is just for debugging.
So, preallocating + locking would not hurt much.

> fs/ubifs/lprops.c

Ditto.

> fs/ubifs/lpt_commit.c

Here, too, we use it only in debugging mode and in one case for
fatal error reporting.
No hot paths.

> fs/ubifs/orphan.c

Also only for debugging.
Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
I can prepare a patch.

Thanks,
//richard


Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 20:26:23, Steven Whitehouse wrote:
[...]
> It would be good to fix this, and it has been known as an issue for a long
> time. We might well be able to make use of the new API though. It might be
> as simple as adding the calls when we get & release glocks, but I'd have to
> check the code to be sure.

Yeah, starting with annotating those locking contexts and documenting
how they are used in the reclaim path is a great first step. This has to
be done per-fs, obviously.
-- 
Michal Hocko
SUSE Labs



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 14:35:36, Theodore Ts'o wrote:
> On Tue, Apr 24, 2018 at 10:27:12AM -0600, Michal Hocko wrote:
> > fs/ext4/xattr.c
> > 
> > What to do about this? Well, there are two things. Firstly, it would be
> > really great to double check whether the GFP_NOFS is really needed. I
> > cannot judge that because I am not familiar with the code.
> 
> *Most* of the time it's not needed, but there are times when it is.
> We could be smarter about sending down GFP_NOFS only when it is
> needed.

Well, the primary idea is that you do not have to. All you care about is
to use the scope api where it matters + a comment describing the
reclaim recursion context (e.g. this lock will be held in the reclaim
path here and there).

> If we are sending too many GFP_NOFS allocations such that
> it's causing heartburn, we could fix this.  (xattr commands are rare
> enough that I didn't think it was worth it to modulate the GFP flags
> for this particular case, but we could make it smarter if it would
> help.)

Well, the vmalloc case is actually a correctness issue rather than
heartburn...

> > If the use is really valid then we have a way to do the vmalloc
> > allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> > does that work? You simply call memalloc_nofs_save when the reclaim
> > recursion critical section starts (e.g. when you take a lock which is
> > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> > when the critical section ends. _All_ allocations within that scope
> > will get GFP_NOFS semantic automagically. If you are not sure about the
> > scope itself then the easiest workaround is to wrap the vmalloc itself
> > with a big fat comment that this should be revisited.
> 
> This is something we could do in ext4.  It hadn't been high priority,
> because we've been rather overloaded.

Well, ext/jbd already has scopes defined for the transaction context so
anything down that road can be converted to GFP_KERNEL (well, unless the
same code path is shared outside of the transaction context and still
requires protection). It would be really great to identify other
contexts and slowly move away from the explicit GFP_NOFS. Are you aware
of other contexts?
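
For illustration, a scope tied to a transaction context could look
roughly like this (hypothetical helpers, not the actual ext4/jbd code):
the NOFS scope is entered when the transaction starts and left when it
stops, so everything allocated inside the transaction implicitly gets
GFP_NOFS semantics.

#include <linux/sched/mm.h>

/* Hypothetical transaction handle, only to show the shape of the scope. */
struct example_handle {
	unsigned int saved_nofs_flags;
};

static void example_trans_start(struct example_handle *h)
{
	/* From here on, reclaim must not recurse into the filesystem. */
	h->saved_nofs_flags = memalloc_nofs_save();
}

static void example_trans_stop(struct example_handle *h)
{
	memalloc_nofs_restore(h->saved_nofs_flags);
}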

> As a suggestion, could you take
> documentation about how to convert to the memalloc_nofs_{save,restore}
> scope api (which I think you've written about in e-mails at length
> before), and put that into a file in Documentation/core-api?

I can.

> The question I was trying to figure out which triggered the above
> request is how/whether to gradually convert to that scope API.  Is it
> safe to add the memalloc_nofs_{save,restore} to code and keep the
> GFP_NOFS flags until we're sure we got it all right, for all of the
> code paths, and then drop the GFP_NOFS?

The first stage is to define and document those scopes. I have provided
a debugging patch [1] in the past that would dump_stack when seeing an
explicit GFP_NOFS from within a scope, which could help to eliminate
existing users.

[1] http://lkml.kernel.org/r/20170106141845.24362-1-mho...@kernel.org
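
The idea is roughly the following (only a sketch along those lines, not
the patch from [1] itself): complain when an allocation mask without
__GFP_FS (e.g. an explicit GFP_NOFS) shows up inside an already
established memalloc_nofs scope, since the scope makes the explicit flag
redundant.

#include <linux/sched.h>
#include <linux/gfp.h>
#include <linux/printk.h>

static inline void warn_on_explicit_nofs(gfp_t gfp_mask)
{
	/* Inside a memalloc_nofs scope, an explicit !__GFP_FS mask is
	 * redundant and worth flagging for cleanup. */
	if ((current->flags & PF_MEMALLOC_NOFS) && !(gfp_mask & __GFP_FS))
		dump_stack();
}
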
-- 
Michal Hocko
SUSE Labs



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 21:03:43, Richard Weinberger wrote:
> Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko:
> > fs/ubifs/debug.c
> 
> This one is just for debugging.
> So, preallocating + locking would not hurt much.
> 
> > fs/ubifs/lprops.c
> 
> Ditto.
> 
> > fs/ubifs/lpt_commit.c
> 
> Here, too, we use it only in debugging mode and in one case for
> fatal error reporting.
> No hot paths.
> 
> > fs/ubifs/orphan.c
> 
> Also only for debugging.
> Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> I can prepare a patch.

Cool!

Anyway, if UBIFS has some reclaim recursion critical sections in general
it would be really great to have them documented and that is where the
scope api is really handy. Just add the scope and document what the
recursion issue is. This will help people reading the code as well. Ideally
there shouldn't be any explicit GFP_NOFS in the code.

Thanks for the quick turnaround.

-- 
Michal Hocko
SUSE Labs



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Steven Whitehouse

Hi,


On 24/04/18 17:27, Michal Hocko wrote:

Hi,
it seems that we still have a few vmalloc users who perform GFP_NOFS
allocation:
drivers/mtd/ubi/io.c
fs/ext4/xattr.c
fs/gfs2/dir.c
fs/gfs2/quota.c
fs/nfs/blocklayout/extent_tree.c
fs/ubifs/debug.c
fs/ubifs/lprops.c
fs/ubifs/lpt_commit.c
fs/ubifs/orphan.c

Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
because we have hardcoded GFP_KERNEL allocations deep inside the
vmalloc layers. That means that if GFP_NOFS really protects from
recursion into the fs and the resulting deadlocks, then the vmalloc
call is broken.

What to do about this? Well, there are two things. Firstly, it would be
really great to double check whether the GFP_NOFS is really needed. I
cannot judge that because I am not familiar with the code. It would be
great if the respective maintainers could check (hopefully
get_maintainer.sh pointed me to all the relevant ones). If there is no
reclaim recursion issue then simply use the standard vmalloc (aka a
GFP_KERNEL request).
For GFS2, and I suspect for other fs too, it is really needed. We don't 
want to enter reclaim while holding filesystem locks.



If the use is really valid then we have a way to do the vmalloc
allocation properly. We have memalloc_nofs_{save,restore} scope api. How
does that work? You simply call memalloc_nofs_save when the reclaim
recursion critical section starts (e.g. when you take a lock which is
then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
when the critical section ends. _All_ allocations within that scope
will get GFP_NOFS semantic automagically. If you are not sure about the
scope itself then the easiest workaround is to wrap the vmalloc itself
with a big fat comment that this should be revisited.

Does that sound like something that can be done in a reasonable time?
I have tried to bring this up in the past but our speed is glacial and
there are attempts to do hacks like checking for abusers inside the
vmalloc which is just too ugly to live.

Please do not hesitate to get back to me if something is not clear.

Thanks!


It would be good to fix this, and it has been known as an issue for a 
long time. We might well be able to make use of the new API though. It 
might be as simple as adding the calls when we get & release glocks, but 
I'd have to check the code to be sure.


Steve.



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Theodore Y. Ts'o
On Tue, Apr 24, 2018 at 10:27:12AM -0600, Michal Hocko wrote:
> fs/ext4/xattr.c
> 
> What to do about this? Well, there are two things. Firstly, it would be
> really great to double check whether the GFP_NOFS is really needed. I
> cannot judge that because I am not familiar with the code.

*Most* of the time it's not needed, but there are times when it is.
We could be smarter about sending down GFP_NOFS only when it is
needed.  If we are sending too many GFP_NOFS allocations such that
it's causing heartburn, we could fix this.  (xattr commands are rare
enough that I didn't think it was worth it to modulate the GFP flags
for this particular case, but we could make it smarter if it would
help.)

> If the use is really valid then we have a way to do the vmalloc
> allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> does that work? You simply call memalloc_nofs_save when the reclaim
> recursion critical section starts (e.g. when you take a lock which is
> then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> when the critical section ends. _All_ allocations within that scope
> will get GFP_NOFS semantic automagically. If you are not sure about the
> scope itself then the easiest workaround is to wrap the vmalloc itself
> with a big fat comment that this should be revisited.

This is something we could do in ext4.  It hadn't been high priority,
because we've been rather overloaded.  As a suggestion, could you take
documentation about how to convert to the memalloc_nofs_{save,restore}
scope api (which I think you've written about in e-mails at length
before), and put that into a file in Documentation/core-api?

The question I was trying to figure out which triggered the above
request is how/whether to gradually convert to that scope API.  Is it
safe to add the memalloc_nofs_{save,restore} to code and keep the
GFP_NOFS flags until we're sure we got it all right, for all of the
code paths, and then drop the GFP_NOFS?

Thanks,

- Ted



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Mikulas Patocka


On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Tue 24-04-18 12:46:55, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > Hi,
> > > it seems that we still have a few vmalloc users who perform GFP_NOFS
> > > allocation:
> > > drivers/mtd/ubi/io.c
> > > fs/ext4/xattr.c
> > > fs/gfs2/dir.c
> > > fs/gfs2/quota.c
> > > fs/nfs/blocklayout/extent_tree.c
> > > fs/ubifs/debug.c
> > > fs/ubifs/lprops.c
> > > fs/ubifs/lpt_commit.c
> > > fs/ubifs/orphan.c
> > > 
> > > Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> > > because we have hardcoded GFP_KERNEL allocations deep inside the
> > > vmalloc layers. That means that if GFP_NOFS really protects from
> > > recursion into the fs and the resulting deadlocks, then the vmalloc
> > > call is broken.
> > > 
> > > What to do about this? Well, there are two things. Firstly, it would be
> > > really great to double check whether the GFP_NOFS is really needed. I
> > > cannot judge that because I am not familiar with the code. It would be
> > > great if the respective maintainers could check (hopefully
> > > get_maintainer.sh pointed me to all the relevant ones). If there is no
> > > reclaim recursion issue then simply use the standard vmalloc (aka a
> > > GFP_KERNEL request).
> > > 
> > > If the use is really valid then we have a way to do the vmalloc
> > > allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> > > does that work? You simply call memalloc_nofs_save when the reclaim
> > > recursion critical section starts (e.g. when you take a lock which is
> > > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> > > when the critical section ends. _All_ allocations within that scope
> > > will get GFP_NOFS semantic automagically. If you are not sure about the
> > > scope itself then the easiest workaround is to wrap the vmalloc itself
> > > with a big fat comment that this should be revisited.
> > > 
> > > Does that sound like something that can be done in a reasonable time?
> > > I have tried to bring this up in the past but our speed is glacial and
> > > there are attempts to do hacks like checking for abusers inside the
> > > vmalloc which is just too ugly to live.
> > > 
> > > Please do not hesitate to get back to me if something is not clear.
> > > 
> > > Thanks!
> > > -- 
> > > Michal Hocko
> > > SUSE Labs
> > 
> > I made a patch that adds memalloc_noio/fs_save around these calls a year 
> > ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html
> 
> Yeah, and that is the wrong approach.

It is crude, but it fixes the deadlock possibility. Then, the maintainers 
will have a lot of time to refactor the code and move these 
memalloc_noio_save calls to the proper scope.

> Let's try to fix this properly
> this time. As the above outlines, the worst case we could end up with
> mid-term would be to wrap the vmalloc calls in the scope api with a
> TODO. But I am
> pretty sure the respective maintainers can come up with a better
> solution. I am definitely willing to help here.
> -- 
> Michal Hocko
> SUSE Labs

Mikulas



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Mikulas Patocka


On Tue, 24 Apr 2018, Michal Hocko wrote:

> Hi,
> it seems that we still have a few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c
> fs/ext4/xattr.c
> fs/gfs2/dir.c
> fs/gfs2/quota.c
> fs/nfs/blocklayout/extent_tree.c
> fs/ubifs/debug.c
> fs/ubifs/lprops.c
> fs/ubifs/lpt_commit.c
> fs/ubifs/orphan.c
> 
> Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> because we have hardcoded GFP_KERNEL allocations deep inside the
> vmalloc layers. That means that if GFP_NOFS really protects from
> recursion into the fs and the resulting deadlocks, then the vmalloc
> call is broken.
> 
> What to do about this? Well, there are two things. Firstly, it would be
> really great to double check whether the GFP_NOFS is really needed. I
> cannot judge that because I am not familiar with the code. It would be
> great if the respective maintainers could check (hopefully
> get_maintainer.sh pointed me to all the relevant ones). If there is no
> reclaim recursion issue then simply use the standard vmalloc (aka a
> GFP_KERNEL request).
> 
> If the use is really valid then we have a way to do the vmalloc
> allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> does that work? You simply call memalloc_nofs_save when the reclaim
> recursion critical section starts (e.g. when you take a lock which is
> then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> when the critical section ends. _All_ allocations within that scope
> will get GFP_NOFS semantic automagically. If you are not sure about the
> scope itself then the easiest workaround is to wrap the vmalloc itself
> with a big fat comment that this should be revisited.
> 
> Does that sound like something that can be done in a reasonable time?
> I have tried to bring this up in the past but our speed is glacial and
> there are attempts to do hacks like checking for abusers inside the
> vmalloc which is just too ugly to live.
> 
> Please do not hesitate to get back to me if something is not clear.
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs

I made a patch that adds memalloc_noio/fs_save around these calls a year 
ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html

Mikulas



Re: [Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 12:46:55, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > Hi,
> > it seems that we still have a few vmalloc users who perform GFP_NOFS
> > allocation:
> > drivers/mtd/ubi/io.c
> > fs/ext4/xattr.c
> > fs/gfs2/dir.c
> > fs/gfs2/quota.c
> > fs/nfs/blocklayout/extent_tree.c
> > fs/ubifs/debug.c
> > fs/ubifs/lprops.c
> > fs/ubifs/lpt_commit.c
> > fs/ubifs/orphan.c
> > 
> > Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> > because we have hardcoded GFP_KERNEL allocations deep inside the
> > vmalloc layers. That means that if GFP_NOFS really protects from
> > recursion into the fs and the resulting deadlocks, then the vmalloc
> > call is broken.
> > 
> > What to do about this? Well, there are two things. Firstly, it would be
> > really great to double check whether the GFP_NOFS is really needed. I
> > cannot judge that because I am not familiar with the code. It would be
> > great if the respective maintainers could check (hopefully
> > get_maintainer.sh pointed me to all the relevant ones). If there is no
> > reclaim recursion issue then simply use the standard vmalloc (aka a
> > GFP_KERNEL request).
> > 
> > If the use is really valid then we have a way to do the vmalloc
> > allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> > does that work? You simply call memalloc_nofs_save when the reclaim
> > recursion critical section starts (e.g. when you take a lock which is
> > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> > when the critical section ends. _All_ allocations within that scope
> > will get GFP_NOFS semantic automagically. If you are not sure about the
> > scope itself then the easiest workaround is to wrap the vmalloc itself
> > with a big fat comment that this should be revisited.
> > 
> > Does that sound like something that can be done in a reasonable time?
> > I have tried to bring this up in the past but our speed is glacial and
> > there are attempts to do hacks like checking for abusers inside the
> > vmalloc which is just too ugly to live.
> > 
> > Please do not hesitate to get back to me if something is not clear.
> > 
> > Thanks!
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> I made a patch that adds memalloc_noio/fs_save around these calls a year 
> ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html

Yeah, and that is the wrong approach. Let's try to fix this properly
this time. As the above outlines, the worst case we could end up with
mid-term would be to wrap the vmalloc calls in the scope api with a
TODO. But I am
pretty sure the respective maintainers can come up with a better
solution. I am definitely willing to help here.
-- 
Michal Hocko
SUSE Labs



[Cluster-devel] vmalloc with GFP_NOFS

2018-04-24 Thread Michal Hocko
Hi,
it seems that we still have a few vmalloc users who perform GFP_NOFS
allocation:
drivers/mtd/ubi/io.c
fs/ext4/xattr.c
fs/gfs2/dir.c
fs/gfs2/quota.c
fs/nfs/blocklayout/extent_tree.c
fs/ubifs/debug.c
fs/ubifs/lprops.c
fs/ubifs/lpt_commit.c
fs/ubifs/orphan.c

Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
because we have hardcoded GFP_KERNEL allocations deep inside the
vmalloc layers. That means that if GFP_NOFS really protects from
recursion into the fs and the resulting deadlocks, then the vmalloc
call is broken.

What to do about this? Well, there are two things. Firstly, it would be
really great to double check whether the GFP_NOFS is really needed. I
cannot judge that because I am not familiar with the code. It would be
great if the respective maintainers could check (hopefully
get_maintainer.sh pointed me to all the relevant ones). If there is no
reclaim recursion issue then simply use the standard vmalloc (aka a
GFP_KERNEL request).

If the use is really valid then we have a way to do the vmalloc
allocation properly. We have memalloc_nofs_{save,restore} scope api. How
does that work? You simply call memalloc_nofs_save when the reclaim
recursion critical section starts (e.g. when you take a lock which is
then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
when the critical section ends. _All_ allocations within that scope
will get GFP_NOFS semantic automagically. If you are not sure about the
scope itself then the easiest workaround is to wrap the vmalloc itself
with a big fat comment that this should be revisited.
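
For illustration, a minimal sketch of what such a scope could look like
(the structure and lock below are made up; the lock only stands in for
whatever is also taken on the reclaim path):

#include <linux/sched/mm.h>	/* memalloc_nofs_save()/memalloc_nofs_restore() */
#include <linux/vmalloc.h>
#include <linux/mutex.h>

/* Hypothetical per-fs state; fs->lock stands in for any lock that is
 * also taken on the reclaim path (e.g. from a shrinker). */
struct example_fs {
	struct mutex lock;
};

static void *example_alloc_table(struct example_fs *fs, size_t len)
{
	unsigned int nofs_flags;
	void *table;

	mutex_lock(&fs->lock);			/* reclaim recursion critical section starts */
	nofs_flags = memalloc_nofs_save();

	/* A plain allocation request; within the scope it gets GFP_NOFS
	 * semantics, including the allocations done internally by vmalloc. */
	table = vmalloc(len);

	memalloc_nofs_restore(nofs_flags);	/* critical section ends */
	mutex_unlock(&fs->lock);
	return table;
}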

Does that sound like something that can be done in a reasonable time?
I have tried to bring this up in the past but our speed is glacial and
there are attempts to do hacks like checking for abusers inside the
vmalloc which is just too ugly to live.

Please do not hesitate to get back to me if something is not clear.

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [Cluster-devel] About locking granularity of gfs2

2018-04-24 Thread Steven Whitehouse

Hi,


On 24/04/18 04:54, Guoqing Jiang wrote:

Hi Steve,

Thanks for your reply.

On 04/24/2018 11:03 AM, Steven Whitehouse wrote:

Hi,


On 24/04/18 03:52, Guoqing Jiang wrote:

Hi,

Since gfs2 can "allow parallel allocation from different nodes 
simultaneously
as the locking granularity is one lock per resource group" per 
section 3.2 of [1].


Could it possible to make the locking granularity also applies to 
R/W IO? Then,
with the help of "sunit" and "swidth", we basically can lock a 
stripe, so all nodes
can write to different stripes in parallel, so the basic IO unit is 
one stripe.
Since I don't know gfs2 well,  I am wondering it is possible to do 
it or it doesn't
make sense at all for the idea due to some reasons. Any thoughts 
would be

appreciated, thanks.

I am asking the question because if people want to add cluster support
for md/raid5, then it is better to get help from the filesystem level
to ensure that only one node can access a stripe at a time; otherwise
we have to lock a stripe in the md

layer, which could cause performance issues.

[1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-253-260.pdf

Regards,
Guoqing



It is not just performance, it would be correctness too, since there 
is no guarantee that two nodes are not writing to the same stripe at 
the same time.


Yes, no fs can guarantee it. I am wondering, if GFS2 is used as a local
filesystem running on top of raid5, is it possible for gfs2 to write to
two places simultaneously when the two places belong

to one stripe?


Yes

The locking granularity is generally per-inode, and per-rgrp in the
case of rgrps, but the rgrp glock covers only the header/bitmap, since
the allocated blocks are in general subject to the per-inode glocks.


Please correct me if I am wrong: does it mean there are two types of
locking granularity, per-rgrp for allocation within an rgrp,

and per-inode for R/W IO? Thanks.
It depends on what operation is being undertaken. The per-inode glock 
covers all the blocks related to the inode, but during allocation and 
deallocation, the responsibility for the allocated and deallocated blocks 
passes between the rgrp and the inode to which they relate. So the situation 
is more complicated than when no allocation/deallocation is involved.


Steve.


Guoqing






Re: [Cluster-devel] About locking granularity of gfs2

2018-04-24 Thread Guoqing Jiang



On 04/24/2018 01:13 PM, Gang He wrote:

The stripe unit is a logical volume concept; a filesystem should not
need to know about it. For a filesystem, the access unit is the block
(a power of the disk sector size).


IMHO, it is true for a typical fs, but I think zfs and btrfs are well
aware of it, though

no cluster fs supports it now.

Thanks,
Guoqing