Re: [Cluster-devel] About locking granularity of gfs2
On 04/24/2018 08:54 PM, Steven Whitehouse wrote:
> Hi,
>
> On 24/04/18 04:54, Guoqing Jiang wrote:
> > Hi Steve,
> >
> > Thanks for your reply.
> >
> > On 04/24/2018 11:03 AM, Steven Whitehouse wrote:
> > > Hi,
> > >
> > > On 24/04/18 03:52, Guoqing Jiang wrote:
> > > > Hi,
> > > >
> > > > Since gfs2 can "allow parallel allocation from different nodes
> > > > simultaneously as the locking granularity is one lock per resource
> > > > group" per section 3.2 of [1], could the same locking granularity
> > > > also be applied to R/W IO? Then, with the help of "sunit" and
> > > > "swidth", we could basically lock a stripe, so all nodes could
> > > > write to different stripes in parallel and the basic IO unit would
> > > > be one stripe. Since I don't know gfs2 well, I am wondering whether
> > > > it is possible to do this, or whether the idea doesn't make sense
> > > > for some reason. Any thoughts would be appreciated, thanks.
> > > >
> > > > I am asking because if people want to add cluster support for
> > > > md/raid5, it would be better to get help from the filesystem level
> > > > to ensure that only one node can access a stripe at a time;
> > > > otherwise we have to lock a stripe in the md layer, which could
> > > > cause performance issues.
> > > >
> > > > [1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-253-260.pdf
> > > >
> > > > Regards,
> > > > Guoqing
> > >
> > > It is not just performance, it would be correctness too, since there
> > > is no guarantee that two nodes are not writing to the same stripe at
> > > the same time.
> >
> > Yes, no fs can guarantee that. I am wondering: if GFS2 is used as a
> > local filesystem running on top of raid5, is it possible that gfs2
> > writes to two places simultaneously while the two places belong to one
> > stripe?
>
> Yes

Given that possibility, I guess it is not recommended to run gfs2 on
raid5.

> > > The locking granularity is per-inode generally, but also per-rgrp in
> > > the case of rgrps, but that refers only to the header/bitmap since
> > > the allocated blocks are subject to the per-inode glocks in general.
> >
> > Please correct me: does it mean there are two types of locking
> > granularity? per-rgrp is for allocation within an rgrp, and per-inode
> > is for R/W IO, thanks.
>
> It depends what operation is being undertaken. The per-inode glock
> covers all the blocks related to the inode, but during allocation and
> deallocation, the responsibility for the allocated and deallocated
> blocks passes between the rgrp and the inode to which they relate. So
> the situation is more complicated than when no allocation/deallocation
> is involved.

Thanks a lot for your explanation.

Regards,
Guoqing
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue 24-04-18 19:17:12, Mikulas Patocka wrote:
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> > On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> > > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > > > Also only for debugging.
> > > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > > > I can prepare a patch.
> > > >
> > > > Cool!
> > > >
> > > > Anyway, if UBIFS has some reclaim recursion critical sections in
> > > > general it would be really great to have them documented and that
> > > > is where the scope api is really handy. Just add the scope and
> > > > document what is the recursion issue. This will help people reading
> > > > the code as well. Ideally there shouldn't be any explicit GFP_NOFS
> > > > in the code.
> > >
> > > So in a perfect world a filesystem calls memalloc_nofs_save/restore
> > > and always uses GFP_KERNEL for kmalloc/vmalloc?
> >
> > Exactly! And in a dream world those memalloc_nofs_save act as a
> > documentation of the reclaim recursion documentation ;)
> > --
> > Michal Hocko
> > SUSE Labs
>
> BTW. should memalloc_nofs_save and memalloc_noio_save be merged into
> just one that prevents both I/O and FS recursion?

Why should FS usage stop IO altogether?

> memalloc_nofs_save allows submitting bios to I/O stack and the bios
> created under memalloc_nofs_save could be sent to the loop device and
> the loop device calls the filesystem...

Don't those use NOIO context?
--
Michal Hocko
SUSE Labs
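The distinction being pointed at here can be sketched as follows: a block
driver sitting under a filesystem (the loop device in the example above)
would mark its own allocation scope NOIO, so it does not matter whether
the filesystem above it was running in a NOFS scope. This is only an
illustration of the two scope APIs; the driver structure and function
below are hypothetical and not taken from the loop driver.

#include <linux/sched/mm.h>	/* memalloc_noio_save/restore */
#include <linux/slab.h>

/* Hypothetical driver context, for illustration only. */
struct my_blkdev {
	size_t bufsize;
};

static void my_blkdev_handle_work(struct my_blkdev *dev)
{
	/*
	 * Everything allocated while servicing this work item must not
	 * recurse into the I/O stack (and therefore not into any fs
	 * either), so open a NOIO scope rather than a NOFS one.
	 */
	unsigned int noio_flags = memalloc_noio_save();
	void *tmp;

	tmp = kmalloc(dev->bufsize, GFP_KERNEL);  /* behaves as GFP_NOIO here */
	if (tmp) {
		/* ... process the request using tmp ... */
		kfree(tmp);
	}

	memalloc_noio_restore(noio_flags);
}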
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue, 24 Apr 2018, Michal Hocko wrote: > On Wed 25-04-18 00:18:40, Richard Weinberger wrote: > > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko: > > > > Also only for debugging. > > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem. > > > > I can prepare a patch. > > > > > > Cool! > > > > > > Anyway, if UBIFS has some reclaim recursion critical sections in general > > > it would be really great to have them documented and that is where the > > > scope api is really handy. Just add the scope and document what is the > > > recursion issue. This will help people reading the code as well. Ideally > > > there shouldn't be any explicit GFP_NOFS in the code. > > > > So in a perfect world a filesystem calls memalloc_nofs_save/restore and > > always uses GFP_KERNEL for kmalloc/vmalloc? > > Exactly! And in a dream world those memalloc_nofs_save act as a > documentation of the reclaim recursion documentation ;) > -- > Michal Hocko > SUSE Labs BTW. should memalloc_nofs_save and memalloc_noio_save be merged into just one that prevents both I/O and FS recursion? memalloc_nofs_save allows submitting bios to I/O stack and the bios created under memalloc_nofs_save could be sent to the loop device and the loop device calls the filesystem... Mikulas
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > Also only for debugging.
> > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > I can prepare a patch.
> >
> > Cool!
> >
> > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > it would be really great to have them documented and that is where the
> > scope api is really handy. Just add the scope and document what is the
> > recursion issue. This will help people reading the code as well. Ideally
> > there shouldn't be any explicit GFP_NOFS in the code.
>
> So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> always uses GFP_KERNEL for kmalloc/vmalloc?

Exactly! And in a dream world those memalloc_nofs_save act as a
documentation of the reclaim recursion documentation ;)
--
Michal Hocko
SUSE Labs
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tuesday, 24 April 2018 at 21:28:03 CEST, Michal Hocko wrote:
> > Also only for debugging.
> > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > I can prepare a patch.
>
> Cool!
>
> Anyway, if UBIFS has some reclaim recursion critical sections in general
> it would be really great to have them documented and that is where the
> scope api is really handy. Just add the scope and document what is the
> recursion issue. This will help people reading the code as well. Ideally
> there shouldn't be any explicit GFP_NOFS in the code.

So in a perfect world a filesystem calls memalloc_nofs_save/restore and
always uses GFP_KERNEL for kmalloc/vmalloc?

Thanks,
//richard
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tuesday, 24 April 2018 at 18:27:12 CEST, Michal Hocko wrote:
> Hi,
> it seems that we still have few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c

UBI is also not a big deal. We use it there, as in UBIFS, for debugging
when self-checks are enabled.

Thanks,
//richard
Re: [Cluster-devel] vmalloc with GFP_NOFS
[resending without html ...]

On Tuesday, 24 April 2018 at 18:27:12 CEST, Michal Hocko wrote:
> Hi,
> it seems that we still have few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c

UBI is not a big deal. We use it there, as in UBIFS, for debugging when
self-checks are enabled.

> fs/ext4/xattr.c
> fs/gfs2/dir.c
> fs/gfs2/quota.c
> fs/nfs/blocklayout/extent_tree.c
> fs/ubifs/debug.c
> fs/ubifs/lprops.c
> fs/ubifs/lpt_commit.c
> fs/ubifs/orphan.c

All the users in UBIFS are debugging code and some error reporting; no
fast paths. I think we can switch to preallocation + locking without
much hassle. I can prepare a patch.

Thanks,
//richard
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tuesday, 24 April 2018 at 18:27:12 CEST, Michal Hocko wrote:
> fs/ubifs/debug.c

This one is just for debugging, so preallocating + locking would not
hurt much.

> fs/ubifs/lprops.c

Ditto.

> fs/ubifs/lpt_commit.c

Here we also use it only in debugging mode, and in one case for fatal
error reporting. No hot paths.

> fs/ubifs/orphan.c

Also only for debugging.

Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem. I can
prepare a patch.

Thanks,
//richard
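For reference, the "preallocate + lock" replacement described above might
look roughly like the sketch below: one buffer allocated with plain
vmalloc() at init time (where GFP_KERNEL is fine) and a mutex serializing
the debug paths that use it. The names are illustrative, not the actual
UBIFS helpers.

#include <linux/mutex.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

/* Illustrative names only -- not the real UBIFS debug code. */
static void *dbg_check_buf;
static DEFINE_MUTEX(dbg_check_lock);

int dbg_check_init(size_t leb_size)
{
	/* init/mount context: no reclaim-recursion risk, plain vmalloc */
	dbg_check_buf = vmalloc(leb_size);
	return dbg_check_buf ? 0 : -ENOMEM;
}

int dbg_check_leb(const void *data, size_t len)
{
	int err = 0;

	/*
	 * Debug path: reuse the preallocated buffer under the lock
	 * instead of doing a vmalloc(..., GFP_NOFS) here.
	 */
	mutex_lock(&dbg_check_lock);
	memcpy(dbg_check_buf, data, len);
	/* ... run the self-check against dbg_check_buf ... */
	mutex_unlock(&dbg_check_lock);

	return err;
}

void dbg_check_exit(void)
{
	vfree(dbg_check_buf);
}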
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue 24-04-18 20:26:23, Steven Whitehouse wrote:
[...]
> It would be good to fix this, and it has been known as an issue for a long
> time. We might well be able to make use of the new API though. It might be
> as simple as adding the calls when we get & release glocks, but I'd have to
> check the code to be sure,

Yeah, annotating those locking contexts and documenting how they are used
in the reclaim path is a great first step. This has to be done per-fs,
obviously.
--
Michal Hocko
SUSE Labs
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue 24-04-18 14:35:36, Theodore Ts'o wrote:
> On Tue, Apr 24, 2018 at 10:27:12AM -0600, Michal Hocko wrote:
> > fs/ext4/xattr.c
> >
> > What to do about this? Well, there are two things. Firstly, it would be
> > really great to double check whether the GFP_NOFS is really needed. I
> > cannot judge that because I am not familiar with the code.
>
> *Most* of the time it's not needed, but there are times when it is.
> We could be smarter about sending down GFP_NOFS only when it is needed.

Well, the primary idea is that you do not have to. All you care about is
to use the scope api where it matters, plus a comment describing the
reclaim recursion context (e.g. this lock will be held in the reclaim
path here and there).

> If we are sending too many GFP_NOFS allocations such that it's causing
> heartburn, we could fix this. (xattr commands are rare enough that I
> didn't think it was worth it to modulate the GFP flags for this
> particular case, but we could make it smarter if it would help.)

Well, the vmalloc is actually a correctness issue rather than a
heartburn...

> > If the use is really valid then we have a way to do the vmalloc
> > allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> > does that work? You simply call memalloc_nofs_save when the reclaim
> > recursion critical section starts (e.g. when you take a lock which is
> > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> > when the critical section ends. _All_ allocations within that scope
> > will get GFP_NOFS semantics automagically. If you are not sure about the
> > scope itself then the easiest workaround is to wrap the vmalloc itself
> > with a big fat comment that this should be revisited.
>
> This is something we could do in ext4. It hadn't been high priority,
> because we've been rather overloaded.

Well, ext4/jbd already has scopes defined for the transaction context, so
anything down that road can be converted to GFP_KERNEL (well, unless the
same code path is shared outside of the transaction context and still
requires protection). It would be really great to identify other contexts
and slowly move away from the explicit GFP_NOFS. Are you aware of other
contexts?

> As a suggestion, could you take the documentation about how to convert
> to the memalloc_nofs_{save,restore} scope api (which I think you've
> written about in e-mails at length before), and put that into a file in
> Documentation/core-api?

I can.

> The question I was trying to figure out, which triggered the above
> request, is how/whether to gradually convert to that scope API. Is it
> safe to add the memalloc_nofs_{save,restore} calls to the code and keep
> the GFP_NOFS flags until we're sure we got it all right, for all of the
> code paths, and then drop the GFP_NOFS?

The first stage is to define and document those scopes. I have provided
a debugging patch [1] in the past that would dump_stack when seeing an
explicit GFP_NOFS from within a scope, which could help to eliminate the
existing users.

[1] http://lkml.kernel.org/r/20170106141845.24362-1-mho...@kernel.org
--
Michal Hocko
SUSE Labs
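As a concrete illustration of the "define and document the scope first"
step, the jbd2 transaction context already does roughly the following;
this is sketched from memory with simplified names, so the exact shape
and field name should be checked against the tree. The scope is opened
when a handle is started and closed when it is stopped.

#include <linux/sched/mm.h>
#include <linux/slab.h>

/* Simplified handle, loosely modelled on jbd2's handle_t. */
struct txn_handle {
	unsigned int saved_alloc_context;
	/* ... journal/transaction state ... */
};

struct txn_handle *txn_start(void)
{
	struct txn_handle *handle = kzalloc(sizeof(*handle), GFP_KERNEL);

	if (!handle)
		return NULL;
	/*
	 * From here on the whole transaction is a documented NOFS scope:
	 * reclaim triggered by any allocation inside it must not re-enter
	 * the fs, because that could require a nested transaction.
	 */
	handle->saved_alloc_context = memalloc_nofs_save();
	return handle;
}

void txn_stop(struct txn_handle *handle)
{
	memalloc_nofs_restore(handle->saved_alloc_context);
	kfree(handle);
}

On the gradual-conversion question in the quoted mail: keeping the
explicit GFP_NOFS at individual call sites while the scopes are rolled
out should be safe, since GFP_NOFS inside a NOFS scope changes nothing
and outside a scope it behaves exactly as before; the flags can then be
dropped once the scope coverage has been verified (for example with the
dump_stack debugging patch referenced above).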
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue 24-04-18 21:03:43, Richard Weinberger wrote: > Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko: > > fs/ubifs/debug.c > > This one is just for debugging. > So, preallocating + locking would not hurt much. > > > fs/ubifs/lprops.c > > Ditto. > > > fs/ubifs/lpt_commit.c > > Here we use it also only in debugging mode and in one case for > fatal error reporting. > No hot paths. > > > fs/ubifs/orphan.c > > Also only for debugging. > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem. > I can prepare a patch. Cool! Anyway, if UBIFS has some reclaim recursion critical sections in general it would be really great to have them documented and that is where the scope api is really handy. Just add the scope and document what is the recursion issue. This will help people reading the code as well. Ideally there shouldn't be any explicit GFP_NOFS in the code. Thanks for a quick turnaround. -- Michal Hocko SUSE Labs
Re: [Cluster-devel] vmalloc with GFP_NOFS
Hi,

On 24/04/18 17:27, Michal Hocko wrote:
> Hi,
> it seems that we still have a few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c
> fs/ext4/xattr.c
> fs/gfs2/dir.c
> fs/gfs2/quota.c
> fs/nfs/blocklayout/extent_tree.c
> fs/ubifs/debug.c
> fs/ubifs/lprops.c
> fs/ubifs/lpt_commit.c
> fs/ubifs/orphan.c
>
> Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> because we do have hardcoded GFP_KERNEL allocations deep inside the
> vmalloc layers. That means that if GFP_NOFS really protects from
> recursion into the fs, and thus from deadlocks, then the vmalloc call
> is broken.
>
> What to do about this? Well, there are two things. Firstly, it would be
> really great to double check whether the GFP_NOFS is really needed. I
> cannot judge that because I am not familiar with the code. It would be
> great if the respective maintainers could check this (hopefully
> get_maintainer.sh pointed me to all the relevant ones). If there is no
> reclaim recursion issue then simply use the standard vmalloc (aka a
> GFP_KERNEL request).

For GFS2, and I suspect for other fs too, it is really needed. We don't
want to enter reclaim while holding filesystem locks.

> If the use is really valid then we have a way to do the vmalloc
> allocation properly. We have the memalloc_nofs_{save,restore} scope
> api. How does that work? You simply call memalloc_nofs_save when the
> reclaim recursion critical section starts (e.g. when you take a lock
> which is then used in the reclaim path - e.g. a shrinker) and
> memalloc_nofs_restore when the critical section ends. _All_
> allocations within that scope will get GFP_NOFS semantics
> automagically. If you are not sure about the scope itself then the
> easiest workaround is to wrap the vmalloc itself with a big fat
> comment that this should be revisited.
>
> Does that sound like something that can be done in a reasonable time?
> I have tried to bring this up in the past but our speed is glacial and
> there are attempts to do hacks like checking for abusers inside the
> vmalloc which is just too ugly to live.
>
> Please do not hesitate to get back to me if something is not clear.
>
> Thanks!

It would be good to fix this, and it has been known as an issue for a
long time. We might well be able to make use of the new API though. It
might be as simple as adding the calls when we get & release glocks, but
I'd have to check the code to be sure,

Steve.
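If the glock-based annotation mentioned above turned out to be the right
scope, it might look something like the sketch below. This is only a
guess at the shape: the __gfs2_glock_nq/__gfs2_glock_dq wrappers and the
gh_alloc_context holder field are invented here for illustration, it
assumes gfs2's internal headers, and the real hook points would need
checking against the code.

#include <linux/sched/mm.h>

/*
 * Sketch only: tie the NOFS scope to holding a glock, on the theory
 * that reclaim must not re-enter the fs while a glock is held.
 * __gfs2_glock_nq/__gfs2_glock_dq and gh_alloc_context are hypothetical.
 */
int gfs2_glock_nq(struct gfs2_holder *gh)
{
	int error = __gfs2_glock_nq(gh);

	if (!error)
		/* remember the previous mask so nested holders restore correctly */
		gh->gh_alloc_context = memalloc_nofs_save();
	return error;
}

void gfs2_glock_dq(struct gfs2_holder *gh)
{
	memalloc_nofs_restore(gh->gh_alloc_context);
	__gfs2_glock_dq(gh);
}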
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue, Apr 24, 2018 at 10:27:12AM -0600, Michal Hocko wrote:
> fs/ext4/xattr.c
>
> What to do about this? Well, there are two things. Firstly, it would be
> really great to double check whether the GFP_NOFS is really needed. I
> cannot judge that because I am not familiar with the code.

*Most* of the time it's not needed, but there are times when it is. We
could be smarter about sending down GFP_NOFS only when it is needed. If
we are sending too many GFP_NOFS allocations such that it's causing
heartburn, we could fix this. (xattr commands are rare enough that I
didn't think it was worth it to modulate the GFP flags for this
particular case, but we could make it smarter if it would help.)

> If the use is really valid then we have a way to do the vmalloc
> allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> does that work? You simply call memalloc_nofs_save when the reclaim
> recursion critical section starts (e.g. when you take a lock which is
> then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> when the critical section ends. _All_ allocations within that scope
> will get GFP_NOFS semantics automagically. If you are not sure about the
> scope itself then the easiest workaround is to wrap the vmalloc itself
> with a big fat comment that this should be revisited.

This is something we could do in ext4. It hadn't been high priority,
because we've been rather overloaded. As a suggestion, could you take the
documentation about how to convert to the memalloc_nofs_{save,restore}
scope api (which I think you've written about in e-mails at length
before), and put that into a file in Documentation/core-api?

The question I was trying to figure out, which triggered the above
request, is how/whether to gradually convert to that scope API. Is it
safe to add the memalloc_nofs_{save,restore} calls to the code and keep
the GFP_NOFS flags until we're sure we got it all right, for all of the
code paths, and then drop the GFP_NOFS?

Thanks,

- Ted
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue, 24 Apr 2018, Michal Hocko wrote: > On Tue 24-04-18 12:46:55, Mikulas Patocka wrote: > > > > > > On Tue, 24 Apr 2018, Michal Hocko wrote: > > > > > Hi, > > > it seems that we still have few vmalloc users who perform GFP_NOFS > > > allocation: > > > drivers/mtd/ubi/io.c > > > fs/ext4/xattr.c > > > fs/gfs2/dir.c > > > fs/gfs2/quota.c > > > fs/nfs/blocklayout/extent_tree.c > > > fs/ubifs/debug.c > > > fs/ubifs/lprops.c > > > fs/ubifs/lpt_commit.c > > > fs/ubifs/orphan.c > > > > > > Unfortunatelly vmalloc doesn't suppoer GFP_NOFS semantinc properly > > > because we do have hardocded GFP_KERNEL allocations deep inside the > > > vmalloc layers. That means that if GFP_NOFS really protects from > > > recursion into the fs deadlocks then the vmalloc call is broken. > > > > > > What to do about this? Well, there are two things. Firstly, it would be > > > really great to double check whether the GFP_NOFS is really needed. I > > > cannot judge that because I am not familiar with the code. It would be > > > great if the respective maintainers (hopefully get_maintainer.sh pointed > > > me to all relevant ones). If there is not reclaim recursion issue then > > > simply use the standard vmalloc (aka GFP_KERNEL request). > > > > > > If the use is really valid then we have a way to do the vmalloc > > > allocation properly. We have memalloc_nofs_{save,restore} scope api. How > > > does that work? You simply call memalloc_nofs_save when the reclaim > > > recursion critical section starts (e.g. when you take a lock which is > > > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore > > > when the critical section ends. _All_ allocations within that scope > > > will get GFP_NOFS semantic automagically. If you are not sure about the > > > scope itself then the easiest workaround is to wrap the vmalloc itself > > > with a big fat comment that this should be revisited. > > > > > > Does that sound like something that can be done in a reasonable time? > > > I have tried to bring this up in the past but our speed is glacial and > > > there are attempts to do hacks like checking for abusers inside the > > > vmalloc which is just too ugly to live. > > > > > > Please do not hesitate to get back to me if something is not clear. > > > > > > Thanks! > > > -- > > > Michal Hocko > > > SUSE Labs > > > > I made a patch that adds memalloc_noio/fs_save around these calls a year > > ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html > > Yeah, and that is the wrong approach. It is crude, but it fixes the deadlock possibility. Then, the maintainers will have a lot of time to refactor the code and move these memalloc_noio_save calls to the proper scope. > Let's try to fix this properly > this time. As the above outlines, the worst case we can end up mid-term > would be to wrap vmalloc calls with the scope api with a TODO. But I am > pretty sure the respective maintainers can come up with a better > solution. I am definitely willing to help here. > -- > Michal Hocko > SUSE Labs Mikulas
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue, 24 Apr 2018, Michal Hocko wrote: > Hi, > it seems that we still have few vmalloc users who perform GFP_NOFS > allocation: > drivers/mtd/ubi/io.c > fs/ext4/xattr.c > fs/gfs2/dir.c > fs/gfs2/quota.c > fs/nfs/blocklayout/extent_tree.c > fs/ubifs/debug.c > fs/ubifs/lprops.c > fs/ubifs/lpt_commit.c > fs/ubifs/orphan.c > > Unfortunatelly vmalloc doesn't suppoer GFP_NOFS semantinc properly > because we do have hardocded GFP_KERNEL allocations deep inside the > vmalloc layers. That means that if GFP_NOFS really protects from > recursion into the fs deadlocks then the vmalloc call is broken. > > What to do about this? Well, there are two things. Firstly, it would be > really great to double check whether the GFP_NOFS is really needed. I > cannot judge that because I am not familiar with the code. It would be > great if the respective maintainers (hopefully get_maintainer.sh pointed > me to all relevant ones). If there is not reclaim recursion issue then > simply use the standard vmalloc (aka GFP_KERNEL request). > > If the use is really valid then we have a way to do the vmalloc > allocation properly. We have memalloc_nofs_{save,restore} scope api. How > does that work? You simply call memalloc_nofs_save when the reclaim > recursion critical section starts (e.g. when you take a lock which is > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore > when the critical section ends. _All_ allocations within that scope > will get GFP_NOFS semantic automagically. If you are not sure about the > scope itself then the easiest workaround is to wrap the vmalloc itself > with a big fat comment that this should be revisited. > > Does that sound like something that can be done in a reasonable time? > I have tried to bring this up in the past but our speed is glacial and > there are attempts to do hacks like checking for abusers inside the > vmalloc which is just too ugly to live. > > Please do not hesitate to get back to me if something is not clear. > > Thanks! > -- > Michal Hocko > SUSE Labs I made a patch that adds memalloc_noio/fs_save around these calls a year ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html Mikulas
Re: [Cluster-devel] vmalloc with GFP_NOFS
On Tue 24-04-18 12:46:55, Mikulas Patocka wrote: > > > On Tue, 24 Apr 2018, Michal Hocko wrote: > > > Hi, > > it seems that we still have few vmalloc users who perform GFP_NOFS > > allocation: > > drivers/mtd/ubi/io.c > > fs/ext4/xattr.c > > fs/gfs2/dir.c > > fs/gfs2/quota.c > > fs/nfs/blocklayout/extent_tree.c > > fs/ubifs/debug.c > > fs/ubifs/lprops.c > > fs/ubifs/lpt_commit.c > > fs/ubifs/orphan.c > > > > Unfortunatelly vmalloc doesn't suppoer GFP_NOFS semantinc properly > > because we do have hardocded GFP_KERNEL allocations deep inside the > > vmalloc layers. That means that if GFP_NOFS really protects from > > recursion into the fs deadlocks then the vmalloc call is broken. > > > > What to do about this? Well, there are two things. Firstly, it would be > > really great to double check whether the GFP_NOFS is really needed. I > > cannot judge that because I am not familiar with the code. It would be > > great if the respective maintainers (hopefully get_maintainer.sh pointed > > me to all relevant ones). If there is not reclaim recursion issue then > > simply use the standard vmalloc (aka GFP_KERNEL request). > > > > If the use is really valid then we have a way to do the vmalloc > > allocation properly. We have memalloc_nofs_{save,restore} scope api. How > > does that work? You simply call memalloc_nofs_save when the reclaim > > recursion critical section starts (e.g. when you take a lock which is > > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore > > when the critical section ends. _All_ allocations within that scope > > will get GFP_NOFS semantic automagically. If you are not sure about the > > scope itself then the easiest workaround is to wrap the vmalloc itself > > with a big fat comment that this should be revisited. > > > > Does that sound like something that can be done in a reasonable time? > > I have tried to bring this up in the past but our speed is glacial and > > there are attempts to do hacks like checking for abusers inside the > > vmalloc which is just too ugly to live. > > > > Please do not hesitate to get back to me if something is not clear. > > > > Thanks! > > -- > > Michal Hocko > > SUSE Labs > > I made a patch that adds memalloc_noio/fs_save around these calls a year > ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html Yeah, and that is the wrong approach. Let's try to fix this properly this time. As the above outlines, the worst case we can end up mid-term would be to wrap vmalloc calls with the scope api with a TODO. But I am pretty sure the respective maintainers can come up with a better solution. I am definitely willing to help here. -- Michal Hocko SUSE Labs
[Cluster-devel] vmalloc with GFP_NOFS
Hi,
it seems that we still have a few vmalloc users who perform GFP_NOFS
allocation:

drivers/mtd/ubi/io.c
fs/ext4/xattr.c
fs/gfs2/dir.c
fs/gfs2/quota.c
fs/nfs/blocklayout/extent_tree.c
fs/ubifs/debug.c
fs/ubifs/lprops.c
fs/ubifs/lpt_commit.c
fs/ubifs/orphan.c

Unfortunately vmalloc doesn't support GFP_NOFS semantics properly because
we do have hardcoded GFP_KERNEL allocations deep inside the vmalloc
layers. That means that if GFP_NOFS really protects from recursion into
the fs, and thus from deadlocks, then the vmalloc call is broken.

What to do about this? Well, there are two things. Firstly, it would be
really great to double check whether the GFP_NOFS is really needed. I
cannot judge that because I am not familiar with the code. It would be
great if the respective maintainers could check this (hopefully
get_maintainer.sh pointed me to all the relevant ones). If there is no
reclaim recursion issue then simply use the standard vmalloc (aka a
GFP_KERNEL request).

If the use is really valid then we have a way to do the vmalloc
allocation properly. We have the memalloc_nofs_{save,restore} scope api.
How does that work? You simply call memalloc_nofs_save when the reclaim
recursion critical section starts (e.g. when you take a lock which is
then used in the reclaim path - e.g. a shrinker) and
memalloc_nofs_restore when the critical section ends. _All_ allocations
within that scope will get GFP_NOFS semantics automagically. If you are
not sure about the scope itself then the easiest workaround is to wrap
the vmalloc itself with a big fat comment that this should be revisited.

Does that sound like something that can be done in a reasonable time? I
have tried to bring this up in the past but our speed is glacial and
there are attempts to do hacks like checking for abusers inside vmalloc,
which is just too ugly to live.

Please do not hesitate to get back to me if something is not clear.

Thanks!
--
Michal Hocko
SUSE Labs
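To make the scope API description above concrete, a minimal sketch could
look like the following; the lock and function are hypothetical and not
taken from any of the filesystems listed.

#include <linux/mutex.h>
#include <linux/sched/mm.h>	/* memalloc_nofs_save/restore */
#include <linux/vmalloc.h>

/*
 * Hypothetical: assume this lock is also taken from a shrinker, i.e.
 * from the reclaim path, so reclaim must not re-enter the fs while we
 * hold it.
 */
static DEFINE_MUTEX(some_fs_lock);

static void *some_fs_prepare_buffer(unsigned long size)
{
	unsigned int nofs_flags;
	void *buf;

	mutex_lock(&some_fs_lock);

	/*
	 * Start of the reclaim-recursion critical section: every
	 * allocation in this task, including the GFP_KERNEL ones done
	 * internally by vmalloc(), is now implicitly GFP_NOFS.
	 */
	nofs_flags = memalloc_nofs_save();

	buf = vmalloc(size);	/* plain GFP_KERNEL request, scoped to NOFS */

	memalloc_nofs_restore(nofs_flags);
	mutex_unlock(&some_fs_lock);

	return buf;
}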
Re: [Cluster-devel] About locking granularity of gfs2
Hi,

On 24/04/18 04:54, Guoqing Jiang wrote:
> Hi Steve,
>
> Thanks for your reply.
>
> On 04/24/2018 11:03 AM, Steven Whitehouse wrote:
> > Hi,
> >
> > On 24/04/18 03:52, Guoqing Jiang wrote:
> > > Hi,
> > >
> > > Since gfs2 can "allow parallel allocation from different nodes
> > > simultaneously as the locking granularity is one lock per resource
> > > group" per section 3.2 of [1], could the same locking granularity
> > > also be applied to R/W IO? Then, with the help of "sunit" and
> > > "swidth", we could basically lock a stripe, so all nodes could write
> > > to different stripes in parallel and the basic IO unit would be one
> > > stripe. Since I don't know gfs2 well, I am wondering whether it is
> > > possible to do this, or whether the idea doesn't make sense for some
> > > reason. Any thoughts would be appreciated, thanks.
> > >
> > > I am asking because if people want to add cluster support for
> > > md/raid5, it would be better to get help from the filesystem level
> > > to ensure that only one node can access a stripe at a time;
> > > otherwise we have to lock a stripe in the md layer, which could
> > > cause performance issues.
> > >
> > > [1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-253-260.pdf
> > >
> > > Regards,
> > > Guoqing
> >
> > It is not just performance, it would be correctness too, since there
> > is no guarantee that two nodes are not writing to the same stripe at
> > the same time.
>
> Yes, no fs can guarantee that. I am wondering: if GFS2 is used as a
> local filesystem running on top of raid5, is it possible that gfs2
> writes to two places simultaneously while the two places belong to one
> stripe?

Yes

> > The locking granularity is per-inode generally, but also per-rgrp in
> > the case of rgrps, but that refers only to the header/bitmap since the
> > allocated blocks are subject to the per-inode glocks in general.
>
> Please correct me: does it mean there are two types of locking
> granularity? per-rgrp is for allocation within an rgrp, and per-inode
> is for R/W IO, thanks.

It depends what operation is being undertaken. The per-inode glock covers
all the blocks related to the inode, but during allocation and
deallocation, the responsibility for the allocated and deallocated blocks
passes between the rgrp and the inode to which they relate. So the
situation is more complicated than when no allocation/deallocation is
involved,

Steve.

> Guoqing
Re: [Cluster-devel] About locking granularity of gfs2
On 04/24/2018 01:13 PM, Gang He wrote:
> Stripe unit is a logical volume concept; a file system should not need
> to know about it. For a file system, the access unit is the block
> (a power-of-two multiple of the disk sector size).

IMHO, that is true for a typical fs, but I think zfs and btrfs can be
well aware of it, though no cluster fs can support it now.

Thanks,
Guoqing