Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
On Wed, Feb 04, 2015 at 09:49:50AM +, Steven Whitehouse wrote:
> Hi,
>
> On 04/02/15 07:13, Oleg Drokin wrote:
> > Hello!
> >
> > On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
> > > > I also wonder if vmalloc is still very slow? That was the case some
> > > > time ago when I noticed a problem in directory access times in gfs2,
> > > > which made us change to use kmalloc with a vmalloc fallback in the
> > > > first place,
> > > Another of the myths about vmalloc. The speed and scalability of
> > > vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst of
> > > those problems 5-6 years ago - see the rewrite from 2008 that started
> > > with commit db64fe0 (mm: rewrite vmap layer).
> > This actually might be less true than one would hope. At least somewhat
> > recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008) show
> > that there's huge contention on vmlist_lock, so if you have
> > vmalloc-intense workloads, you get penalized heavily. Granted, this is a
> > rhel6 kernel, but that is still (albeit heavily modified) 2.6.32, which
> > was released at the end of 2009, way after 2008. I see that vmlist_lock
> > is gone now, but e.g. vmap_area_lock, which is heavily used, is still in
> > place. So of course with that in place there's every incentive to not
> > use vmalloc if at all possible. But if it is used, one would still hope
> > it would be at least safe to do, even if somewhat slow.
> >
> > Bye,
> >     Oleg
> I was thinking back to this thread:
>
> https://lkml.org/lkml/2010/4/12/207
>
> More recent than 2008, and although it resulted in a patch that
> apparently fixed the problem, I don't think it was ever applied, on the
> basis that it was too risky and kmalloc was the proper solution anyway.
> I've not tested recently, so it may have been fixed in the mean time,

IIUC, the problem was resolved with a different fix back in 2011 - a
lookaside cache that avoids the overhead of searching the entire list on
every vmalloc:

commit 89699605fe7cfd8611900346f61cb6cbf179b10a
Author: Nick Piggin <npig...@suse.de>
Date:   Tue Mar 22 16:30:36 2011 -0700

    mm: vmap area cache

    Provide a free area cache for the vmalloc virtual address allocator,
    based on the algorithm used by the user virtual memory allocator.

    This reduces the number of rbtree operations and linear traversals
    over the vmap extents in order to find a free area, by starting off
    at the last point that a free area was found.

    After this patch, the search will start from where it left off,
    giving closer to an amortized O(1).

    This is verified to solve regressions reported by Steven in GFS2,
    and Avi in KVM.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
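The idea behind the vmap area cache above can be illustrated with a toy userspace allocator (a sketch for illustration only; the names, the bitmap representation, and the 64-slot "address space" are invented here and are far simpler than the kernel's rbtree-based implementation): instead of scanning the whole address space from the start on every allocation, remember where the last search ended and resume from there.

```c
#include <stdbool.h>
#include <stddef.h>

#define SPACE 64                 /* toy address space: 64 one-unit slots */
static bool used[SPACE];
static size_t cached_hole = 0;   /* resume point, playing the role of the
                                  * kernel's free_vmap_cache */

/* Find 'len' contiguous free slots, first-fit starting at the cached
 * position; on a miss, retry once from the bottom (a crude analogue of
 * the kernel falling back to a full search when the cache is invalid). */
static long toy_alloc(size_t len)
{
    for (int pass = 0; pass < 2; pass++) {
        size_t start = pass ? 0 : cached_hole;
        size_t run = 0;
        for (size_t i = start; i < SPACE; i++) {
            run = used[i] ? 0 : run + 1;
            if (run == len) {
                size_t base = i + 1 - len;
                for (size_t j = base; j <= i; j++)
                    used[j] = true;
                cached_hole = i + 1;   /* next search resumes here */
                return (long)base;
            }
        }
    }
    return -1;                         /* no hole large enough */
}

static void toy_free(long base, size_t len)
{
    for (size_t i = 0; i < len; i++)
        used[base + i] = false;
    if ((size_t)base < cached_hole)
        cached_hole = base;            /* freed space below the cache:
                                        * rewind so it can be reused */
}
```

A stream of allocations then proceeds from the last search position instead of rescanning the low addresses each time, which is what gives the amortized O(1) behaviour the commit message describes.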
Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
On Wed, Feb 04, 2015 at 02:13:29AM -0500, Oleg Drokin wrote:
> Hello!
>
> On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
> > > I also wonder if vmalloc is still very slow? That was the case some
> > > time ago when I noticed a problem in directory access times in gfs2,
> > > which made us change to use kmalloc with a vmalloc fallback in the
> > > first place,
> > Another of the myths about vmalloc. The speed and scalability of
> > vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst of
> > those problems 5-6 years ago - see the rewrite from 2008 that started
> > with commit db64fe0 (mm: rewrite vmap layer).
> This actually might be less true than one would hope. At least somewhat
> recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008) show
> that there's huge contention on vmlist_lock, so if you have vmalloc

vmlist_lock and the list it protected went away in 3.10.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
Hi,

On 04/02/15 07:13, Oleg Drokin wrote:
> Hello!
>
> On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
> > > I also wonder if vmalloc is still very slow? That was the case some
> > > time ago when I noticed a problem in directory access times in gfs2,
> > > which made us change to use kmalloc with a vmalloc fallback in the
> > > first place,
> > Another of the myths about vmalloc. The speed and scalability of
> > vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst of
> > those problems 5-6 years ago - see the rewrite from 2008 that started
> > with commit db64fe0 (mm: rewrite vmap layer).
> This actually might be less true than one would hope. At least somewhat
> recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008) show
> that there's huge contention on vmlist_lock, so if you have
> vmalloc-intense workloads, you get penalized heavily. Granted, this is a
> rhel6 kernel, but that is still (albeit heavily modified) 2.6.32, which
> was released at the end of 2009, way after 2008. I see that vmlist_lock
> is gone now, but e.g. vmap_area_lock, which is heavily used, is still in
> place. So of course with that in place there's every incentive to not
> use vmalloc if at all possible. But if it is used, one would still hope
> it would be at least safe to do, even if somewhat slow.
>
> Bye,
>     Oleg

I was thinking back to this thread:

https://lkml.org/lkml/2010/4/12/207

More recent than 2008, and although it resulted in a patch that
apparently fixed the problem, I don't think it was ever applied, on the
basis that it was too risky and kmalloc was the proper solution anyway.
I've not tested recently, so it may have been fixed in the mean time,

Steve.
Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
Hello!

On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
> > I also wonder if vmalloc is still very slow? That was the case some
> > time ago when I noticed a problem in directory access times in gfs2,
> > which made us change to use kmalloc with a vmalloc fallback in the
> > first place,
> Another of the myths about vmalloc. The speed and scalability of
> vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst of
> those problems 5-6 years ago - see the rewrite from 2008 that started
> with commit db64fe0 (mm: rewrite vmap layer).

This actually might be less true than one would hope. At least somewhat
recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008) show
that there's huge contention on vmlist_lock, so if you have
vmalloc-intense workloads, you get penalized heavily. Granted, this is a
rhel6 kernel, but that is still (albeit heavily modified) 2.6.32, which
was released at the end of 2009, way after 2008. I see that vmlist_lock
is gone now, but e.g. vmap_area_lock, which is heavily used, is still in
place. So of course with that in place there's every incentive to not
use vmalloc if at all possible. But if it is used, one would still hope
it would be at least safe to do, even if somewhat slow.

Bye,
    Oleg
Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
Hi,

On 02/02/15 08:11, Dave Chinner wrote:
> On Mon, Feb 02, 2015 at 01:57:23AM -0500, Oleg Drokin wrote:
> > Hello!
> >
> > On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:
> > > On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:
> > > > From: Oleg Drokin <gr...@linuxhacker.ru>
> > > >
> > > > leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so it
> > > > clearly does not want any shrinker activity within the fs itself.
> > > > Convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better
> > > > achieve this goal.
> > > >
> > > > Signed-off-by: Oleg Drokin <gr...@linuxhacker.ru>
> > > > ---
> > > >  fs/gfs2/dir.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> > > > index c5a34f0..6371192 100644
> > > > --- a/fs/gfs2/dir.c
> > > > +++ b/fs/gfs2/dir.c
> > > > @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
> > > >  
> > > >  	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
> > > >  	if (ht == NULL)
> > > > -		ht = vzalloc(size);
> > > > +		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
> > > > +			       PAGE_KERNEL);
> > > That, in the end, won't help as vmalloc still uses GFP_KERNEL
> > > allocations deep down in the PTE allocation code. See the hacks in
> > > the DM and XFS code to work around this, i.e. go look for callers of
> > > memalloc_noio_save(). It's ugly and grotesque, but we've got no
> > > other way to limit reclaim context because the MM devs won't pass
> > > the vmalloc gfp context down the stack to the PTE allocations.
> > Hm, interesting. So all the other code in the kernel that does this
> > sort of thing (and there's quite a bit outside of xfs and ocfs2) would
> > not get the desired effect?
> No. I expect, however, that very few people would ever see a deadlock
> as a result - it's a pretty rare sort of kernel case to hit in most
> cases. XFS does make extensive use of vm_map_ram() in GFP_NOFS context,
> however, when large directory block sizes are in use, and we also have
> a history of lockdep throwing warnings under memory pressure. In the
> end, the memalloc_noio_save() changes were made to stop the frequent
> lockdep reports rather than actual deadlocks.

Indeed. I think the patch is still an improvement, however, so I'm happy
to apply it while a better solution is found.

> > So, I did some digging in archives and found this thread from 2010
> > onward with various patches and rants. Not sure how I missed that
> > before. Should we have another run at this I wonder?
> By all means, but I don't think you'll have any more luck than anyone
> else in the past. We've still got the problem of attitude (vmalloc is
> not for general use) and making it actually work is seen as encouraging
> undesirable behaviour. If you can change attitudes towards vmalloc
> first, then you'll be much more likely to make progress in getting
> these problems solved.

Well, I don't know whether it has to be vmalloc that provides the
solution here... if memory fragmentation could be controlled then
kmalloc of larger contiguous chunks of memory could be done using that,
which might be a better solution overall. But I do agree that we need to
try and come to some kind of solution to this problem, as it is one of
those things that has been rumbling on for a long time without a proper
solution.

I also wonder if vmalloc is still very slow? That was the case some time
ago when I noticed a problem in directory access times in gfs2, which
made us change to use kmalloc with a vmalloc fallback in the first
place,

Steve.
Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
On Mon, Feb 02, 2015 at 01:57:23AM -0500, Oleg Drokin wrote:
> Hello!
>
> On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:
> > On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:
> > > From: Oleg Drokin <gr...@linuxhacker.ru>
> > >
> > > leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so it
> > > clearly does not want any shrinker activity within the fs itself.
> > > Convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better
> > > achieve this goal.
> > >
> > > Signed-off-by: Oleg Drokin <gr...@linuxhacker.ru>
> > > ---
> > >  fs/gfs2/dir.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> > > index c5a34f0..6371192 100644
> > > --- a/fs/gfs2/dir.c
> > > +++ b/fs/gfs2/dir.c
> > > @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
> > >  
> > >  	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
> > >  	if (ht == NULL)
> > > -		ht = vzalloc(size);
> > > +		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
> > > +			       PAGE_KERNEL);
> > That, in the end, won't help as vmalloc still uses GFP_KERNEL
> > allocations deep down in the PTE allocation code. See the hacks in the
> > DM and XFS code to work around this, i.e. go look for callers of
> > memalloc_noio_save(). It's ugly and grotesque, but we've got no other
> > way to limit reclaim context because the MM devs won't pass the
> > vmalloc gfp context down the stack to the PTE allocations.
> Hm, interesting. So all the other code in the kernel that does this sort
> of thing (and there's quite a bit outside of xfs and ocfs2) would not
> get the desired effect?

No. I expect, however, that very few people would ever see a deadlock as
a result - it's a pretty rare sort of kernel case to hit in most cases.
XFS does make extensive use of vm_map_ram() in GFP_NOFS context,
however, when large directory block sizes are in use, and we also have a
history of lockdep throwing warnings under memory pressure. In the end,
the memalloc_noio_save() changes were made to stop the frequent lockdep
reports rather than actual deadlocks.

> So, I did some digging in archives and found this thread from 2010
> onward with various patches and rants. Not sure how I missed that
> before. Should we have another run at this I wonder?

By all means, but I don't think you'll have any more luck than anyone
else in the past. We've still got the problem of attitude (vmalloc is
not for general use) and making it actually work is seen as encouraging
undesirable behaviour. If you can change attitudes towards vmalloc
first, then you'll be much more likely to make progress in getting these
problems solved.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:
> From: Oleg Drokin <gr...@linuxhacker.ru>
>
> leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so it
> clearly does not want any shrinker activity within the fs itself.
> Convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
> this goal.
>
> Signed-off-by: Oleg Drokin <gr...@linuxhacker.ru>
> ---
>  fs/gfs2/dir.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> index c5a34f0..6371192 100644
> --- a/fs/gfs2/dir.c
> +++ b/fs/gfs2/dir.c
> @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
>  
>  	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
>  	if (ht == NULL)
> -		ht = vzalloc(size);
> +		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
> +			       PAGE_KERNEL);

That, in the end, won't help as vmalloc still uses GFP_KERNEL
allocations deep down in the PTE allocation code. See the hacks in the
DM and XFS code to work around this, i.e. go look for callers of
memalloc_noio_save(). It's ugly and grotesque, but we've got no other
way to limit reclaim context because the MM devs won't pass the vmalloc
gfp context down the stack to the PTE allocations.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
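The memalloc_noio_save() workaround Dave refers to can be sketched roughly as follows. This is a hedged sketch, not code from any patch in this thread: the wrapper name vzalloc_nofs is hypothetical, while memalloc_noio_save()/memalloc_noio_restore() are the real kernel APIs (they set and clear PF_MEMALLOC_NOIO on the current task, so that allocations made underneath, including the GFP_KERNEL PTE allocations inside vmalloc, are implicitly degraded to GFP_NOIO). It is a kernel fragment and not compilable outside a kernel tree.

```c
#include <linux/sched.h>
#include <linux/vmalloc.h>

/* Hypothetical wrapper sketching the XFS/DM-style workaround:
 * force the whole vmalloc call, PTE allocations included, to avoid
 * recursing into I/O (and hence the fs) during reclaim. */
static void *vzalloc_nofs(size_t size)
{
	unsigned int noio_flag;
	void *ptr;

	noio_flag = memalloc_noio_save();  /* sets PF_MEMALLOC_NOIO */
	ptr = vzalloc(size);
	memalloc_noio_restore(noio_flag);  /* restore previous state */
	return ptr;
}
```

Note this is a bigger hammer than GFP_NOFS: PF_MEMALLOC_NOIO forbids all I/O during reclaim, not just filesystem recursion, which is part of why Dave calls the approach "ugly and grotesque".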
Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.
Hello!

On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:
> On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:
> > From: Oleg Drokin <gr...@linuxhacker.ru>
> >
> > leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so it
> > clearly does not want any shrinker activity within the fs itself.
> > Convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
> > this goal.
> >
> > Signed-off-by: Oleg Drokin <gr...@linuxhacker.ru>
> > ---
> >  fs/gfs2/dir.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> > index c5a34f0..6371192 100644
> > --- a/fs/gfs2/dir.c
> > +++ b/fs/gfs2/dir.c
> > @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
> >  
> >  	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
> >  	if (ht == NULL)
> > -		ht = vzalloc(size);
> > +		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
> > +			       PAGE_KERNEL);
> That, in the end, won't help as vmalloc still uses GFP_KERNEL
> allocations deep down in the PTE allocation code. See the hacks in the
> DM and XFS code to work around this, i.e. go look for callers of
> memalloc_noio_save(). It's ugly and grotesque, but we've got no other
> way to limit reclaim context because the MM devs won't pass the vmalloc
> gfp context down the stack to the PTE allocations.

Hm, interesting. So all the other code in the kernel that does this sort
of thing (and there's quite a bit outside of xfs and ocfs2) would not
get the desired effect?

So, I did some digging in archives and found this thread from 2010
onward with various patches and rants. Not sure how I missed that
before. Should we have another run at this I wonder?

Bye,
    Oleg