Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-05 Thread Dave Chinner
On Wed, Feb 04, 2015 at 09:49:50AM +, Steven Whitehouse wrote:
 Hi,
 
 On 04/02/15 07:13, Oleg Drokin wrote:
 Hello!
 
 On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
 I also wonder if vmalloc is still very slow? That was the case some
 time ago when I noticed a problem in directory access times in gfs2,
 which made us change to use kmalloc with a vmalloc fallback in the
 first place.
 Another of the myths about vmalloc. The speed and scalability of
 vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
 of those problems 5-6 years ago - see the rewrite from 2008 that
 started with commit db64fe0 (mm: rewrite vmap layer)
 This actually might be less true than one would hope. At least somewhat
 recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
 show that there's huge contention on vmlist_lock, so if you have vmalloc
 intense workloads, you get penalized heavily. Granted, this is a RHEL6 kernel,
 but that is still (albeit heavily modified) 2.6.32, which was released at
 the end of 2009, way after 2008.
 I see that vmlist_lock is gone now, but e.g. vmap_area_lock that is heavily
 used is still in place.
 
 So of course with that in place there's every incentive to not use vmalloc
 if at all possible. But if used, one would still hope it would be at least
 safe to do even if somewhat slow.
 
 Bye,
  Oleg
 
 I was thinking back to this thread:
 https://lkml.org/lkml/2010/4/12/207
 
 More recent than 2008, and although it resulted in a patch that
 apparently fixed the problem, I don't think it was ever applied on
 the basis that it was too risky and kmalloc was the proper solution
 anyway. I've not tested recently, so it may have been fixed in
 the meantime.

IIUC, the problem was resolved with a different fix back in 2011 - a
lookaside cache that avoids the overhead of searching the entire
list on every vmalloc. 

commit 89699605fe7cfd8611900346f61cb6cbf179b10a
Author: Nick Piggin npig...@suse.de
Date:   Tue Mar 22 16:30:36 2011 -0700

mm: vmap area cache

Provide a free area cache for the vmalloc virtual address allocator, based
on the algorithm used by the user virtual memory allocator.

This reduces the number of rbtree operations and linear traversals over
the vmap extents in order to find a free area, by starting off at the last
point that a free area was found.

After this patch, the search will start from where it left off, giving
closer to an amortized O(1).

This is verified to solve regressions reported by Steven in GFS2, and Avi in
KVM.
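
The idea in that commit message can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel implementation: the allocated extents live in a small sorted array rather than an rbtree, alignment and the size of skipped holes are ignored, and every name here (struct vspace, area_alloc, area_free, the hint field, the address range) is hypothetical. It only shows why resuming the search at the last allocation point avoids re-walking the existing extents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_AREAS 64
#define VSTART 0x1000UL     /* lowest address handed out (nonzero, so 0 can mean failure) */
#define VEND   0x100000UL   /* end of the modelled address range */

struct area { unsigned long start, end; };  /* allocated extent [start, end) */

struct vspace {
	struct area areas[MAX_AREAS];  /* sorted by start, non-overlapping */
	size_t nr;                     /* number of extents in use */
	size_t hint;                   /* search resumes in the gap before areas[hint] */
	unsigned long steps;           /* extents stepped over, to compare strategies */
};

/* First-fit scan starting at the gap before areas[i]; returns 0 on failure. */
static unsigned long scan_from(struct vspace *vs, size_t i, unsigned long size)
{
	unsigned long addr = i ? vs->areas[i - 1].end : VSTART;

	while (i < vs->nr && vs->areas[i].start - addr < size) {
		addr = vs->areas[i].end;
		i++;
		vs->steps++;
	}
	if (VEND - addr < size)
		return 0;

	/* Open a slot at index i so the array stays sorted. */
	memmove(&vs->areas[i + 1], &vs->areas[i],
		(vs->nr - i) * sizeof(vs->areas[0]));
	vs->areas[i].start = addr;
	vs->areas[i].end = addr + size;
	vs->nr++;
	vs->hint = i + 1;  /* next search starts right after this extent */
	return addr;
}

unsigned long area_alloc(struct vspace *vs, unsigned long size)
{
	unsigned long addr;

	assert(size > 0);
	if (vs->nr == MAX_AREAS)
		return 0;
	addr = scan_from(vs, vs->hint, size);
	if (!addr && vs->hint)
		addr = scan_from(vs, 0, size);  /* nothing above the hint: rescan from the start */
	return addr;
}

int area_free(struct vspace *vs, unsigned long start)
{
	for (size_t i = 0; i < vs->nr; i++) {
		if (vs->areas[i].start != start)
			continue;
		memmove(&vs->areas[i], &vs->areas[i + 1],
			(vs->nr - i - 1) * sizeof(vs->areas[0]));
		vs->nr--;
		if (i < vs->hint)
			vs->hint = i;  /* a hole opened below the hint: pull it back */
		return 0;
	}
	return -1;
}
```

In this model, back-to-back allocations step over no existing extents because each search begins where the previous one ended. Resetting hint to 0 before every call would make each allocation walk all earlier extents, which is the linear behaviour the cache is meant to avoid.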


Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-05 Thread Dave Chinner
On Wed, Feb 04, 2015 at 02:13:29AM -0500, Oleg Drokin wrote:
 Hello!
 
 On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
  I also wonder if vmalloc is still very slow? That was the case some
  time ago when I noticed a problem in directory access times in gfs2,
  which made us change to use kmalloc with a vmalloc fallback in the
  first place.
  Another of the myths about vmalloc. The speed and scalability of
  vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
  of those problems 5-6 years ago - see the rewrite from 2008 that
  started with commit db64fe0 (mm: rewrite vmap layer)
 
 This actually might be less true than one would hope. At least somewhat
 recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
 show that there's huge contention on vmlist_lock, so if you have vmalloc

vmlist_lock and the list it protected went away in 3.10.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-04 Thread Steven Whitehouse

Hi,

On 04/02/15 07:13, Oleg Drokin wrote:

Hello!

On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:

I also wonder if vmalloc is still very slow? That was the case some
time ago when I noticed a problem in directory access times in gfs2,
which made us change to use kmalloc with a vmalloc fallback in the
first place.

Another of the myths about vmalloc. The speed and scalability of
vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
of those problems 5-6 years ago - see the rewrite from 2008 that
started with commit db64fe0 (mm: rewrite vmap layer)

This actually might be less true than one would hope. At least somewhat
recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
show that there's huge contention on vmlist_lock, so if you have vmalloc
intense workloads, you get penalized heavily. Granted, this is a RHEL6 kernel,
but that is still (albeit heavily modified) 2.6.32, which was released at
the end of 2009, way after 2008.
I see that vmlist_lock is gone now, but e.g. vmap_area_lock that is heavily
used is still in place.

So of course with that in place there's every incentive to not use vmalloc
if at all possible. But if used, one would still hope it would be at least
safe to do even if somewhat slow.

Bye,
 Oleg


I was thinking back to this thread:
https://lkml.org/lkml/2010/4/12/207

More recent than 2008, and although it resulted in a patch that 
apparently fixed the problem, I don't think it was ever applied on the 
basis that it was too risky and kmalloc was the proper solution 
anyway. I've not tested recently, so it may have been fixed in the
meantime.


Steve.



Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-03 Thread Oleg Drokin
Hello!

On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
 I also wonder if vmalloc is still very slow? That was the case some
 time ago when I noticed a problem in directory access times in gfs2,
 which made us change to use kmalloc with a vmalloc fallback in the
 first place.
 Another of the myths about vmalloc. The speed and scalability of
 vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
 of those problems 5-6 years ago - see the rewrite from 2008 that
 started with commit db64fe0 (mm: rewrite vmap layer)

This actually might be less true than one would hope. At least somewhat
recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
show that there's huge contention on vmlist_lock, so if you have vmalloc
intense workloads, you get penalized heavily. Granted, this is a RHEL6 kernel,
but that is still (albeit heavily modified) 2.6.32, which was released at
the end of 2009, way after 2008.
I see that vmlist_lock is gone now, but e.g. vmap_area_lock that is heavily
used is still in place.

So of course with that in place there's every incentive to not use vmalloc
if at all possible. But if used, one would still hope it would be at least
safe to do even if somewhat slow.

Bye,
Oleg



Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-02 Thread Steven Whitehouse

Hi,

On 02/02/15 08:11, Dave Chinner wrote:

On Mon, Feb 02, 2015 at 01:57:23AM -0500, Oleg Drokin wrote:

Hello!

On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:


On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:

From: Oleg Drokin gr...@linuxhacker.ru

leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
it clearly does not want any shrinker activity within the fs itself.
convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
this goal.

Signed-off-by: Oleg Drokin gr...@linuxhacker.ru
---
fs/gfs2/dir.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index c5a34f0..6371192 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,

 	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
 	if (ht == NULL)
-		ht = vzalloc(size);
+		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
+			       PAGE_KERNEL);

That, in the end, won't help as vmalloc still uses GFP_KERNEL
allocations deep down in the PTE allocation code. See the hacks in
the DM and XFS code to work around this. i.e. go look for callers of
memalloc_noio_save().  It's ugly and grotesque, but we've got no
other way to limit reclaim context because the MM devs won't pass
the vmalloc gfp context down the stack to the PTE allocations.

Hm, interesting.
So all the other code in the kernel that does this sort of thing (and there's
quite a bit outside of xfs and ocfs2) would not get the desired effect?

No. I expect, however, that very few people would ever see a
deadlock as a result - it's a pretty rare sort of case to hit.
XFS does make extensive use of vm_map_ram() in
GFP_NOFS context, however, when large directory block sizes are in
use, and we also have a history of lockdep throwing warnings under
memory pressure. In the end, the memalloc_noio_save() changes were
made to stop the frequent lockdep reports rather than actual
deadlocks.

Indeed, I think the patch is still an improvement, however, so I'm happy
to apply it while a better solution is found.



So, I did some digging in archives and found this thread from 2010 onward with
various patches and rants.
Not sure how I missed that before.

Should we have another run at this I wonder?

By all means, but I don't think you'll have any more luck than
anyone else in the past. We've still got the problem of attitude
(vmalloc is not for general use) and making it actually work is
seen as encouraging undesirable behaviour. If you can change
attitudes towards vmalloc first, then you'll be much more likely to
make progress in getting these problems solved.



Well, I don't know whether it has to be vmalloc that provides the
solution here. If memory fragmentation could be controlled, then
kmalloc could be used for larger contiguous chunks of memory, which
might be a better solution overall. But I do agree that we need to
try to come to some kind of solution to this problem, as it is one of
those things that has been rumbling on for a long time without a proper
fix.


I also wonder if vmalloc is still very slow? That was the case some time 
ago when I noticed a problem in directory access times in gfs2, which 
made us change to use kmalloc with a vmalloc fallback in the first place.


Steve.




Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-02 Thread Dave Chinner
On Mon, Feb 02, 2015 at 01:57:23AM -0500, Oleg Drokin wrote:
 Hello!
 
 On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:
 
  On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:
  From: Oleg Drokin gr...@linuxhacker.ru
  
  leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
  it clearly does not want any shrinker activity within the fs itself.
  convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
  this goal.
  
  Signed-off-by: Oleg Drokin gr...@linuxhacker.ru
  ---
  fs/gfs2/dir.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
  
  diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
  index c5a34f0..6371192 100644
  --- a/fs/gfs2/dir.c
  +++ b/fs/gfs2/dir.c
  @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
  
   	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
   	if (ht == NULL)
  -		ht = vzalloc(size);
  +		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
  +			       PAGE_KERNEL);
  That, in the end, won't help as vmalloc still uses GFP_KERNEL
  allocations deep down in the PTE allocation code. See the hacks in
  the DM and XFS code to work around this. i.e. go look for callers of
  memalloc_noio_save().  It's ugly and grotesque, but we've got no
  other way to limit reclaim context because the MM devs won't pass
  the vmalloc gfp context down the stack to the PTE allocations.
 
 Hm, interesting.
 So all the other code in the kernel that does this sort of thing (and there's
 quite a bit outside of xfs and ocfs2) would not get the desired effect?

No. I expect, however, that very few people would ever see a
deadlock as a result - it's a pretty rare sort of case to hit.
XFS does make extensive use of vm_map_ram() in
GFP_NOFS context, however, when large directory block sizes are in
use, and we also have a history of lockdep throwing warnings under
memory pressure. In the end, the memalloc_noio_save() changes were
made to stop the frequent lockdep reports rather than actual
deadlocks.

 So, I did some digging in archives and found this thread from 2010 onward
 with various patches and rants.
 Not sure how I missed that before.
 
 Should we have another run at this I wonder?

By all means, but I don't think you'll have any more luck than
anyone else in the past. We've still got the problem of attitude
(vmalloc is not for general use) and making it actually work is
seen as encouraging undesirable behaviour. If you can change
attitudes towards vmalloc first, then you'll be much more likely to
make progress in getting these problems solved.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-01 Thread Dave Chinner
On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:
 From: Oleg Drokin gr...@linuxhacker.ru
 
 leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
 it clearly does not want any shrinker activity within the fs itself.
 convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
 this goal.
 
 Signed-off-by: Oleg Drokin gr...@linuxhacker.ru
 ---
  fs/gfs2/dir.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
 index c5a34f0..6371192 100644
 --- a/fs/gfs2/dir.c
 +++ b/fs/gfs2/dir.c
 @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
 
  	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
  	if (ht == NULL)
 -		ht = vzalloc(size);
 +		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
 +			       PAGE_KERNEL);

That, in the end, won't help as vmalloc still uses GFP_KERNEL
allocations deep down in the PTE allocation code. See the hacks in
the DM and XFS code to work around this. i.e. go look for callers of
memalloc_noio_save().  It's ugly and grotesque, but we've got no
other way to limit reclaim context because the MM devs won't pass
the vmalloc gfp context down the stack to the PTE allocations.
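
The scoped workaround described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not kernel code: the M_* constants and model_* functions are hypothetical stand-ins for the gfp bits, the per-task flag, and the memalloc_noio_save()/memalloc_noio_restore() pair, and the model assumes the scoped flag strips both the IO and FS bits from every allocation made inside the scope.

```c
#include <assert.h>

/* Hypothetical stand-ins for gfp bits; the values are arbitrary. */
#define M_GFP_IO      0x1u
#define M_GFP_FS      0x2u
#define M_GFP_WAIT    0x4u
#define M_GFP_KERNEL  (M_GFP_WAIT | M_GFP_IO | M_GFP_FS)
#define M_GFP_NOFS    (M_GFP_WAIT | M_GFP_IO)

#define M_PF_NOIO     0x1u   /* models a per-task "no IO/FS reclaim" flag */

static unsigned int task_flags;  /* models per-task state; one task in this sketch */

/* Enter a restricted scope; returns the previous state so scopes can nest. */
unsigned int model_noio_save(void)
{
	unsigned int old = task_flags & M_PF_NOIO;

	task_flags |= M_PF_NOIO;
	return old;
}

/* Leave a restricted scope, restoring whatever the matching save returned. */
void model_noio_restore(unsigned int old)
{
	assert((old & ~M_PF_NOIO) == 0);
	task_flags = (task_flags & ~M_PF_NOIO) | old;
}

/* Mask applied to every allocation request, whoever issues it. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	if (task_flags & M_PF_NOIO)
		gfp &= ~(M_GFP_IO | M_GFP_FS);
	return gfp;
}

/* Models a deep callee (page-table allocation) that ignores the caller's mask. */
unsigned int model_pte_alloc(void)
{
	return model_effective_gfp(M_GFP_KERNEL);
}

/* Models vmalloc: data pages honour gfp, page tables do not.
 * Returns the union of gfp bits the whole call may exercise. */
unsigned int model_vmalloc(unsigned int gfp)
{
	return model_effective_gfp(gfp) | model_pte_alloc();
}
```

In this model, a GFP_NOFS-style request still ends up exercising the FS bit through the nested page-table allocation. Only wrapping the whole call in the save/restore pair removes it, which is the shape of the workaround mentioned for DM and XFS.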

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

2015-02-01 Thread Oleg Drokin
Hello!

On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:

 On Sun, Feb 01, 2015 at 10:59:54PM -0500, gr...@linuxhacker.ru wrote:
 From: Oleg Drokin gr...@linuxhacker.ru
 
 leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
 it clearly does not want any shrinker activity within the fs itself.
 convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
 this goal.
 
 Signed-off-by: Oleg Drokin gr...@linuxhacker.ru
 ---
 fs/gfs2/dir.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
 index c5a34f0..6371192 100644
 --- a/fs/gfs2/dir.c
 +++ b/fs/gfs2/dir.c
 @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
 
  	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
  	if (ht == NULL)
 -		ht = vzalloc(size);
 +		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
 +			       PAGE_KERNEL);
 That, in the end, won't help as vmalloc still uses GFP_KERNEL
 allocations deep down in the PTE allocation code. See the hacks in
 the DM and XFS code to work around this. i.e. go look for callers of
 memalloc_noio_save().  It's ugly and grotesque, but we've got no
 other way to limit reclaim context because the MM devs won't pass
 the vmalloc gfp context down the stack to the PTE allocations.

Hm, interesting.
So all the other code in the kernel that does this sort of thing (and there's
quite a bit outside of xfs and ocfs2) would not get the desired effect?

So, I did some digging in archives and found this thread from 2010 onward with
various patches and rants.
Not sure how I missed that before.

Should we have another run at this I wonder?

Bye,
Oleg