Re: [patch 07/99] btrfs: Use mempools for extent_state structures

2011-12-02 Thread Jeff Mahoney

On 12/01/2011 02:55 PM, Jeff Mahoney wrote:
 On 11/28/2011 07:04 PM, Jeff Mahoney wrote:
 On 11/28/2011 06:53 PM, Andi Kleen wrote:
 Jeff Mahoney <je...@suse.com> writes:
 
 The extent_state structure is used at the core of the extent 
 i/o code for managing flags, locking, etc. It requires 
 allocations deep in the write code and if failures occur
 they are difficult to recover from.
 
 We avoid most of the failures by using a mempool, which can 
 sleep when required, to honor the allocations. This allows 
 future patches to convert most of the 
 {set,clear,convert}_extent_bit and derivatives to return
 void.
 
 Is this really safe?
 
 iirc if there's any operation that needs multiple mempool 
 objects you can get into the following ABBA style deadlock:
 
 thread 1                          thread 2
 
 get object A from pool 1
                                    get object C from pool 2
 get object B from pool 2
 pool 2 full -> block
                                    get object D from pool 2
                                                      ^ pool1, i assume
                                    pool1 full -> block
 
 
 Now for thread 1 to make progress it needs thread 2 to free its
 object first, but that in turn needs thread 1 to free its object
 first, which is a deadlock.
 
 It would still work if there are other users which eventually
 make progress, but if you're really unlucky either pool 1 or 2
 is completely used by threads doing a multiple-object operation.
 So you get a nasty rare deadlock ...
 
 Yes, I think you're right. I think the risk is probably there
 and I'm not sure it can be totally mitigated. We'd be stuck if
 there is a string of pathological cases and both mempools are
 empty. The only way I can see to help the situation is to make
 the mempool fairly substantial, which is what I did here. I'd
 prefer to make the mempools per-fs, but that would require
 pretty heavy modifications to pass around a per-fs struct. In
 any case, the risk isn't totally eliminated.
 
 The more I look into it, the more I don't think this is an
 uncommon scenario in the kernel. Device mapper draws from a number
 of mempools that can be interspersed with allocations from the bio
 pool. Even without stacking different types of dm devices, I think
 we can run into this scenario. The DM scenario is probably even
 worse since the allocs are GFP_NOIO instead of the (relatively)
 more relaxed GFP_NOFS.
 
 I'm not saying it's ok, just that there are similar sites already
 that seem to be easier to hit but we haven't heard of issues there
 either. It seems to be a behavior that is largely mitigated by
 establishing mempools of sufficient size.

... and the next installment:

My understanding of mempools was inaccurate, and the pathological case
can be hit much more quickly than I anticipated, or at least with a
performance impact I'd rather not see. It turns out that the pool will
be depleted *before* the other pathological cases are hit. In fact,
the pool is dipped into before reclaim starts at all.

Rather than:

normal alloc with gfp_t
(normal alloc fails after trying reclaim)
pool alloc
(pool alloc fails)
(wait, timeout, retry)

it is:
normal alloc with gfp_t & ~(__GFP_WAIT|__GFP_IO)
(which is GFP_NOFS & ~GFP_NOFS for the common case)

.. so:

normal alloc with GFP_NOWAIT
(normal alloc fails, much more often)
pool alloc
(pool alloc fails)
((wait until an object is freed or timeout) and retry with GFP_NOFS)
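
To make that masking concrete: __GFP_WAIT and __GFP_IO are exactly the
bits that make up GFP_NOFS, so the first attempt runs with a mask of 0,
i.e. GFP_NOWAIT. A toy userspace check of the arithmetic, with the
2011-era flag values copied in by hand (treat the constants as an
assumption, not a header you can include):

#include <stdio.h>

/* Flag values as in that era's include/linux/gfp.h, copied here
 * for illustration only. */
#define __GFP_WAIT 0x10u
#define __GFP_IO   0x40u
#define GFP_NOFS   (__GFP_WAIT | __GFP_IO)

int main(void)
{
	/* The mask mempool_alloc() uses on its first attempt. */
	unsigned int first_try = GFP_NOFS & ~(__GFP_WAIT | __GFP_IO);

	printf("first-attempt mask: %#x\n", first_try); /* prints 0 */
	return 0;
}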

So, using mempools for all cases as I'd hoped isn't what I want. I'll
work up another set of patches that use a straight kmem_cache_alloc
where possible. The good news is that slab-backed mempools can be
replenished with objects from the slab that weren't allocated from the
pool. So there's that.
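
That replenishment behavior comes from mempool_free() itself. Roughly
paraphrased from that era's mm/mempool.c, with the locking and memory
barriers elided -- an illustration, not the verbatim source:

void mempool_free(void *element, mempool_t *pool)
{
	/* Reserve below its minimum: keep this element, no matter
	 * which slab call originally produced it, and wake anyone
	 * blocked waiting on the pool. */
	if (pool->curr_nr < pool->min_nr) {
		add_element(pool, element);
		wake_up(&pool->wait);
		return;
	}

	/* Reserve already full: hand the element back to the slab. */
	pool->free(element, pool->pool_data);
}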

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: [patch 07/99] btrfs: Use mempools for extent_state structures

2011-12-01 Thread Jeff Mahoney

On 11/28/2011 07:04 PM, Jeff Mahoney wrote:
 On 11/28/2011 06:53 PM, Andi Kleen wrote:
 Jeff Mahoney <je...@suse.com> writes:
 
 The extent_state structure is used at the core of the extent
 i/o code for managing flags, locking, etc. It requires
 allocations deep in the write code and if failures occur they
 are difficult to recover from.
 
 We avoid most of the failures by using a mempool, which can
 sleep when required, to honor the allocations. This allows
 future patches to convert most of the
 {set,clear,convert}_extent_bit and derivatives to return void.
 
 Is this really safe?
 
 iirc if there's any operation that needs multiple mempool
 objects you can get into the following ABBA style deadlock:
 
 thread 1                          thread 2
 
 get object A from pool 1
                                    get object C from pool 2
 get object B from pool 2
 pool 2 full -> block
                                    get object D from pool 2
                                                      ^ pool1, i assume
                                    pool1 full -> block
 
 
 Now for thread 1 to make progress it needs thread 2 to free its
 object first, but that in turn needs thread 1 to free its object
 first, which is a deadlock.
 
 It would still work if there are other users which eventually
 make progress, but if you're really unlucky either pool 1 or 2
 is completely used by threads doing a multiple-object operation.
 So you get a nasty rare deadlock ...
 
 Yes, I think you're right. I think the risk is probably there
 and I'm not sure it can be totally mitigated. We'd be stuck if
 there is a string of pathological cases and both mempools are
 empty. The only way I can see to help the situation is to make
 the mempool fairly substantial, which is what I did here. I'd
 prefer to make the mempools per-fs, but that would require
 pretty heavy modifications to pass around a per-fs struct. In
 any case, the risk isn't totally eliminated.

The more I look into it, the more I don't think this is an uncommon
scenario in the kernel. Device mapper draws from a number of mempools
that can be interspersed with allocations from the bio pool. Even
without stacking different types of dm devices, I think we can run
into this scenario. The DM scenario is probably even worse since the
allocs are GFP_NOIO instead of the (relatively) more relaxed GFP_NOFS.

I'm not saying it's ok, just that there are similar sites already that
seem to be easier to hit but we haven't heard of issues there either.
It seems to be a behavior that is largely mitigated by establishing
mempools of sufficient size.

-Jeff

-- 
Jeff Mahoney
SUSE Labs


Re: [patch 07/99] btrfs: Use mempools for extent_state structures

2011-11-28 Thread Andi Kleen
Jeff Mahoney <je...@suse.com> writes:

  The extent_state structure is used at the core of the extent i/o code
  for managing flags, locking, etc. It requires allocations deep in the
  write code and if failures occur they are difficult to recover from.

  We avoid most of the failures by using a mempool, which can sleep when
  required, to honor the allocations. This allows future patches to convert
  most of the {set,clear,convert}_extent_bit and derivatives to return
  void.

Is this really safe?

iirc if there's any operation that needs multiple mempool objects you
can get into the following ABBA style deadlock:

thread 1                          thread 2

get object A from pool 1
                                  get object C from pool 2
get object B from pool 2
pool 2 full -> block
                                  get object D from pool 2
                                  pool1 full -> block


Now for thread 1 to make progress it needs thread 2 to free its
object first, but that in turn needs thread 1 to free its object
first, which is a deadlock.

It would still work if there are other users which eventually
make progress, but if you're really unlucky either pool 1 or 2
is completely used by threads doing a multiple-object operation.
So you get a nasty rare deadlock ...
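
Concretely, reducing each pool to a counting semaphore over a single
reserve object, the interleaving above can be reproduced in a toy
userspace program (my own sketch, not kernel code). Build it with
gcc -pthread and both threads block in sem_wait() forever:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t pool1, pool2;		/* each "pool" holds one object */

static void *thread1_fn(void *arg)
{
	sem_wait(&pool1);		/* get object A from pool 1 */
	sleep(1);			/* let thread 2 drain pool 2 */
	puts("thread 1: pool 2 empty, blocking");
	sem_wait(&pool2);		/* get object B: blocks forever */
	return arg;
}

static void *thread2_fn(void *arg)
{
	sem_wait(&pool2);		/* get object C from pool 2 */
	sleep(1);
	puts("thread 2: pool 1 empty, blocking");
	sem_wait(&pool1);		/* get object D: blocks forever */
	return arg;
}

int main(void)
{
	pthread_t t1, t2;

	sem_init(&pool1, 0, 1);
	sem_init(&pool2, 0, 1);
	pthread_create(&t1, NULL, thread1_fn, NULL);
	pthread_create(&t2, NULL, thread2_fn, NULL);
	pthread_join(t1, NULL);		/* never returns: ABBA deadlock */
	pthread_join(t2, NULL);
	return 0;
}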

The only way to handle this would be to always preallocate
all objects needed for an operation in an atomic way.
Or never use more than one object.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [patch 07/99] btrfs: Use mempools for extent_state structures

2011-11-28 Thread Jeff Mahoney

On 11/28/2011 06:53 PM, Andi Kleen wrote:
 Jeff Mahoney <je...@suse.com> writes:
 
 The extent_state structure is used at the core of the extent i/o
 code for managing flags, locking, etc. It requires allocations
 deep in the write code and if failures occur they are difficult
 to recover from.
 
 We avoid most of the failures by using a mempool, which can sleep
 when required, to honor the allocations. This allows future
 patches to convert most of the {set,clear,convert}_extent_bit and
 derivatives to return void.
 
 Is this really safe?
 
 iirc if there's any operation that needs multiple mempool objects
 you can get into the following ABBA style deadlock:
 
 thread 1                          thread 2
 
 get object A from pool 1
                                   get object C from pool 2
 get object B from pool 2
 pool 2 full -> block
                                   get object D from pool 2
                                                     ^ pool1, i assume
                                   pool1 full -> block
 
 
 Now for thread 1 to make progress it needs thread 2 to free its
 object first, but that in turn needs thread 1 to free its object
 first, which is a deadlock.
 
 It would still work if there are other users which eventually
 make progress, but if you're really unlucky either pool 1 or 2
 is completely used by threads doing a multiple-object operation.
 So you get a nasty rare deadlock ...

Yes, I think you're right. I think the risk is probably there and I'm
not sure it can be totally mitigated. We'd be stuck if there is a
string of pathological cases and both mempools are empty. The only way
I can see to help the situation is to make the mempool fairly
substantial, which is what I did here. I'd prefer to make the mempools
per-fs, but that would require pretty heavy modifications to pass
around a per-fs struct. In any case, the risk isn't totally
eliminated.

 The only way to handle this would be to always preallocate all
 objects needed for an operation in an atomic way.

This was how I wanted to do it at first, but I couldn't see a way to
make it work without killing performance: the calculations effectively
need to be done twice, and these are pretty common operations. The
other issue is that operations can be pretty complex, so preallocating
for all cases will either be really wasteful or really error prone.
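
For what it's worth, the shape I mean looks roughly like this -- every
helper name below is hypothetical, and this is a sketch of the idea,
not code that exists anywhere in btrfs:

/* Pass 1 walks the range just to size the reservation; pass 2 then
 * repeats the same walk to do the work. That duplicated walk is the
 * performance problem described above. */
static int set_extent_bit_prealloc(struct extent_io_tree *tree,
				   u64 start, u64 end, int bits, gfp_t mask)
{
	struct extent_state *prealloc[MAX_STATES_PER_OP];
	int i, n;

	/* Pass 1: count the worst-case number of splits/inserts. */
	n = count_states_needed(tree, start, end, bits);

	for (i = 0; i < n; i++) {
		prealloc[i] = alloc_extent_state(mask);
		if (!prealloc[i]) {
			while (i--)
				free_extent_state(prealloc[i]);
			return -ENOMEM;
		}
	}

	/* Pass 2: same walk again, now guaranteed not to fail. */
	__set_extent_bit(tree, start, end, bits, prealloc, n);
	return 0;
}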

 Or never use more than one object.

I need to do some analysis here to see if that is even possible.

-Jeff


-- 
Jeff Mahoney
SUSE Labs


[patch 07/99] btrfs: Use mempools for extent_state structures

2011-11-23 Thread Jeff Mahoney
 The extent_state structure is used at the core of the extent i/o code
 for managing flags, locking, etc. It requires allocations deep in the
 write code and if failures occur they are difficult to recover from.

 We avoid most of the failures by using a mempool, which can sleep when
 required, to honor the allocations. This allows future patches to convert
 most of the {set,clear,convert}_extent_bit and derivatives to return
 void.

Signed-off-by: Jeff Mahoney <je...@suse.com>
---
 fs/btrfs/extent_io.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 51 insertions(+), 20 deletions(-)

Index: source/fs/btrfs/extent_io.c
===================================================================
--- source.orig/fs/btrfs/extent_io.c	2011-11-21 14:13:55.000000000 -0500
+++ source/fs/btrfs/extent_io.c	2011-11-21 14:38:23.000000000 -0500
@@ -12,6 +12,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/cleancache.h>
+#include <linux/mempool.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "compat.h"
@@ -21,6 +22,8 @@
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
+static mempool_t *extent_state_pool;
+#define EXTENT_STATE_POOL_SIZE (64*1024)
 
 static LIST_HEAD(buffers);
 static LIST_HEAD(states);
@@ -61,18 +64,28 @@ tree_fs_info(struct extent_io_tree *tree
 int __init extent_io_init(void)
 {
 	extent_state_cache = kmem_cache_create("extent_state",
-   sizeof(struct extent_state), 0,
-   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+   sizeof(struct extent_state), 0,
+   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+   NULL);
if (!extent_state_cache)
return -ENOMEM;
 
+   extent_state_pool = mempool_create_slab_pool(
+   EXTENT_STATE_POOL_SIZE /
+   sizeof(struct extent_state),
+   extent_state_cache);
+   if (!extent_state_pool)
+   goto free_state_cache;
+
 	extent_buffer_cache = kmem_cache_create("extent_buffers",
sizeof(struct extent_buffer), 0,
SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
if (!extent_buffer_cache)
-   goto free_state_cache;
+   goto free_state_mempool;
return 0;
 
+free_state_mempool:
+   mempool_destroy(extent_state_pool);
 free_state_cache:
kmem_cache_destroy(extent_state_cache);
return -ENOMEM;
@@ -103,6 +116,8 @@ void extent_io_exit(void)
 		list_del(&eb->leak_list);
kmem_cache_free(extent_buffer_cache, eb);
}
+   if (extent_state_pool)
+   mempool_destroy(extent_state_pool);
if (extent_state_cache)
kmem_cache_destroy(extent_state_cache);
if (extent_buffer_cache)
@@ -128,7 +143,7 @@ static struct extent_state *alloc_extent
unsigned long flags;
 #endif
 
-   state = kmem_cache_alloc(extent_state_cache, mask);
+   state = mempool_alloc(extent_state_pool, mask);
if (!state)
return state;
 	state->state = 0;
@@ -145,6 +160,12 @@ static struct extent_state *alloc_extent
return state;
 }
 
+static struct extent_state *alloc_extent_state_nofail(gfp_t mask)
+{
+	BUG_ON(!(mask & __GFP_WAIT));
+   return alloc_extent_state(mask);
+}
+
 void free_extent_state(struct extent_state *state)
 {
if (!state)
@@ -160,7 +181,7 @@ void free_extent_state(struct extent_sta
 		spin_unlock_irqrestore(&leak_lock, flags);
 #endif
trace_free_extent_state(state, _RET_IP_);
-   kmem_cache_free(extent_state_cache, state);
+   mempool_free(state, extent_state_pool);
}
 }
 
@@ -437,6 +458,12 @@ static int clear_state_bit(struct extent
return ret;
 }
 
+static void
+assert_atomic_alloc(struct extent_state *prealloc, gfp_t mask)
+{
+	WARN_ON(!prealloc && (mask & __GFP_WAIT));
+}
+
 static struct extent_state *
 alloc_extent_state_atomic(struct extent_state *prealloc)
 {
@@ -464,6 +491,7 @@ NORET_TYPE void extent_io_tree_panic(str
  * the range [start, end] is inclusive.
  *
 * This takes the tree lock, and returns 0 on success and < 0 on error.
+ * If (mask & __GFP_WAIT) != 0, there are no error conditions.
  */
 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 int bits, int wake, int delete,
@@ -486,11 +514,8 @@ int clear_extent_bit(struct extent_io_tr
 	if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
clear = 1;
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
-   prealloc = alloc_extent_state(mask);
-   if (!prealloc)
-   return -ENOMEM;
-   }
+   if