Re: [patch 07/99] btrfs: Use mempools for extent_state structures
On 12/01/2011 02:55 PM, Jeff Mahoney wrote:
> On 11/28/2011 07:04 PM, Jeff Mahoney wrote:
>> On 11/28/2011 06:53 PM, Andi Kleen wrote:
>>> Jeff Mahoney writes:
>>>> The extent_state structure is used at the core of the extent i/o
>>>> code for managing flags, locking, etc. It requires allocations
>>>> deep in the write code, and if failures occur they are difficult
>>>> to recover from. We avoid most of the failures by using a
>>>> mempool, which can sleep when required, to honor the allocations.
>>>> This allows future patches to convert most of the
>>>> {set,clear,convert}_extent_bit and derivatives to return void.
>>>
>>> Is this really safe?
>>>
>>> iirc if there's any operation that needs multiple mempool objects
>>> you can get into the following ABBA-style deadlock:
>>>
>>> thread 1                       thread 2
>>>
>>> get object A from pool 1       get object C from pool 2
>>> get object B from pool 2
>>> pool 2 full -> block           get object D from pool 2
>>                                                       ^ pool 1, I assume
>>>                                pool 1 full -> block
>>>
>>> Now for thread 1 to make progress it needs thread 2 to free its
>>> object first, but that needs thread 1 to free its object first,
>>> which is a deadlock.
>>>
>>> It would still work if there are other users which eventually make
>>> progress, but if you're really unlucky, either pool 1 or pool 2 is
>>> completely used up by threads doing a multiple-object operation.
>>> So you've got a nasty, rare deadlock ...
>>
>> Yes, I think you're right. I think the risk is probably there and
>> I'm not sure it can be totally mitigated. We'd be stuck if there
>> were a string of pathological cases and both mempools were empty.
>> The only way I can see to help the situation is to make the mempool
>> fairly substantial, which is what I did here. I'd prefer to make
>> the mempools per-fs, but that would require pretty heavy
>> modifications in order to pass around a per-fs struct. In any case,
>> the risk isn't totally eliminated.
>
> The more I look into it, the more I think this is not an uncommon
> scenario in the kernel. Device mapper draws from a number of mempools
> that can be interspersed with allocations from the bio pool. Even
> without stacking different types of dm devices, I think we can run
> into this scenario. The DM case is probably even worse, since those
> allocs are GFP_NOIO instead of the (relatively) more relaxed
> GFP_NOFS.
>
> I'm not saying it's ok, just that there are similar sites already
> that seem to be easier to hit, yet we haven't heard of issues there
> either. It seems to be a behavior that is largely mitigated by
> establishing mempools of sufficient size.

... and the next installment: my understanding of mempools was
inaccurate, and the pathological case can be hit much more quickly than
I anticipated, or at least has a performance impact I'd rather not see.
It turns out that the pool will be depleted *before* the other
pathological cases are hit. In fact, the pool is hit before reclaim
starts at all. Rather than:

  normal alloc with gfp_t
  (normal alloc fails after trying reclaim)
  pool alloc
  (pool alloc fails)
  (wait, timeout, retry)

it is:

  normal alloc with gfp_t & ~(__GFP_WAIT|__GFP_IO)
  (which is GFP_NOFS & ~GFP_NOFS for the common case)

.. so:

  normal alloc with GFP_NOWAIT
  (normal alloc fails) (much more often)
  pool alloc
  (pool alloc fails)
  ((wait until an object is freed or timeout) and retry with GFP_NOFS)

So, using mempools for all cases as I'd hoped isn't what I want. I'll
work up another set of patches that use a straight kmem_cache_alloc
where possible.
The good news is that slab-backed mempools can be replenished with
objects from the slab that weren't allocated from the pool. So there's
that.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
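[The allocation ordering and the replenishment behavior described above
both come from mm/mempool.c. Below is a condensed paraphrase of
mempool_alloc() and mempool_free() from kernels of that era, with
locking, memory barriers, and wait-queue bookkeeping trimmed; treat it
as a sketch of the control flow, not compilable source.]

	/* Condensed paraphrase of mm/mempool.c:mempool_alloc(), ~3.2 era. */
	void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
	{
		void *element;
		gfp_t gfp_temp;

		gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;

		/*
		 * The first pass strips __GFP_WAIT and __GFP_IO, so a
		 * GFP_NOFS caller is effectively doing a GFP_NOWAIT
		 * allocation here: no reclaim at all before the pool
		 * reserve is tapped.
		 */
		gfp_temp = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

	repeat_alloc:
		element = pool->alloc(gfp_temp, pool->pool_data);
		if (element)
			return element;

		/* Fall back to a reserved object from the pool, if any. */
		if (pool->curr_nr)
			return remove_element(pool);

		/* Atomic callers get NULL here; sleepers wait and retry. */
		if (!(gfp_mask & __GFP_WAIT))
			return NULL;

		/* Only now is reclaim allowed on the retry. */
		gfp_temp = gfp_mask;
		io_schedule_timeout(5 * HZ);	/* or woken by mempool_free() */
		goto repeat_alloc;
	}

	/*
	 * mempool_free(), likewise condensed: any element freed while the
	 * pool is below min_nr refills the reserve, which is why a
	 * slab-backed pool can be replenished by objects that never came
	 * from the pool in the first place.
	 */
	void mempool_free(void *element, mempool_t *pool)
	{
		if (!element)
			return;
		if (pool->curr_nr < pool->min_nr) {
			add_element(pool, element);
			wake_up(&pool->wait);
			return;
		}
		pool->free(element, pool->pool_data);
	}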
Re: [patch 07/99] btrfs: Use mempools for extent_state structures
On 11/28/2011 07:04 PM, Jeff Mahoney wrote:
> On 11/28/2011 06:53 PM, Andi Kleen wrote:
>> Jeff Mahoney writes:
>>> The extent_state structure is used at the core of the extent i/o
>>> code for managing flags, locking, etc. It requires allocations
>>> deep in the write code, and if failures occur they are difficult
>>> to recover from.
>>>
>>> We avoid most of the failures by using a mempool, which can sleep
>>> when required, to honor the allocations. This allows future
>>> patches to convert most of the {set,clear,convert}_extent_bit and
>>> derivatives to return void.
>>
>> Is this really safe?
>>
>> iirc if there's any operation that needs multiple mempool objects
>> you can get into the following ABBA-style deadlock:
>>
>> thread 1                       thread 2
>>
>> get object A from pool 1       get object C from pool 2
>> get object B from pool 2
>> pool 2 full -> block           get object D from pool 2
>                                                       ^ pool 1, I assume
>>                                pool 1 full -> block
>>
>> Now for thread 1 to make progress it needs thread 2 to free its
>> object first, but that needs thread 1 to free its object first,
>> which is a deadlock.
>>
>> It would still work if there are other users which eventually make
>> progress, but if you're really unlucky, either pool 1 or pool 2 is
>> completely used up by threads doing a multiple-object operation. So
>> you've got a nasty, rare deadlock ...
>
> Yes, I think you're right. I think the risk is probably there and
> I'm not sure it can be totally mitigated. We'd be stuck if there
> were a string of pathological cases and both mempools were empty.
> The only way I can see to help the situation is to make the mempool
> fairly substantial, which is what I did here. I'd prefer to make the
> mempools per-fs, but that would require pretty heavy modifications
> in order to pass around a per-fs struct. In any case, the risk isn't
> totally eliminated.

The more I look into it, the more I think this is not an uncommon
scenario in the kernel. Device mapper draws from a number of mempools
that can be interspersed with allocations from the bio pool. Even
without stacking different types of dm devices, I think we can run into
this scenario. The DM case is probably even worse, since those allocs
are GFP_NOIO instead of the (relatively) more relaxed GFP_NOFS.

I'm not saying it's ok, just that there are similar sites already that
seem to be easier to hit, yet we haven't heard of issues there either.
It seems to be a behavior that is largely mitigated by establishing
mempools of sufficient size.
-Jeff

-- 
Jeff Mahoney
SUSE Labs
Re: [patch 07/99] btrfs: Use mempools for extent_state structures
On 11/28/2011 06:53 PM, Andi Kleen wrote:
> Jeff Mahoney writes:
>
>> The extent_state structure is used at the core of the extent i/o
>> code for managing flags, locking, etc. It requires allocations deep
>> in the write code, and if failures occur they are difficult to
>> recover from.
>>
>> We avoid most of the failures by using a mempool, which can sleep
>> when required, to honor the allocations. This allows future patches
>> to convert most of the {set,clear,convert}_extent_bit and
>> derivatives to return void.
>
> Is this really safe?
>
> iirc if there's any operation that needs multiple mempool objects
> you can get into the following ABBA-style deadlock:
>
> thread 1                       thread 2
>
> get object A from pool 1       get object C from pool 2
> get object B from pool 2
> pool 2 full -> block           get object D from pool 2
                                                      ^ pool 1, I assume
>                                pool 1 full -> block
>
> Now for thread 1 to make progress it needs thread 2 to free its
> object first, but that needs thread 1 to free its object first,
> which is a deadlock.
>
> It would still work if there are other users which eventually make
> progress, but if you're really unlucky, either pool 1 or pool 2 is
> completely used up by threads doing a multiple-object operation. So
> you've got a nasty, rare deadlock ...

Yes, I think you're right. I think the risk is probably there and I'm
not sure it can be totally mitigated. We'd be stuck if there were a
string of pathological cases and both mempools were empty. The only way
I can see to help the situation is to make the mempool fairly
substantial, which is what I did here. I'd prefer to make the mempools
per-fs, but that would require pretty heavy modifications in order to
pass around a per-fs struct. In any case, the risk isn't totally
eliminated.

> The only way to handle this would be to always preallocate all
> objects needed for an operation in an atomic way.

This was how I wanted to do it at first, but I couldn't see a way to
make it work without killing performance, since the calculations
effectively need to be done twice and are pretty common. The other
issue is that operations can be pretty complex, and preallocating for
all cases would be either really wasteful or really error-prone.

> Or never use more than one object.

I need to do some analysis here to see if that is even possible.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
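[For context on the "calculations effectively need to be done twice"
point: extent_io.c already uses a preallocate-outside-the-lock pattern.
The sketch below models that pattern; the function name
modify_range_sketch and the need_more flag are illustrative inventions,
not code from the patch.]

	/*
	 * Sketch of the preallocate-then-consume pattern under discussion,
	 * modeled on the existing `prealloc` handling in clear_extent_bit().
	 */
	static int modify_range_sketch(struct extent_io_tree *tree,
				       u64 start, u64 end, gfp_t mask)
	{
		struct extent_state *prealloc = NULL;
		int need_more;

	again:
		if (!prealloc && (mask & __GFP_WAIT)) {
			/* Sleeping allocation, done before tree->lock is held. */
			prealloc = alloc_extent_state(mask);
			if (!prealloc)
				return -ENOMEM;
		}

		spin_lock(&tree->lock);
		/*
		 * Walk the tree; a split or insert consumes prealloc (and sets
		 * it to NULL).  If the walk discovers it needs a second object,
		 * it cannot sleep here, so it must drop the lock and redo the
		 * walk from the top -- that redo is the double-calculation cost.
		 */
		need_more = 0;	/* set by the (omitted) walk */
		spin_unlock(&tree->lock);

		if (need_more)
			goto again;

		if (prealloc)
			free_extent_state(prealloc);
		return 0;
	}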
Re: [patch 07/99] btrfs: Use mempools for extent_state structures
Jeff Mahoney writes:

> The extent_state structure is used at the core of the extent i/o
> code for managing flags, locking, etc. It requires allocations deep
> in the write code, and if failures occur they are difficult to
> recover from.
>
> We avoid most of the failures by using a mempool, which can sleep
> when required, to honor the allocations. This allows future patches
> to convert most of the {set,clear,convert}_extent_bit and
> derivatives to return void.

Is this really safe?

iirc if there's any operation that needs multiple mempool objects you
can get into the following ABBA-style deadlock:

thread 1                       thread 2

get object A from pool 1       get object C from pool 2
get object B from pool 2
pool 2 full -> block           get object D from pool 2
                               pool1 full -> block

Now for thread 1 to make progress it needs thread 2 to free its object
first, but that needs thread 1 to free its object first, which is a
deadlock.

It would still work if there are other users which eventually make
progress, but if you're really unlucky, either pool 1 or pool 2 is
completely used up by threads doing a multiple-object operation. So
you've got a nasty, rare deadlock ...

The only way to handle this would be to always preallocate all objects
needed for an operation in an atomic way.

Or never use more than one object.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only
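[To make the interleaving concrete, here is a small userspace model of
the scenario; it is an added illustration, not code from the thread.
Each pool is reduced to a counting semaphore holding one reserved
object, which stands in for the worst case where memory pressure is
high enough that mempool_alloc can only succeed from the reserve.
Build with `cc -pthread`; the program deadlocks by design.]

	#include <pthread.h>
	#include <semaphore.h>
	#include <stdio.h>
	#include <unistd.h>

	static sem_t pool1, pool2;	/* each "pool" holds exactly one object */

	static void *thread1(void *arg)
	{
		(void)arg;
		sem_wait(&pool1);	/* get object A from pool 1 */
		sleep(1);		/* let thread 2 drain pool 2 */
		sem_wait(&pool2);	/* get object B from pool 2: blocks forever */
		puts("thread 1 never gets here");
		return NULL;
	}

	static void *thread2(void *arg)
	{
		(void)arg;
		sem_wait(&pool2);	/* get object C from pool 2 */
		sleep(1);		/* let thread 1 drain pool 1 */
		sem_wait(&pool1);	/* get object D from pool 1: blocks forever */
		puts("thread 2 never gets here");
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		sem_init(&pool1, 0, 1);
		sem_init(&pool2, 0, 1);
		pthread_create(&t1, NULL, thread1, NULL);
		pthread_create(&t2, NULL, thread2, NULL);
		pthread_join(t1, NULL);	/* never returns: the ABBA deadlock */
		pthread_join(t2, NULL);
		return 0;
	}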
[patch 07/99] btrfs: Use mempools for extent_state structures
The extent_state structure is used at the core of the extent i/o code
for managing flags, locking, etc. It requires allocations deep in the
write code and if failures occur they are difficult to recover from.

We avoid most of the failures by using a mempool, which can sleep when
required, to honor the allocations. This allows future patches to
convert most of the {set,clear,convert}_extent_bit and derivatives to
return void.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/extent_io.c |   71 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 51 insertions(+), 20 deletions(-)

Index: source/fs/btrfs/extent_io.c
===================================================================
--- source.orig/fs/btrfs/extent_io.c	2011-11-21 14:13:55.000000000 -0500
+++ source/fs/btrfs/extent_io.c	2011-11-21 14:38:23.000000000 -0500
@@ -12,6 +12,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/cleancache.h>
+#include <linux/mempool.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "compat.h"
@@ -21,6 +22,8 @@
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
+static mempool_t *extent_state_pool;
+#define EXTENT_STATE_POOL_SIZE (64*1024)
 
 static LIST_HEAD(buffers);
 static LIST_HEAD(states);
@@ -61,18 +64,28 @@ tree_fs_info(struct extent_io_tree *tree
 int __init extent_io_init(void)
 {
 	extent_state_cache = kmem_cache_create("extent_state",
-			sizeof(struct extent_state), 0,
-			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+					sizeof(struct extent_state), 0,
+					SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+					NULL);
 	if (!extent_state_cache)
 		return -ENOMEM;
 
+	extent_state_pool = mempool_create_slab_pool(
+					EXTENT_STATE_POOL_SIZE /
+					sizeof(struct extent_state),
+					extent_state_cache);
+	if (!extent_state_pool)
+		goto free_state_cache;
+
 	extent_buffer_cache = kmem_cache_create("extent_buffers",
 			sizeof(struct extent_buffer), 0,
 			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
 	if (!extent_buffer_cache)
-		goto free_state_cache;
+		goto free_state_mempool;
 	return 0;
 
+free_state_mempool:
+	mempool_destroy(extent_state_pool);
 free_state_cache:
 	kmem_cache_destroy(extent_state_cache);
 	return -ENOMEM;
@@ -103,6 +116,8 @@ void extent_io_exit(void)
 		list_del(&eb->leak_list);
 		kmem_cache_free(extent_buffer_cache, eb);
 	}
+	if (extent_state_pool)
+		mempool_destroy(extent_state_pool);
 	if (extent_state_cache)
 		kmem_cache_destroy(extent_state_cache);
 	if (extent_buffer_cache)
@@ -128,7 +143,7 @@ static struct extent_state *alloc_extent
 	unsigned long flags;
 #endif
 
-	state = kmem_cache_alloc(extent_state_cache, mask);
+	state = mempool_alloc(extent_state_pool, mask);
 	if (!state)
 		return state;
 	state->state = 0;
@@ -145,6 +160,12 @@ static struct extent_state *alloc_extent
 	return state;
 }
 
+static struct extent_state *alloc_extent_state_nofail(gfp_t mask)
+{
+	BUG_ON(!(mask & __GFP_WAIT));
+	return alloc_extent_state(mask);
+}
+
 void free_extent_state(struct extent_state *state)
 {
 	if (!state)
@@ -160,7 +181,7 @@ void free_extent_sta
 		spin_unlock_irqrestore(&leak_lock, flags);
 #endif
 		trace_free_extent_state(state, _RET_IP_);
-		kmem_cache_free(extent_state_cache, state);
+		mempool_free(state, extent_state_pool);
 	}
 }
 
@@ -437,6 +458,12 @@ static int clear_state_bit(struct extent
 	return ret;
 }
 
+static void
+assert_atomic_alloc(struct extent_state *prealloc, gfp_t mask)
+{
+	WARN_ON(!prealloc && (mask & __GFP_WAIT));
+}
+
 static struct extent_state *
 alloc_extent_state_atomic(struct extent_state *prealloc)
 {
@@ -464,6 +491,7 @@ NORET_TYPE void extent_io_tree_panic(str
  * the range [start, end] is inclusive.
  *
  * This takes the tree lock, and returns 0 on success and < 0 on error.
+ * If (mask & __GFP_WAIT) == 0, there are no error conditions.
  */
 int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		     int bits, int wake, int delete,
@@ -486,11 +514,8 @@ int clear_extent_bit(struct extent_io_tr
 	if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
 		clear = 1;
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
-		prealloc = alloc_extent_state(mask);
-		if (!prealloc)
-			return -ENOMEM;
-	}
+	if (!prealloc && (mask & __GFP_WAIT))
+		prealloc = alloc_extent_state_nofail(mask);
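[One detail worth pulling out of the patch: EXTENT_STATE_POOL_SIZE sets
the pool's reserve as 64 KiB divided by the object size, so min_nr is a
few hundred objects. A quick back-of-the-envelope check follows; the
sizeof value is an assumption for illustration only, since the real
figure depends on architecture and kernel config.]

	#include <stdio.h>

	#define EXTENT_STATE_POOL_SIZE (64*1024)	/* from the patch */

	int main(void)
	{
		/* Hypothetical sizeof(struct extent_state); roughly 100+
		 * bytes on a 64-bit build of that era. */
		size_t assumed_size = 112;

		printf("min_nr = %zu reserved extent_state objects\n",
		       (size_t)EXTENT_STATE_POOL_SIZE / assumed_size);
		return 0;	/* prints min_nr = 585 with this assumption */
	}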