[PATCH 0/2] btrfs: fortification for GFP_NOFS allocations

2015-08-19 Thread mhocko
Hi,
these two patches were sent as a part of a larger RFC which aims at
allowing GFP_NOFS allocations to fail to help sort out memory reclaim
issues bound to the current behavior
(http://marc.info/?l=linux-mm&m=143876830616538&w=2).

It is clear that the move to the new GFP_NOFS behavior is a long-term
plan, but these patches should be good enough even with that change in
place. It also seems that Chris wasn't opposed and would be willing to
take them (http://marc.info/?l=linux-mm&m=143991792427165&w=2), so here
they are. I have rephrased the changelogs so that they no longer refer to
the patch which changes the NOFS behavior.

Just to clarify: these two patches allowed my particular test case
(mentioned in the cover letter referenced above) to survive, but that
doesn't mean that failing GFP_NOFS allocations are OK now. I have seen
some other places where a GFP_NOFS allocation is followed by
BUG_ON(ALLOC_FAILED). I have not hit them, though.
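
As a rough illustration of the difference between a request that may fail
and one that is explicitly marked no-fail, here is a small userspace
sketch. It is not kernel code: sk_alloc(), sk_set_pressure() and the SK_*
flag values are hypothetical stand-ins for the allocator, for memory
pressure and for the GFP flags, and the retry loop only models the idea of
"keep trying until it succeeds".

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag bits for this sketch; not the kernel's real values. */
#define SK_NOFS   0x1u
#define SK_NOFAIL 0x2u

/* Number of upcoming attempts the toy allocator refuses (simulated pressure). */
static unsigned int sk_failures_left;

static void sk_set_pressure(unsigned int failures)
{
	sk_failures_left = failures;
}

/*
 * Toy allocator: without SK_NOFAIL the first refused attempt is reported
 * to the caller as NULL; with SK_NOFAIL the request is retried until an
 * attempt succeeds, so the caller never sees NULL.
 */
static void *sk_alloc(size_t size, unsigned int flags)
{
	for (;;) {
		if (sk_failures_left > 0) {
			sk_failures_left--;	/* this attempt is refused */
		} else {
			void *p = malloc(size);

			if (p)
				return p;
		}
		if (!(flags & SK_NOFAIL))
			return NULL;	/* failable request: give up */
	}
}
```

A caller that only asserts on the result would trip on the first kind of
request as soon as the allocator is under pressure; with SK_NOFAIL the
failure never reaches the caller.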

Let me know if you would prefer other changes.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: Prevent from early transaction abort

2015-08-19 Thread mhocko
From: Michal Hocko mho...@suse.com

Btrfs relies on GFP_NOFS allocations when committing a transaction, but
this allocation context is rather weak wrt. reclaim capabilities. The
page allocator currently tries hard not to fail these allocations if
they are small (<=PAGE_ALLOC_COSTLY_ORDER), so this is not a problem
today, but there is an attempt to move away from the default no-fail
behavior and allow these allocations to fail more eagerly. That would
lead to a premature transaction abort as follows:

[   55.328093] Call Trace:
[   55.328890]  [8154e6f0] dump_stack+0x4f/0x7b
[   55.330518]  [8108fa28] ? console_unlock+0x334/0x363
[   55.332738]  [8110873e] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [81100752] pagecache_get_page+0x10e/0x20c
[   55.336844]  [a007d916] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [a0059d8c] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [a004f728] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [a003fa34] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [a0040567] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [810682d7] ? get_parent_ip+0xe/0x3e
[   55.349434]  [a0041cb2] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [a004ecfb] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [81186808] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [81186833] vfs_fsync+0x1c/0x1e
[   55.370654]  [81186869] do_fsync+0x34/0x4e
[   55.372246]  [81186ab3] SyS_fsync+0x10/0x14
[   55.373851]  [81554f97] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [8154e6f0] dump_stack+0x4f/0x7b
[   55.384357]  [8107f717] ? down_trylock+0x2d/0x37
[   55.384359]  [81046977] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [a00a1d6b] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [81046a34] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [a00a1d6b] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [a004e5f7] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [a004e600] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [81186808] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [81186833] vfs_fsync+0x1c/0x1e
[   55.384593]  [81186869] do_fsync+0x34/0x4e
[   55.384594]  [81186ab3] SyS_fsync+0x10/0x14
[   55.384595]  [81554f97] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by being explicit about the no-fail behavior of this allocation
path and use __GFP_NOFAIL.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/extent_io.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..f4d6eea975d7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,9 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;
 
-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
-	if (eb == NULL)
-		return NULL;
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	eb->start = start;
 	eb->len = len;
 	eb->fs_info = fs_info;
@@ -4867,7 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
   

[PATCH 2/2] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio

2015-08-19 Thread mhocko
From: Michal Hocko mho...@suse.com

alloc_btrfs_bio relies on a GFP_NOFS allocation when committing the
transaction, but this allocation context is rather weak wrt. reclaim
capabilities. The page allocator currently tries hard not to fail these
allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER), but they can
still fail if the _current_ process is the OOM killer victim. Moreover,
there is an attempt to move away from the default no-fail behavior and
allow these allocations to fail more eagerly. This would lead to:

[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

which is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..42b9949dd71d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,9 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 		 * and the stripes
 		 */
 		sizeof(u64) * (total_stripes),
-		GFP_NOFS);
-	if (!bbio)
-		return NULL;
+		GFP_NOFS|__GFP_NOFAIL);
 
 	atomic_set(&bbio->error, 0);
 	atomic_set(&bbio->refs, 1);
-- 
2.5.0



[RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves

2015-08-05 Thread mhocko
From: Michal Hocko mho...@suse.com

__GFP_NOFAIL is a big hammer used to ensure that the allocation
request can never fail. This is a strong requirement and as such
it also deserves a special treatment when the system is OOM. The
primary problem here is that the allocation request might have
come with some locks held and the oom victim might be blocked
on the same locks. This is basically an OOM deadlock situation.

This patch tries to reduce the risk of such deadlocks by giving
__GFP_NOFAIL allocations special treatment and letting them dive into
memory reserves after the OOM killer has been invoked. This should help
them make progress and release the resources they are holding. The OOM
victim should compensate for the consumption of the reserves.
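
The idea of letting only no-fail requests dip below the watermark can be
sketched in userspace as follows. This is a toy model under stated
assumptions, not the page allocator: struct sk_pool, sk_take_page(),
sk_alloc_page() and SK_NOFAIL are hypothetical names, and min_pages merely
plays the role of the watermark that min_free_kbytes controls.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit for this sketch; not the kernel's real value. */
#define SK_NOFAIL 0x1u

struct sk_pool {
	size_t free_pages;	/* pages currently free */
	size_t min_pages;	/* reserve that ordinary requests must not touch */
};

/* Take one page if doing so respects the floor in effect for this request. */
static bool sk_take_page(struct sk_pool *pool, bool ignore_watermark)
{
	size_t floor = ignore_watermark ? 0 : pool->min_pages;

	if (pool->free_pages <= floor)
		return false;
	pool->free_pages--;
	return true;
}

/*
 * Ordinary requests stop at the watermark. A no-fail request gets a
 * second attempt that ignores the watermark; in the real allocator this
 * attempt would come after reclaim and the OOM killer invocation, which
 * this sketch does not model.
 */
static bool sk_alloc_page(struct sk_pool *pool, unsigned int flags)
{
	if (sk_take_page(pool, false))
		return true;
	if (flags & SK_NOFAIL)
		return sk_take_page(pool, true);
	return false;
}
```

Once the reserve itself is empty even the no-fail request comes back
empty-handed in this model, which is the situation a warning about an
unfulfilled no-fail allocation would report.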

Signed-off-by: Michal Hocko mho...@suse.com
---
 mm/page_alloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f9ffbb087cb..ee69c338ca2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2732,8 +2732,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
-			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
+
+		if (gfp_mask & __GFP_NOFAIL) {
+			page = get_page_from_freelist(gfp_mask, order,
+					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+			WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
+				    " Consider increasing min_free_kbytes.\n");
+		}
+	}
 out:
 	mutex_unlock(&oom_lock);
 	return page;
-- 
2.5.0



[RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio

2015-08-05 Thread mhocko
From: Michal Hocko mho...@suse.com

alloc_btrfs_bio relies on GFP_NOFS to allocate a bio, but since "mm:
page_alloc: do not lock up GFP_NOFS allocations upon OOM" this allocation
is allowed to fail, which can lead to:
[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..57a99d19533d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,7 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 		 * and the stripes
 		 */
 		sizeof(u64) * (total_stripes),
-		GFP_NOFS);
+		GFP_NOFS|__GFP_NOFAIL);
 	if (!bbio)
 		return NULL;
 
-- 
2.5.0



[RFC 7/8] btrfs: Prevent from early transaction abort

2015-08-05 Thread mhocko
From: Michal Hocko mho...@suse.com

Btrfs relies on GFP_NOFS allocations when committing a transaction, but
since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
those allocations are allowed to fail, which can lead to a premature
transaction abort:

[   55.328093] Call Trace:
[   55.328890]  [8154e6f0] dump_stack+0x4f/0x7b
[   55.330518]  [8108fa28] ? console_unlock+0x334/0x363
[   55.332738]  [8110873e] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [81100752] pagecache_get_page+0x10e/0x20c
[   55.336844]  [a007d916] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [a0059d8c] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [a004f728] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [a003fa34] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [a0040567] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [810682d7] ? get_parent_ip+0xe/0x3e
[   55.349434]  [a0041cb2] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [a004ecfb] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [81186808] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [81186833] vfs_fsync+0x1c/0x1e
[   55.370654]  [81186869] do_fsync+0x34/0x4e
[   55.372246]  [81186ab3] SyS_fsync+0x10/0x14
[   55.373851]  [81554f97] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [8154e6f0] dump_stack+0x4f/0x7b
[   55.384357]  [8107f717] ? down_trylock+0x2d/0x37
[   55.384359]  [81046977] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [a00a1d6b] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [81046a34] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [a00a1d6b] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [a004e5f7] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [a004e600] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [a00512ea] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [a0060221] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [a0060e21] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [a0073428] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [81186808] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [81186833] vfs_fsync+0x1c/0x1e
[   55.384593]  [81186869] do_fsync+0x34/0x4e
[   55.384594]  [81186ab3] SyS_fsync+0x10/0x14
[   55.384595]  [81554f97] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path
with the explicit __GFP_NOFAIL.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/btrfs/extent_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..88fad7051e38 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,7 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;
 
-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	if (eb == NULL)
 		return NULL;
 	eb->start = start;
@@ -4867,7 +4867,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;
 
 	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS);
+		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 		if (!p)
 			goto free_eb;
 
-- 
2.5.0


[RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

2015-08-05 Thread mhocko
From: Michal Hocko mho...@suse.com

A journal transaction might fail prematurely because the frozen_buffer
is allocated by a GFP_NOFS request:
[   72.440013] do_get_write_access: OOM for frozen_buffer
[   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[   72.495559] do_get_write_access: OOM for frozen_buffer
[   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.496839] do_get_write_access: OOM for frozen_buffer
[   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.505766] Aborting journal on device sda1-8.
[   72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
allocations upon OOM" because small GFP_NOFS allocations never failed.
This allocation seems essential for the journal and GFP_NOFS is too
restrictive for the memory allocator, so let's use __GFP_NOFAIL here to
emulate the previous behavior.

jbd code has the very same issue so let's do the same there as well.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
 					jbd_alloc(jh2bh(jh)->b_size,
-							 GFP_NOFS);
-				if (!frozen_buffer) {
-					printk(KERN_ERR
-					       "%s: OOM for frozen_buffer\n",
-					       __func__);
-					JBUFFER_TRACE(jh, "oom!");
-					error = -ENOMEM;
-					jbd_lock_bh_state(bh);
-					goto done;
-				}
+							 GFP_NOFS|__GFP_NOFAIL);
 				goto repeat;
 			}
 			jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
 					jbd2_alloc(jh2bh(jh)->b_size,
-							 GFP_NOFS);
-				if (!frozen_buffer) {
-					printk(KERN_ERR
-					       "%s: OOM for frozen_buffer\n",
-					       __func__);
-					JBUFFER_TRACE(jh, "oom!");
-					error = -ENOMEM;
-					jbd_lock_bh_state(bh);
-					goto done;
-				}
+							 GFP_NOFS|__GFP_NOFAIL);
 				goto repeat;
 			}
 			jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
-- 
2.5.0



[RFC 6/8] ext3: Do not abort journal prematurely

2015-08-05 Thread mhocko
From: Michal Hocko mho...@suse.com

journal_get_undo_access relies on a GFP_NOFS allocation, yet that
allocation is essential for the journal transaction:

[   83.256914] journal_get_undo_access: No memory for committed data
[   83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out of memory in __ext3_journal_get_undo_access
[   83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of memory
[   83.267130] Aborting journal on device hdb1.
[   83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected aborted journal
[   83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
these allocation requests are allowed to fail, so we need to use
__GFP_NOFAIL to imitate the previous behavior.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/jbd/transaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index bf7474deda2f..6c60376a29bc 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -887,7 +887,7 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
-- 
2.5.0



[RFC 5/8] ext4: Do not fail journal due to block allocator

2015-08-05 Thread mhocko
From: Michal Hocko mho...@suse.com

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" the
memory allocator doesn't loop endlessly to satisfy low-order allocations
and instead fails them to allow callers to handle them gracefully.

Some of the callers are not yet prepared for this behavior, though. The
ext4 block allocator relies solely on GFP_NOFS allocation requests, and
allocation failures lead to aborting the journal too easily:

[  345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[  345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: GW   4.0.0-nofs3-6-gdfe9931f5f68 #588
[  345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[  345.028339]   880005a17708 81538a54 
8107a40f
[  345.028341]  0050 880005a17798 810fe854 
00018000
[  345.028342]  0046  81a52100 
0246
[  345.028343] Call Trace:
[  345.028348]  [81538a54] dump_stack+0x4f/0x7b
[  345.028370]  [810fe854] warn_alloc_failed+0x12a/0x13f
[  345.028373]  [81101bd2] __alloc_pages_nodemask+0x7f3/0x8aa
[  345.028375]  [810f9933] pagecache_get_page+0x12a/0x1c9
[  345.028390]  [a005bc64] ext4_mb_load_buddy+0x220/0x367 [ext4]
[  345.028414]  [a006014f] ext4_free_blocks+0x522/0xa4c [ext4]
[  345.028425]  [a0054e14] ext4_ext_remove_space+0x833/0xf22 [ext4]
[  345.028434]  [a005677e] ext4_ext_truncate+0x8c/0xb0 [ext4]
[  345.028441]  [a00342bf] ext4_truncate+0x20b/0x38d [ext4]
[  345.028462]  [a003573c] ext4_evict_inode+0x32b/0x4c1 [ext4]
[  345.028464]  [8116d04f] evict+0xa0/0x148
[  345.028466]  [8116dca8] iput+0x1a1/0x1f0
[  345.028468]  [811697b4] __dentry_kill+0x136/0x1a6
[  345.028470]  [81169a3e] dput+0x21a/0x243
[  345.028472]  [81157cda] __fput+0x184/0x19b
[  345.028473]  [81157d29] fput+0xe/0x10
[  345.028475]  [8105a05f] task_work_run+0x8a/0xa1
[  345.028477]  [810452f0] do_exit+0x3c6/0x8dc
[  345.028482]  [8104588a] do_group_exit+0x4d/0xb2
[  345.028483]  [8104eeeb] get_signal+0x5b1/0x5f5
[  345.028488]  [81002202] do_signal+0x28/0x5d0
[...]
[  345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[  345.033097] Aborting journal on device hdb1-8.
[  345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[  345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[  345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[  345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[  345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[  345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context
is very restricted, especially under fs-metadata-heavy loads. Before we
go with a more sophisticated solution, let's simply imitate the previous
behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the
buddy block allocator. I wasn't able to trigger the issue with this
patch anymore.

Signed-off-by: Michal Hocko mho...@suse.com
---
 fs/ext4/mballoc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..e6361622bfd5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block = group * 2;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 
 	block++;
 	pnum = block / blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb, 

[RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM

2015-08-05 Thread mhocko
From: Johannes Weiner han...@cmpxchg.org

GFP_NOFS allocations are not allowed to invoke the OOM killer since
their reclaim abilities are severely diminished.  However, without the
OOM killer available there is no hope of progress once the reclaimable
pages have been exhausted.

Don't risk hanging these allocations.  Leave it to the allocation site
to implement the fallback policy for failing allocations.
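
One possible shape of a fallback policy at the allocation site is
sketched below in userspace C. Everything in it is an assumption made for
illustration: sk_alloc_with_fallback() and the two toy allocators are
hypothetical, and "retry with a smaller buffer, then report -ENOMEM" is
only one policy a caller might choose, not something this patch
prescribes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*sk_alloc_fn)(size_t size);

/* Toy allocator standing in for memory pressure: only small requests succeed. */
static void *sk_alloc_small_only(size_t size)
{
	return size <= 1024 ? malloc(size) : NULL;
}

/* Toy allocator that always fails. */
static void *sk_alloc_never(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Try the preferred size first and halve it on every failure, never going
 * below min. Returns 0 and fills *out and *got on success, or -ENOMEM so
 * that the caller can unwind instead of waiting inside the allocator.
 */
static int sk_alloc_with_fallback(sk_alloc_fn alloc, size_t want, size_t min,
				  void **out, size_t *got)
{
	for (size_t size = want; size > 0 && size >= min; size /= 2) {
		void *p = alloc(size);

		if (p) {
			*out = p;
			*got = size;
			return 0;
		}
	}
	*out = NULL;
	*got = 0;
	return -ENOMEM;
}
```

The point of the sketch is only that a policy like this becomes possible
once the allocator reports the failure; while the request loops inside
the allocator the caller never gets the chance to apply it.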

Signed-off-by: Johannes Weiner han...@cmpxchg.org
Signed-off-by: Michal Hocko mho...@suse.com
---
 mm/page_alloc.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee69c338ca2a..024d45d51700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2715,15 +2715,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
+		if (!(gfp_mask & __GFP_FS))
 			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
-- 
2.5.0



[RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation

2015-08-05 Thread mhocko
From: Michal Hocko mho...@suse.com

page_cache_read has historically been using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
prevent fs recursion issues. page_cache_read is, however, not called
from the fs layer directly, so it doesn't normally need this
protection.

ceph and ocfs2 which call filemap_fault from their fault handlers
seem to be OK because they are not taking any fs lock before invoking
generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe
from the reclaim recursion POV because this lock serializes truncate
and punch hole with the page faults and it doesn't get involved in the
reclaim.

The GFP_NOFS protection might even be harmful. There is a push to fail
GFP_NOFS allocations rather than loop within the allocator indefinitely
with a very limited reclaim ability. Once we start failing those
requests, the OOM killer might be triggered prematurely because the page
cache allocation failure is propagated up the page fault path and ends
up in pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy
wrt. parallel page faults and it might interfere with other users who
really rely on the NOFS semantic of the stored gfp_mask. The mask is also
a property of the inode, so changing it would even be a layering
violation. What we can do instead is push the gfp_mask into struct
vm_fault and allow the fs layer to overwrite it should the callback need
to be called with a different allocation context.

Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this
should normally be safe from the page fault path. Why do we care
about mapping_gfp_mask at all then? Because it doesn't hold only
reclaim protection flags; it might also contain zone and movability
restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we have to respect
those.
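
The effect of OR-ing the mapping's mask with the IO and FS bits can be
shown with a few lines of userspace C. The bit values and the SK_* and
sk_fault_gfp_mask() names are hypothetical and chosen only for this
sketch; the kernel's own flag definitions differ.

```c
/* Hypothetical bit values for this sketch; not the kernel's definitions. */
#define SK_GFP_IO	0x01u	/* may start I/O during reclaim */
#define SK_GFP_FS	0x02u	/* may recurse into the filesystem */
#define SK_GFP_DMA32	0x04u	/* zone restriction */
#define SK_GFP_MOVABLE	0x08u	/* movability hint */
#define SK_GFP_WAIT	0x10u	/* may sleep */
#define SK_GFP_IOFS	(SK_GFP_IO | SK_GFP_FS)

/*
 * Default mask for the fault path in this model: keep whatever the
 * mapping asked for (including zone and movability bits) and add the
 * IO and FS reclaim bits on top.
 */
static unsigned int sk_fault_gfp_mask(unsigned int mapping_mask)
{
	return mapping_mask | SK_GFP_IOFS;
}
```

For a NOFS-style mapping mask such as SK_GFP_WAIT | SK_GFP_IO |
SK_GFP_DMA32, the result gains SK_GFP_FS while SK_GFP_DMA32 stays set,
which is the property the paragraph above relies on.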

Reported-by: Tetsuo Handa penguin-ker...@i-love.sakura.ne.jp
Signed-off-by: Michal Hocko mho...@suse.com
---
 include/linux/mm.h |  4 ++++
 mm/filemap.c       |  9 ++++-----
 mm/memory.c        | 17 +++++++++++++++++
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..962e37c7cd6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -220,10 +220,14 @@ extern pgprot_t protection_map[16];
  * ->fault function. The vma's ->fault is responsible for returning a bitmask
  * of VM_FAULT_xxx flags that give details about how the fault was handled.
  *
+ * MM layer fills up gfp_mask for page allocations but fault handler might
+ * alter it if its implementation requires a different allocation context.
+ *
  * pgoff should be used in favour of virtual_address, if possible.
  */
 struct vm_fault {
 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
+	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index b63fb81df336..8a16a07bbe02 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1774,19 +1774,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page;
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
 		if (!page)
 			return -ENOMEM;
 
-		ret = add_to_page_cache_lru(page, mapping, offset,
-				GFP_KERNEL & mapping_gfp_mask(mapping));
+		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
@@ -1969,7 +1968,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, vmf->gfp_mask);
 
 	/*
 	 * The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index 8a2fc9945b46..25ab29560dca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1949,6 +1949,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 	copy_user_highpage(dst, src, va, vma);
 }
 
+static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
+{
+   struct file 

[RFC 0/8] Allow GFP_NOFS allocation to fail

2015-08-05 Thread mhocko
Hi,
small GFP_NOFS allocations, like GFP_KERNEL ones, have traditionally not
failed even though their reclaim capabilities are restricted because the
VM code cannot recurse into filesystems to clean dirty pages. At the same
time, these allocation requests are not allowed to trigger the OOM killer
because that would lead to premature OOM killing during heavy fs
metadata workloads.

This leaves the VM code in an unfortunate situation where GFP_NOFS
requests loop inside the allocator, relying on somebody else to make
progress on their behalf. This is prone to deadlocks when the request is
holding resources which are necessary for another task to make progress
and release memory (e.g. the OOM victim is blocked on a lock held by the
NOFS request). Another drawback is that the caller of the allocator
cannot define any fallback strategy because the request doesn't fail.

As the VM cannot do much about these requests we should face reality
and allow those allocations to fail. Johannes has already posted a
patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2)
but the discussion died pretty quickly.

I have been playing with this patch and xfs, ext[34] and btrfs for a
while to see what the effect is under heavy memory pressure. As expected,
this led to some fallout.
My test consisted of a simple memory hog which allocates a lot of
anonymous memory and writes to a fs, mainly to trigger fs activity on
exit. In parallel there is an fs metadata load (multiple tasks
creating thousands of empty files and directories). Everything runs
in a VM with a small amount of memory to emulate an underprovisioned
system. The metadata load is heavy enough to invoke direct reclaim even
without the memory hog. The memory hog forks several tasks sharing the
VM and the OOM killer manages to kill it without locking up the system
(this was based on the test case from Tetsuo Handa -
http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't
want to kill my machine ;)).
With all the patches applied, none of the 4 filesystems gets aborted
transactions or an RO remount (well, xfs didn't need any special
treatment). This is obviously not sufficient to claim that failing
GFP_NOFS is OK now, but I think it is a good start for further
discussion. I would be grateful if FS people could have a look at those
patches.  I have simply used __GFP_NOFAIL in the critical paths. This
might not be the best strategy but it sounds like a good first step.

The first patch in the series also allows __GFP_NOFAIL allocations to
access memory reserves when the system is OOM, which should help those
requests make forward progress - especially in combination with
GFP_NOFS.

The second patch tries to address a potential premature OOM killer
invocation from the page fault path. I have posted it separately but it
didn't get much traction.

The third patch allows GFP_NOFS to fail and I believe it should see much
more testing coverage. It would be really great if it could sit in the
mmotm tree for a few release cycles so that we can catch more fallout.

The rest are the FS-specific patches to fortify allocation requests
which are really needed to finish transactions without RO remounts.
There might be more needed but my test case survives with these in
place.
They would obviously need some rewording if they are going to be applied
even without Patch3, and I will do that if the respective maintainers
take them. Ext3 and JBD are going away soon so they might be dropped, but
they have been in the tree while I was testing so I've kept them.

Thoughts? Opinions?
