Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
[...]
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f54..4dc497d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
>  {
> int cpu;
> u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> +   int node, nr;
> 
> BUG_ON(cpu_to_apicid == NULL);
> +   nr = cpumask_weight(cpu_possible_mask);
> +
> +   /*
> +    * Bring up all possible nodes, since dev->numa_node may refer
> +    * to one of them. (Should also check that ACPI marks the node
> +    * as possible.)
> +    */
> +   for_each_node(node)
> +   if (!node_online(node))
> +   init_memory_less_node(node);

I suspect there is no change if you replace for_each_node by
for_each_node_mask(nid, node_possible_map)

here. If that is the case then we are probably calling
free_area_init_node too early. I do not see it yet though.
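
For illustration, a minimal sketch of the suggested experiment, reusing
the init_memory_less_node() helper from the hunk above:

	int nid;

	/* restrict the loop to the nodes marked possible */
	for_each_node_mask(nid, node_possible_map)
		if (!node_online(nid))
			init_memory_less_node(nid);

for_each_node() already iterates node_possible_map, so no behavioral
change is expected; the point is to rule the iterator out.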
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> On Fri, Dec 7, 2018 at 10:22 PM Michal Hocko  wrote:
> >
> > On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
> > [...]
> > > Hi Michal,
> > >
> > > As I mentioned in my previous email, I have manually applied the patch,
> > > but it does not work for a normal bootup.
> >
> > I am sorry, I have misread your previous response. Is there anything
> > interesting on the serial console by any chance?
> 
> Nothing. It needs more effort to debug. But as I mentioned, enabling all
> of the remaining possible nodes makes it work. Maybe that can be of some
> help to you.

I will have a look. Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
[...]
> Hi Michal,
> 
> As I mentioned in my previous email, I have manually applied the patch,
> but it does not work for a normal bootup.

I am sorry, I have misread your previous response. Is there anything
interesting on the serial console by any chance?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 17:40:09, Pingfan Liu wrote:
> On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko  wrote:
> >
> > On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> > [...]
> > > In short, the fix should consider two factors:
> > > the semantics of online nodes and the effect on all archs
> >
> > I am pretty sure there is a lot of room for unification in this area.
> > Nevertheless I strongly believe the bug should be fixed first in the
> > simplest way and all the cleanup should be done on top.
> >
> > Do I get it right that the diff worked for you and I can prepare a full
> > patch?
> >
> Sure, I am glad to test you new patch.

From 46e68be89d9c299fd497b2b8bea3f2add144f17f Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 7 Dec 2018 12:23:32 +0100
Subject: [PATCH] x86, numa: always initialize all possible nodes

Pingfan Liu has reported the following splat
[5.772742] BUG: unable to handle kernel paging request at 2088
[5.773618] PGD 0 P4D 0
[5.773618] Oops:  [#1] SMP NOPTI
[5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
06/29/2018
[5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 
ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 
0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
e1 44 89 e6 89
[5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
[5.773618] RAX:  RBX: 006012c0 RCX: 
[5.773618] RDX:  RSI: 0002 RDI: 2080
[5.773618] RBP: 006012c0 R08:  R09: 0002
[5.773618] R10: 006080c0 R11: 0002 R12: 
[5.773618] R13: 0001 R14:  R15: 0002
[5.773618] FS:  () GS:8c69afe0() 
knlGS:
[5.773618] CS:  0010 DS:  ES:  CR0: 80050033
[5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 003406e0
[5.773618] Call Trace:
[5.773618]  new_slab+0xa9/0x570
[5.773618]  ___slab_alloc+0x375/0x540
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  __slab_alloc+0x1c/0x38
[5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  devm_kmalloc+0x28/0x60
[5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  really_probe+0x73/0x420
[5.773618]  driver_probe_device+0x115/0x130
[5.773618]  __driver_attach+0x103/0x110
[5.773618]  ? driver_probe_device+0x130/0x130
[5.773618]  bus_for_each_dev+0x67/0xc0
[5.773618]  ? klist_add_tail+0x3b/0x70
[5.773618]  bus_add_driver+0x41/0x260
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  driver_register+0x5b/0xe0
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  do_one_initcall+0x4e/0x1d4
[5.773618]  ? init_setup+0x25/0x28
[5.773618]  kernel_init_freeable+0x1c1/0x26e
[5.773618]  ? loglevel+0x5b/0x5b
[5.773618]  ? rest_init+0xb0/0xb0
[5.773618]  kernel_init+0xa/0x110
[5.773618]  ret_from_fork+0x22/0x40
[5.773618] Modules linked in:
[5.773618] CR2: 2088
[5.773618] ---[ end trace 1030c9120a03d081 ]---

with his AMD machine with the following topology
  NUMA node0 CPU(s): 0,8,16,24
  NUMA node1 CPU(s): 2,10,18,26
  NUMA node2 CPU(s): 4,12,20,28
  NUMA node3 CPU(s): 6,14,22,30
  NUMA node4 CPU(s): 1,9,17,25
  NUMA node5 CPU(s): 3,11,19,27
  NUMA node6 CPU(s): 5,13,21,29
  NUMA node7 CPU(s): 7,15,23,31

[0.007418] Early memory node ranges
[0.007419]   node   1: [mem 0x1000-0x0008efff]
[0.007420]   node   1: [mem 0x0009-0x0009]
[0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
[0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
[0.007423]   node   1: [mem 0x6c528000-0x6fff]
[0.007424]   node   1: [mem 0x0001-0x00047fff]
[0.007425]   node   5: [mem 0x00048000-0x00087eff]

and nr_cpus set to 4. The underlying reason is that the device is bound
to node 2, which doesn't have any memory, and init_cpu_to_node only
initializes memory-less nodes for possible cpus, which nr_cpus restricts.
This in turn means that proper zonelists are not allocated and the page
allocator blows up.

Fix the issue by moving init_memory_less_node into numa_register_memblks
and always initializing all possible nodes consistently in a single place.
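
For reference, the crash path boils down to the following (upstream
node_zonelist() from include/linux/gfp.h, slightly simplified): for a
node that was never initialized, NODE_DATA(nid) is NULL and the
dereference below is the invalid access in the splat above.

	static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
	{
		/* NODE_DATA(nid) is NULL for an uninitialized node */
		return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
	}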

Reported-by: Pingfan Liu 
Signed-off-by: Michal Hocko 
---
 arch/x86/mm/numa.c | 33 +++--
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --g

Re: [RFC PATCH 0/3] THP eligibility reporting via proc

2018-12-07 Thread Michal Hocko
On Tue 20-11-18 11:35:12, Michal Hocko wrote:
> Hi,
> this series of three patches aims at making THP eligibility reporting
> much more robust and long term sustainable. The trigger for the change
> is a regression report [1] and the long follow up discussion. In short
> the specific application didn't have good API to query whether a particular
> mapping can be backed by THP so it has used VMA flags to workaround that.
> These flags represent a deep internal state of VMAs and as such they should
> be used by userspace with a great deal of caution.
> 
> A similar thing happened with [2], when users complained that VM_MIXEDMAP is
> no longer set on DAX mappings. Again, a lack of a proper API led to an
> abuse.
> 
> The first patch in the series tries to emphasise that the semantics
> of these flags might change and any application consuming them should be
> really careful.
> 
> The remaining two patches provide a more suitable interface to address [1]
> and provide a consistent API to query the THP status both for each VMA
> and process wide as well.

Are there any other comments on these? I haven't heard any pushback so
far so I will re-send with RFC dropped early next week.

> 
> [1] 
> http://lkml.kernel.org/r/alpine.deb.2.21.1809241054050.224...@chino.kir.corp.google.com
> [2] http://lkml.kernel.org/r/20181002100531.gc4...@quack2.suse.cz
> 

-- 
Michal Hocko
SUSE Labs


Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-07 Thread Michal Hocko
[...]
> +  * If the policy is interleave, or does not allow the current
> +  * node in its nodemask, we allocate the standard way.
> +  */
> + if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
> + hpage_node = pol->v.preferred_node;
> +
> + nmask = policy_nodemask(gfp, pol);
> + if (!nmask || node_isset(hpage_node, *nmask)) {
> + mpol_cond_put(pol);
> + page = __alloc_pages_node(hpage_node,
> + gfp | __GFP_THISNODE, order);
> + goto out;
> + }
> +     }
> +
>   nmask = policy_nodemask(gfp, pol);
>   preferred_nid = policy_node(gfp, pol, node);
>   page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
> diff --git a/mm/shmem.c b/mm/shmem.c
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1439,7 +1439,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>  
>   shmem_pseudo_vma_init(&pvma, info, hindex);
>   page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
> - HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
> + HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
>   shmem_pseudo_vma_destroy(&pvma);
>   if (page)
>   prep_transhuge_page(page);

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Michal Hocko
On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
[...]
> In short, the fix should consider two factors:
> the semantics of online nodes and the effect on all archs

I am pretty sure there is a lot of room for unification in this area.
Nevertheless I strongly believe the bug should be fixed first in the
simplest way and all the cleanup should be done on top.

Do I get it right that the diff worked for you and I can prepare a full
patch?

-- 
Michal Hocko
SUSE Labs


Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 20:31:46, Linus Torvalds wrote:
> [ Oops. different thread for me due to edited subject, so I saw this
> after replying to the earlier email by David ]

Sorry about that, but I really wanted to keep the actual discussion about
semantics clearly separated, because the thread had just grown too large
with back and forth that didn't lead anywhere.

> On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko  wrote:
> >
> > MADV_HUGEPAGE changes the picture because the caller expressed a need
> > for THP and is willing to go extra mile to get it.
> 
> Actually, I think MADV_HUGEPAGE should just be
> "TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

Yes, that is the case and I didn't want to make the description more
complicated than necessary so I've focused only on the current default.
But historically we have treated defrag=always and MADV_HUGEPAGE the
same.

[...]
> >I believe that something like the below would be sensible
> > 1) THP on a local node with compaction not giving up too early
> > 2) THP on a remote node in NOWAIT mode - so no direct
> >compaction/reclaim (trigger kswapd/kcompactd only for
> >defrag=defer+madvise)
> > 3) fallback to the base page allocation
> 
> That doesn't sound insane to me. That said, the numbers David quoted
> do fairly strongly imply that local small-pages are actually preferred
> to any remote THP pages.

As I and others pointed out elsewhere, the remote penalty is just a part
of the picture and on its own might be quite misleading. There are other
aspects (TLB pressure, page table overhead, etc.) that might amortize the
access latency.

> But *that* in turn makes for other possible questions:
> 
>  - if the reason we couldn't get a local hugepage is that we're simply
> out of local memory (huge *or* small), then maybe a remote hugepage is
> better.
> 
>Note that this now implies that the choice can be an issue of "did
> the hugepage allocation fail due to fragmentation, or due to the node
> being low of memory"

How exactly do you tell? Many systems are simply low on memory due to
caching. A clean pagecache is quite cheap to reclaim but it can be more
expensive to fault in. Do we consider it to be a viable target?

> 
> and there is the other question that I asked in the other thread
> (before subject edit):
> 
>  - how local is the load to begin with?
> 
>Relatively shortlived processes - or processes that are explicitly
> bound to a node - might have different preferences than some
> long-lived process where the CPU bounces around, and might have
> different trade-offs for the local vs remote question too.

Agreed

> So just based on David's numbers, and some wild handwaving on my part,
> a slightly more complex, but still very sensible default might be
> something like
> 
>  1) try to do a cheap local node hugepage allocation
> 
> Rationale: everybody agrees this is the best case.
> 
> But if that fails:
> 
>  2) look at compacting and the local node, but not very hard.
> 
> If there's lots of memory on the local node, but synchronous
> compaction doesn't do anything easily, just fall back to small pages.

Do we reclaim at this stage or is this mostly a GFP_NOWAIT attempt?

> Rationale: local memory is generally more important than THP.
> 
> If that fails (ie local node is simply low on memory):
> 
>  3) Try to do remote THP allocation
> 
>  Rationale: Ok, we simply didn't have a lot of local memory, so
> it's not just a question of fragmentation. If it *had* been
> fragmentation, lots of small local pages would have been better than a
> remote THP page.
> 
>  Oops, remote THP allocation failed (possibly after synchronous
> remote compaction, but maybe this is where we do kcompactd).
> 
>  4) Just do any small page, and do reclaim etc. THP isn't happening,
> and it's not a priority when you're starting to feel memory pressure.

If 2) doesn't reclaim heavily (e.g. only tries to reclaim clean page
cache) or is even NOWAIT (which would be even better) then I _think_
this sounds sane.

> In general, I really would want to avoid magic kernel command lines
> (or sysfs settings, or whatever) making a huge difference in behavior.
> So I really wish people would see the whole
> 'transparent_hugepage_flags' thing as a way for kernel developers to
> try different settings, not as a way for users to tune their loads.
> 
> Our default should work as sane defaults, we shouldn't have a "ok,
> let's have this sysfs tunable and let people make their own
> decisions". That's a cop-out.

Agreed. I cannot say I am happy with all the ways THP can be tuned. It
is quite confusing to say the least.

-- 
Michal Hocko
SUSE Labs


Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 15:49:04, David Rientjes wrote:
> On Thu, 6 Dec 2018, Michal Hocko wrote:
> 
> > MADV_HUGEPAGE changes the picture because the caller expressed a need
> > for THP and is willing to go extra mile to get it. That involves
> > allocation latency and as of now also a potential remote access. We do
> > not have complete agreement on the later but the prevailing argument is
> > that any strong NUMA locality is just reinventing node-reclaim story
> > again or makes THP success rate down the toilet (to quote Mel). I agree
> > that we do not want to fallback to a remote node overeagerly. I believe
> > that something like the below would be sensible
> > 1) THP on a local node with compaction not giving up too early
> > 2) THP on a remote node in NOWAIT mode - so no direct
> >compaction/reclaim (trigger kswapd/kcompactd only for
> >defrag=defer+madvise)
> > 3) fallback to the base page allocation
> > 
> 
> I disagree that MADV_HUGEPAGE should take on any new semantic that 
> overrides the preference of node local memory for a hugepage, which is the 
> nearly four year behavior.  The order of MADV_HUGEPAGE preferences listed 
> above would cause current users to regress who rely on local small page 
> fallback rather than remote hugepages because the access latency is much 
> better.  I think the preference of remote hugepages over local small pages 
> needs to be expressed differently to prevent regression.

Such a model would be broken. It doesn't provide consistent semantics and
leads to surprising results. MADV_HUGEPAGE with local node binding will
not prevent remote base pages from being used and you are back to square
one.

It has been a huge mistake to merge your __GFP_THISNODE patch back then
in 4.1. Especially with an absolute lack of numbers for a variety of
workloads. I still believe we can do better and offer a sane mem policy to
help workloads with higher locality demands, but it is outright wrong
to conflate demand for THP with the locality semantics.

If this is absolutely no go then we need a MADV_HUGEPAGE_SANE...

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 18:44:03, Pingfan Liu wrote:
> On Thu, Dec 6, 2018 at 6:03 PM Pingfan Liu  wrote:
[...]
> > Which commit does this patch apply to? I cannot apply it to the latest
> > linux tree.
> >
> I applied it manually and will see the test result. I think it should
> work since you instantiate all the nodes.
> But there are two things worth considering:
> -1st. why does x86 not bring up all nodes by default? It would
> apparently be simpler that way

What do you mean? Why didn't it bring them up before? Or do you see some
nodes not being brought up after this patch?

> -2nd. there are other archs, do they obey the rules?

I am afraid that each arch does its own initialization.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 09:15:53, Naoya Horiguchi wrote:
> On Thu, Dec 06, 2018 at 09:32:06AM +0100, Michal Hocko wrote:
> > On Thu 06-12-18 05:21:38, Naoya Horiguchi wrote:
> > > On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> > > > On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> > > > [...]
> > > > > After some more thinking I am not really sure the above reasoning is
> > > > > still true with the current upstream kernel. Maybe I just managed to
> > > > > confuse myself so please hold off on this patch for now. Testing by
> > > > > Oscar has shown this patch is helping but the changelog might need
> > > > > to be updated.
> > > > 
> > > > OK, so Oscar has nailed it down and it seems that the 4.4 kernel we have
> > > > been debugging on behaves slightly differently. The underlying problem is
> > > > the same though. So I have reworded the changelog and added "just in
> > > > case" PageLRU handling. Naoya, maybe you have an argument that would
> > > > make this void for current upstream kernels.
> > > 
> > > The following commit (not in 4.4.x stable tree) might explain the
> > > difference you experienced:
> > > 
> > >   commit 286c469a988fbaf68e3a97ddf1e6c245c1446968
> > >   Author: Naoya Horiguchi 
> > >   Date:   Wed May 3 14:56:22 2017 -0700
> > >
> > >   mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
> > > 
> > > This commit adds shake_page() for mlocked pages to make sure that the
> > > target page is flushed out of the LRU cache. Without this shake_page(),
> > > the subsequent delete_from_lru_cache() (from me_pagecache_clean()) fails
> > > to isolate it and the page finally returns to the LRU list. So this
> > > scenario leads to a "hwpoisoned but still linked to the LRU list" page.
> > 
> > OK, I see. So does that mean that the LRU handling is no longer needed
> > and there is a guarantee that all kernels with the above commit cannot
> > ever get an LRU page?
> 
> Theoretically no such guarantee, because try_to_unmap() doesn't have a
> guarantee of success and then memory_failure() returns immediately
> when hwpoison_user_mappings fails.
> Or the following code (comes after hwpoison_user_mappings block) also implies
> that the target page can still have PageLRU flag.
> 
> 	/*
> 	 * Torn down by someone else?
> 	 */
> 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
> 		action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
> 		res = -EBUSY;
> 		goto out;
> 	}
> 
> So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
> current version of your patch.
> 
> Feel free to add my ack.
> 
> Acked-by: Naoya Horiguchi 

Thanks a lot Naoya! I will extend the changelog with your wording.

-- 
Michal Hocko
SUSE Labs


[PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-06 Thread Michal Hocko
From: Michal Hocko 

We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding on a 4.4 base kernel. The
underlying reason was that the HWPoison page has an elevated reference
count and the migration keeps failing. There are two problems with that.
First of all it is dubious to migrate the poisoned page because we know
that accessing that memory may fail. Secondly it doesn't make any
sense to migrate potentially broken content and preserve the memory
corruption in a new location.

Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase
===

/* Standard headers added so the testcase compiles standalone;
 * a PAGE_SIZE of 4096 is assumed, matching the report. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	int ret;
	int i;
	int fd;
	char *array = malloc(4096);
	char *array_locked = malloc(4096);

	fd = open("/tmp/data", O_RDONLY);
	read(fd, array, 4095);

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	/* note: sizeof(array_locked) is the pointer size; kept as in the
	 * original report */
	ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked),
		    sizeof(array_locked));
	if (ret)
		perror("mlock");

	sleep(20);

	ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096,
		      MADV_HWPOISON);
	if (ret)
		perror("madvise");

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	return 0;
}
===

+ offline this memory.

In 4.4 kernels he saw the hwpoisoned page returned to the LRU
list
kernel:  [] dump_trace+0x59/0x340
kernel:  [] show_stack_log_lvl+0xea/0x170
kernel:  [] show_stack+0x21/0x40
kernel:  [] dump_stack+0x5c/0x7c
kernel:  [] warn_slowpath_common+0x81/0xb0
kernel:  [] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [] pagevec_lru_move_fn+0xad/0x100
kernel:  [] __lru_cache_add+0x6c/0xb0
kernel:  [] add_to_page_cache_lru+0x46/0x70
kernel:  [] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [] __do_page_cache_readahead+0x177/0x200
kernel:  [] ondemand_readahead+0x168/0x2a0
kernel:  [] generic_file_read_iter+0x41f/0x660
kernel:  [] __vfs_read+0xcd/0x140
kernel:  [] vfs_read+0x7a/0x120
kernel:  [] kernel_read+0x3b/0x50
kernel:  [] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [] do_execve+0x28/0x30
kernel:  [] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [] ret_from_fork+0x55/0x80

And that later confuses the hotremove path because migration of the LRU
page is attempted and fails due to an elevated reference
count. It is quite possible that the reuse of the HWPoisoned page is
some kind of fixed race condition but I am not really sure about that.

With the upstream kernel the failure is slightly different. The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails
and do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.

Fix both cases by explicitly checking for HWPoisoned pages before we even
try to get a reference on the page, and by trying to unmap the page if it
is still mapped. As explained by Naoya
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself
about that. Naoya has noted the following
: Theoretically no such guarantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also
: implies that the target page can still have PageLRU flag.
:
: 	/*
: 	 * Torn down by someone else?
: 	 */
: 	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
: 		action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
: 		res = -EBUSY;
: 		goto out;
: 	}
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.
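
A sketch of the check described above, as it would sit in the pfn walk of
do_migrate_range() (modeled on the changelog, not quoted verbatim from
the diff below):

	if (PageHWPoison(page)) {
		/*
		 * Migration would only fail on the elevated refcount and
		 * migrating broken content makes no sense anyway; skip
		 * the page and just unmap it as a catch-all safety net.
		 */
		if (WARN_ON(PageLRU(page)))
			isolate_lru_page(page);
		if (page_mapped(page))
			try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
		continue;
	}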

Debugged-by: Oscar Salvador 
Cc: stable
Reviewed-by: Oscar Salvador 
Tested-by: Oscar Salvador 
Acked-by: David Hildenbrand 
Acked-by: Naoya Horiguchi 
Signed-off-by: Michal Hocko 
---

Hi Andrew,
this has been posted as an RFC [1] previously. It took 2 versions to get
the patch right but it seems that this one should work reasonably well.
I guess we want to have it in linux-next for some time, but I do not
expect many people to test MCEs + hotremove, considering the breakage is
old and nobody has noticed so far.

[1] http://lkml.kernel.org/r/20181203100309.14784-1-mho...@kernel.org

 mm/memory_hotplug.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c6c42a7425e5..cfa1a2736876 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -34,6

MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Michal Hocko
On Wed 05-12-18 16:58:02, Linus Torvalds wrote:
[...]
> I realize that we probably do want to just have explicit policies that
> do not exist right now, but what are (a) sane defaults, and (b) sane
> policies?

I would focus on the current default first (which is defrag=madvise).
This means that we only try the cheapest possible THP without
MADV_HUGEPAGE. If there is none we simply fallback. We do restrict to
the local node. I guess there is a general agreement that this is a sane
default.
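
For context, a condensed sketch of how the defrag modes translate to gfp
flags at allocation time; defrag_mode and the DEFRAG_* names below stand
in for the real transparent_hugepage_flags bit tests, so treat this as an
approximation of alloc_hugepage_direct_gfpmask() rather than the exact
upstream code:

	const bool madvised = !!(vma->vm_flags & VM_HUGEPAGE);

	switch (defrag_mode) {
	case DEFRAG_ALWAYS:
		return GFP_TRANSHUGE | (madvised ? 0 : __GFP_NORETRY);
	case DEFRAG_DEFER:
		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
	case DEFRAG_DEFER_MADVISE:
		return GFP_TRANSHUGE_LIGHT |
			(madvised ? __GFP_KSWAPD_RECLAIM : 0);
	case DEFRAG_MADVISE:	/* the current default discussed here */
		return GFP_TRANSHUGE_LIGHT |
			(madvised ? __GFP_DIRECT_RECLAIM : 0);
	default:		/* defrag=never */
		return GFP_TRANSHUGE_LIGHT;
	}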

MADV_HUGEPAGE changes the picture because the caller expressed a need
for THP and is willing to go extra mile to get it. That involves
allocation latency and as of now also a potential remote access. We do
not have complete agreement on the later but the prevailing argument is
that any strong NUMA locality is just reinventing node-reclaim story
again or makes THP success rate down the toilet (to quote Mel). I agree
that we do not want to fallback to a remote node overeagerly. I believe
that something like the below would be sensible
1) THP on a local node with compaction not giving up too early
2) THP on a remote node in NOWAIT mode - so no direct
   compaction/reclaim (trigger kswapd/kcompactd only for
   defrag=defer+madvise)
3) fallback to the base page allocation
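
A minimal sketch of that fallback order, assuming the usual THP fault
path plumbing (nid and gfp come from the fault path; this is a proposal,
not existing upstream code):

	static struct page *thp_fault_alloc(gfp_t gfp, int nid)
	{
		struct page *page;

		/* 1) try reasonably hard for a THP on the local node */
		page = __alloc_pages_node(nid, gfp | __GFP_THISNODE,
					  HPAGE_PMD_ORDER);
		if (page)
			return page;

		/*
		 * 2) any node, but without direct reclaim/compaction;
		 *    kswapd/kcompactd would only be woken for
		 *    defrag=defer+madvise
		 */
		page = alloc_pages(gfp & ~__GFP_DIRECT_RECLAIM,
				   HPAGE_PMD_ORDER);
		if (page)
			return page;

		/* 3) NULL makes the caller fall back to base pages */
		return NULL;
	}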

This would allow both full memory utilization and try to be as local as
possible. Whoever strongly prefers NUMA locality should be using
MPOL_NODE_RECLAIM (or similar) and that would skip 2 and make 1) and 2)
use more aggressive compaction and reclaim.

This will also fit into our existing NUMA api. MPOL_NODE_RECLAIM
wouldn't be restricted to THP obviously. It would act on base pages as
well and it would basically use the same implementation as we have for
the global node_reclaim and make it usable again.
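
To make the proposal concrete, a hypothetical userspace usage
(MPOL_NODE_RECLAIM does not exist; set_mempolicy(2) and the nodemask
handling are the existing API):

	#include <numaif.h>

	/* preferred_node is whatever node the workload wants to stick to */
	unsigned long nodemask = 1UL << preferred_node;

	/* opt this task into node-reclaim-like local allocation */
	set_mempolicy(MPOL_NODE_RECLAIM, &nodemask, 8 * sizeof(nodemask));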

Does this sound at least remotely sane?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 05:21:38, Naoya Horiguchi wrote:
> On Wed, Dec 05, 2018 at 05:57:16PM +0100, Michal Hocko wrote:
> > On Wed 05-12-18 13:29:18, Michal Hocko wrote:
> > [...]
> > > After some more thinking I am not really sure the above reasoning is
> > > still true with the current upstream kernel. Maybe I just managed to
> > > confuse myself so please hold off on this patch for now. Testing by
> > > Oscar has shown this patch is helping but the changelog might need to be
> > > updated.
> > 
> > OK, so Oscar has nailed it down and it seems that the 4.4 kernel we have
> > been debugging on behaves slightly differently. The underlying problem is
> > the same though. So I have reworded the changelog and added "just in
> > case" PageLRU handling. Naoya, maybe you have an argument that would
> > make this void for current upstream kernels.
> 
> The following commit (not in 4.4.x stable tree) might explain the
> difference you experienced:
> 
>   commit 286c469a988fbaf68e3a97ddf1e6c245c1446968  
>   Author: Naoya Horiguchi   
>   Date:   Wed May 3 14:56:22 2017 -0700
>
>   mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
> 
> This commit adds shake_page() for mlocked pages to make sure that the target
> page is flushed out of the LRU cache. Without this shake_page(), the
> subsequent delete_from_lru_cache() (from me_pagecache_clean()) fails to
> isolate it and the page finally returns to the LRU list. So this scenario
> leads to a "hwpoisoned but still linked to the LRU list" page.

OK, I see. So does that mean that the LRU handling is no longer needed
and there is a guarantee that all kernels with the above commit cannot
ever get an LRU page?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 11:07:33, Pingfan Liu wrote:
> On Wed, Dec 5, 2018 at 5:40 PM Vlastimil Babka  wrote:
> >
> > On 12/5/18 10:29 AM, Pingfan Liu wrote:
> > >> [0.007418] Early memory node ranges
> > >> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > >> [0.007420]   node   1: [mem 0x0009-0x0009]
> > >> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > >> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > >> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > >> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > >> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> > >>
> > >> There is clearly no node2. Where did the driver get the node2 from?
> >
> > I don't understand these tables too much, but it seems the other nodes
> > exist without them:
> >
> > [0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2
> >
> > Maybe the nodes are hotplugable or something?
> >
> I also not sure about it, and just have a hurry look at acpi spec. I
> will reply it on another email, and Cced some acpi guys about it
> 
> > > Since using nr_cpus=4, node2 is not instantiated by the x86
> > > initialization code.
> >
> > Indeed, nr_cpus seems to restrict what nodes we allocate and populate
> > zonelists for.
> 
> Yes, in init_cpu_to_node(), since nr_cpus limits the possible cpus,
> which affects the loop for_each_possible_cpu(cpu) and skips node2
> in this case.

Thanks for pointing this out. It made my life easier. So I think the
bug is that we call init_memory_less_node from this path. I suspect
numa_register_memblks is the right place to do this. I admit I
am not 100% sure, but could you give this a try please?

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..4575ae4d5449 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
}
 }
 
+static void __init init_memory_less_node(int nid)
+{
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+   free_area_init_node(nid, zones_size, 0, zholes_size);
+
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
unsigned long uninitialized_var(pfn_align);
@@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo 
*mi)
continue;
 
alloc_node_data(nid);
+   if (!end)
+   init_memory_less_node(nid);
}
 
/* Dump memblock with node info and return. */
@@ -721,21 +736,6 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-   unsigned long zones_size[MAX_NR_ZONES] = {0};
-   unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-   /* Allocate and initialize node data. Memory-less node is now online.*/
-   alloc_node_data(nid);
-   free_area_init_node(nid, zones_size, 0, zholes_size);
-
-   /*
-* All zonelists will be built later in start_kernel() after per cpu
-* areas are initialized.
-*/
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;
 
-   if (!node_online(node))
-   init_memory_less_node(node);
-
numa_set_node(cpu, node);
}
 }
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Michal Hocko
On Thu 06-12-18 11:34:30, Pingfan Liu wrote:
[...]
> > I suspect we are looking at two issues here. The first one, and a more
> > important one is that there is a NUMA affinity configured for the device
> > to a non-existing node. The second one is that nr_cpus affects
> > initialization of possible nodes.
> 
> The dev->numa_node info is extracted from the ACPI table and does not
> depend on the instantiation of the numa node, which may be limited by
> nr_cpus. Hence the node exists, it is just not instantiated.

Hmm, binding to a memory-less node is quite dubious. But OK. I am not sure
how much sanitization we can do. We need to fall back anyway, so we had
better make sure that all possible nodes are initialized regardless of
nr_cpus. I will look into that.

-- 
Michal Hocko
SUSE Labs


Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 11:49:26, David Rientjes wrote:
> On Wed, 5 Dec 2018, Michal Hocko wrote:
> 
> > > The revert is certainly needed to prevent the regression, yes, but I 
> > > anticipate that Andrea will report back that patch 2 at least improves 
> > > the 
> > > situation for the problem that he was addressing, specifically that it is 
> > > pointless to thrash any node or reclaim unnecessarily when compaction has 
> > > already failed.  This is what setting __GFP_NORETRY for all thp fault 
> > > allocations fixes.
> > 
> > Yes, but the earlier numbers from Mel, repeated again in [1], simply show
> > that the swap storms are avoided only at the cost of an absolute drop in
> > THP success rate.
> >  
> 
> As we've been over countless times, this is the desired effect for 
> workloads that fit on a single node.  We want local pages of the native 
> page size because they (1) are accessed faster than remote hugepages and 
> (2) are candidates for collapse by khugepaged.
> 
> For applications that do not fit in a single node, we have discussed 
> possible ways to extend the API to allow remote faulting of hugepages, 
> absent remote fragmentation as well, then the long-standing behavior is 
> preserved and large applications can use the API to increase their thp 
> success rate.

OK, I just give up. This doesn't lead anywhere. You keep repeating the
same stuff over and over, neglecting other usecases and actually forcing
them to do something special just to keep your very specific usecase,
which you clearly refuse to abstract into a form other people can
experiment with, or at least to provide more detailed, broken-down numbers
for a more serious analysis. Fault latency is only a part of a picture
that is much more complex. Look at Mel's report to get an impression of
what might be really useful for a _productive_ discussion.
-- 
Michal Hocko
SUSE Labs


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 11:24:53, David Rientjes wrote:
> On Wed, 5 Dec 2018, Michal Hocko wrote:
> 
> > > > At minimum do not remove the cleanup part which consolidates the gfp
> > > > hadnling to a single place. There is no real reason to have the
> > > > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> > > > 
> > > 
> > > The __GFP_THISNODE usage is still confined to 
> > > alloc_hugepage_direct_gfpmask() for the thp fault path, we no longer set 
> > > it in alloc_pages_vma() as done before the cleanup.
> > 
> > Why should be new_page any different?
> > 
> 
> To match alloc_new_node_page() which does it correctly and does not change 
> the behavior of mbind() that the cleanup did, which used 
> alloc_hugepage_vma() to get the __GFP_THISNODE behavior.  If there is a 
> reason mbind() is somehow different wrt allocating hugepages locally, I 
> think that should be a separate patch, but the goal of this patch is to 
> revert all the behavioral change that caused hugepages to be allocated 
> remotely.

If the __GFP_THISNODE should be really used then it should be applied to
all other types of pages. Not only THP. And as such done in a separate
patch. Not a part of the revert. The cleanup was meant to unify THP
allocations and that is why I object to reverting it as a part of this
work.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 13:29:18, Michal Hocko wrote:
[...]
> After some more thinking I am not really sure the above reasoning is
> still true with the current upstream kernel. Maybe I just managed to
> confuse myself so please hold off on this patch for now. Testing by
> Oscar has shown this patch is helping but the changelog might need to be
> updated.

OK, so Oscar has nailed it down and it seems that the 4.4 kernel we have
been debugging on behaves slightly differently. The underlying problem is
the same though. So I have reworded the changelog and added "just in
case" PageLRU handling. Naoya, maybe you have an argument that would
make this void for current upstream kernels.

I have dropped all the reviewed tags as the patch has changed slightly.
Thanks a lot to Oscar for his patience and testing he has devoted to
this issue.

Btw. the way we drop all the work on the first page that we cannot
isolate is just goofy. Why don't we simply migrate everything we already
have on the list and go on? Something for a followup cleanup though.
---
From 909521051f41ae46a841b481acaf1ed9c695ae7b Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 3 Dec 2018 10:27:18 +0100
Subject: [PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be
 offlined

We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding on a 4.4 base kernel. The
underlying reason was that the HWPoison page has an elevated reference
count and the migration keeps failing. There are two problems with that.
First of all it is dubious to migrate the poisoned page because we know
that accessing that memory may fail. Secondly it doesn't make any
sense to migrate potentially broken content and preserve the memory
corruption in a new location.

Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase
===

/* Standard headers added so the testcase compiles standalone;
 * a PAGE_SIZE of 4096 is assumed, matching the report. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	int ret;
	int i;
	int fd;
	char *array = malloc(4096);
	char *array_locked = malloc(4096);

	fd = open("/tmp/data", O_RDONLY);
	read(fd, array, 4095);

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	/* note: sizeof(array_locked) is the pointer size; kept as in the
	 * original report */
	ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked),
		    sizeof(array_locked));
	if (ret)
		perror("mlock");

	sleep(20);

	ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096,
		      MADV_HWPOISON);
	if (ret)
		perror("madvise");

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	return 0;
}
===

+ offline this memory.

In 4.4 kernels he saw the hwpoisoned page returned to the LRU
list
kernel:  [] dump_trace+0x59/0x340
kernel:  [] show_stack_log_lvl+0xea/0x170
kernel:  [] show_stack+0x21/0x40
kernel:  [] dump_stack+0x5c/0x7c
kernel:  [] warn_slowpath_common+0x81/0xb0
kernel:  [] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [] pagevec_lru_move_fn+0xad/0x100
kernel:  [] __lru_cache_add+0x6c/0xb0
kernel:  [] add_to_page_cache_lru+0x46/0x70
kernel:  [] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [] __do_page_cache_readahead+0x177/0x200
kernel:  [] ondemand_readahead+0x168/0x2a0
kernel:  [] generic_file_read_iter+0x41f/0x660
kernel:  [] __vfs_read+0xcd/0x140
kernel:  [] vfs_read+0x7a/0x120
kernel:  [] kernel_read+0x3b/0x50
kernel:  [] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [] do_execve+0x28/0x30
kernel:  [] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [] ret_from_fork+0x55/0x80

And that later confuses the hotremove path because migration of the LRU
page is attempted and fails due to an elevated reference
count. It is quite possible that the reuse of the HWPoisoned page is
some kind of fixed race condition but I am not really sure about that.

With the upstream kernel the failure is slightly different. The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails
and do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.

Fix both cases by explicitly checking for HWPoisoned pages before we even
try to get a reference on the page, and by trying to unmap the page if it
is still mapped. As explained by Naoya
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself
about that.

Debugged-by: Oscar Salvador 
Cc: stable
Signed-off-by: Michal Hocko 
---
 mm/memory_hotplug.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c6c42a7425e5..cfa1a2736876 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
 #include 
 #i

Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-05 Thread Michal Hocko
On Mon 03-12-18 11:03:09, Michal Hocko wrote:
> From: Michal Hocko 
> 
> We have received a bug report that an injected MCE about faulty memory
> prevents memory offline from succeeding. The underlying reason is that the
> HWPoison page has an elevated reference count and the migration keeps
> failing. There are two problems with that. First of all it is dubious
> to migrate the poisoned page because we know that accessing that memory
> may fail. Secondly it doesn't make any sense to migrate
> potentially broken content and preserve the memory corruption in a
> new location.
> 
> Oscar has found out that it is the elevated reference count from
> memory_failure that is confusing the offlining path. HWPoisoned pages
> are isolated from the LRU list but __offline_pages might still try to
> migrate them if there is any preceding migrateable pages in the pfn
> range. Such a migration would fail due to the reference count but
> the migration code would put it back on the LRU list. This is quite
> wrong in itself but it would also make scan_movable_pages stumble over
> it again without any way out.
> 
> This means that hotremove with hwpoisoned pages has never really
> worked (without luck). HWPoisoning really needs a larger surgery
> but an immediate and backportable fix is to skip over these pages during
> offlining. Even if they are still mapped for some reason then
> try_to_unmap should turn those mappings into hwpoison ptes and cause
> SIGBUS on access. Nobody should be really touching the content of the
> page so it should be safe to ignore them even when there is a pending
> reference count.

After some more thinking I am not really sure the above reasoning is
still true with the current upstream kernel. Maybe I just managed to
confuse myself so please hold off on this patch for now. Testing by
Oscar has shown this patch is helping but the changelog might need to be
updated.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 10:43:43, Mel Gorman wrote:
> On Wed, Dec 05, 2018 at 10:08:56AM +0100, Michal Hocko wrote:
> > On Tue 04-12-18 16:47:23, David Rientjes wrote:
> > > On Tue, 4 Dec 2018, Mel Gorman wrote:
> > > 
> > > > What should also be kept in mind is that we should avoid conflating
> > > > locality preferences with THP preferences which is separate from THP
> > > > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > > > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > > > are used which is very unfortunate given that MADV_HUGEPAGE in itself 
> > > > says
> > > > nothing about locality -- that is the business of other madvise flags or
> > > > a specific policy.
> > > 
> > > We currently lack those other madvise modes or mempolicies: mbind() is 
> > > not 
> > > a viable alternative because we do not want to oom kill when local memory 
> > > is depleted, we want to fallback to remote memory.
> > 
> > Yes, there was a clear agreement that there is no suitable mempolicy
> > right now and there were proposals to introduce MPOL_NODE_RECLAIM to
> > introduce that behavior. This would be an improvement regardless of THP
> > because global node-reclaim policy was simply a disaster we had to turn
> > off by default and the global semantic was a reason people just gave up
> > using it completely.
> > 
> 
> The alternative is to define a clear semantic for THP allocation
> requests that are considered "light" regardless of whether that needs a
> GFP flag or not. A sensible default might be
> 
> o Allocate THP local if the amount of work is light or non-existant.
> o Allocate THP remote if one is freely available with no additional work
>   (maybe kick remote kcompactd)
> o Allocate base page local if the amount of work is light or non-existant
> o Allocate base page remote if the amount of work is light or non-existant
> o Do heavy work in zonelist order until a base page is allocated somewhere

I am not sure about the ordering without deeper consideration, but I
think THP should reflect the approach we have for base pages.

> It's not something could be clearly expressed with either NORETRY or
> THISNODE but longer-term might be saner than chopping and changing on
> which flags are more important and which workload is most relevant. That
> runs the risk of a revert-loop where each person targetting one workload
> reverts one patch to insert another until someone throws up their hands
> in frustration and just carries patches out-of-tree long-term.

Fully agreed!

> I'm not going to prototype something along these lines for now as
> fundamentally a better compaction could cut out part of the root cause
> of pain.

Yes there is some ground work to be done first.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 2/3] mm: Add support for kmem caches in DMA32 zone

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 19:01:03, Nicolas Boichat wrote:
[...]
> > Secondly, why do we need a new sysfs file? Who is going to consume it?
> 
> We have cache_dma, so it seems consistent to add cache_dma32.

I wouldn't copy a pattern unless there is an explicit usecase for it.
We do expose way too much to userspace and that keeps kicking us later.
Not that I am aware of any specific example for cache_dma but seeing
other examples I would rather be more careful.

> I wasn't aware of tools/vm/slabinfo.c, so I can add support for
> cache_dma32 in a follow-up patch. Any other user I should take care
> of?

In general zones are internal MM implementation details and the less we
export to userspace the better.

> > Then why do we need SLAB_MERGE_SAME to cover GFP_DMA32 as well?
> 
> SLAB_MERGE_SAME tells us which flags _need_ to be the same for the
> slabs to be merged. We don't want slab caches with GFP_DMA32 and
> ~GFP_DMA32 to be merged, so it should be in there.
> (https://elixir.bootlin.com/linux/v4.19.6/source/mm/slab_common.c#L342).

Ohh, my bad, I have misread the change. Sure we definitely not want to
allow merging here. My bad.
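
For reference, SLAB_MERGE_SAME in mm/slab_common.c is the set of flags
that must match before two caches may be merged; with the proposed flag
it would read roughly:

	#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
				 SLAB_CACHE_DMA32 | SLAB_ACCOUNT)

so a GFP_DMA32 cache can only ever be merged with another GFP_DMA32 one.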
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Michal Hocko
On Tue 04-12-18 16:07:27, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> 
> > The thing I am really up to here is that reintroduction of
> > __GFP_THISNODE, which you are pushing for, will conflate madvise mode
> > resp. defrag=always with a numa placement policy because the allocation
> > doesn't fallback to a remote node.
> > 
> 
> It isn't specific to MADV_HUGEPAGE, it is the policy for all transparent 
> hugepage allocations, including defrag=always.  We agree that 
> MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate 
> a hugepage locally, try compaction synchronous to the fault, allow remote 
> fallback?  It's undefined.

Yeah, it is certainly underdefined. One thing is clear though. Using
MADV_HUGEPAGE implies that the specific mapping benefits from THPs and
is willing to pay associated init cost. This doesn't imply anything
regarding NUMA locality and as we have NUMA API it shouldn't even
attempt to do so because it would be conflating two things.
[...]

> > And that is a fundamental problem and the antipattern I am talking
> > about. Look at it this way. All normal allocations are utilizing all the
> > available memory even though they might hit a remote latency penalty. If
> > you do care about NUMA placement you have an API to enforce a specific
> > placement.  What is so different about THP to behave differently. Do
> > we really want to later invent an API to actually allow to utilize all
> > the memory? There are certainly usecases (that triggered the discussion
> > previously) that do not mind the remote latency because all other
> > benefits simply outweight it?
> > 
> 
> What is different about THP is that on every platform I have measured, 
> NUMA matters more than hugepages.  Obviously if on Broadwell, Haswell, and 
> Rome, remote hugepages were a performance win over local pages, this 
> discussion would not be happening.  Faulting local pages rather than 
> local hugepages, if possible, is easy and doesn't require reclaim.  
> Faulting remote pages rather than reclaiming local pages is easy in your 
> scenario, it's non-disruptive.

You keep ignoring all other usecases mentioned before and that is not
really helpful. Access cost can be amortized by other savings. Not to
mention NUMA balancing moving around hot THPs with remote accesses.

> So to answer "what is so different about THP?", it's the performance data.  
> The NUMA locality matters more than whether the pages are huge or not.  We 
> also have the added benefit of khugepaged being able to collapse pages 
> locally if fragmentation improves rather than being stuck accessing a 
> remote hugepage forever.

Please back your claims with a variety of workloads, including the
mentioned KVM one. You keep hand waving about access latency, completely
ignoring all other aspects, and that makes my suspicion that you do not
really appreciate all the complexity here even stronger.

If there was a general consensus that we want to make THP very special
wrt. numa locality, I could live with that. It would be an inconsistency
in the API and as such something that will kick us sooner or later. But it
seems that _you_ are the only one pushing in that direction and you keep
ignoring all other usecases consistently throughout all the discussions
we have had so far. Several people keep pointing out that this is the
wrong direction but that seems to be completely ignored.

I believe that the only way forward is to back your claims with numbers
covering a larger set of THP users and prove that remote THP is
the wrong default behavior. But you cannot really push that through based
on a single usecase of yours which you refuse to describe beyond a
simple access latency metric.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 2/3] mm: Add support for kmem caches in DMA32 zone

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 13:48:27, Nicolas Boichat wrote:
> In some cases (e.g. IOMMU ARMv7s page allocator), we need to allocate
> data structures smaller than a page with GFP_DMA32 flag.
> 
> This change makes it possible to create a custom cache in DMA32 zone
> using kmem_cache_create, then allocate memory using kmem_cache_alloc.
> 
> We do not create a DMA32 kmalloc cache array, as there are currently
> no users of kmalloc(..., GFP_DMA32). The new test in check_slab_flags
> ensures that such calls still fail (as they do before this change).

The changelog should be much more specific about decisions made here.
First of all it would be nice to mention the usecase.

Secondly, why do we need a new sysfs file? Who is going to consume it?

Then why do we need SLAB_MERGE_SAME to cover GFP_DMA32 as well? I
thought the whole point is to use dedicated slab cache. Who is this
going to merge with?
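
For reference, a minimal sketch of the usage this series enables; the
SLAB_CACHE_DMA32 flag name is assumed from the series (not from an
existing release) and the 1KB size approximates the ARMv7s L2 table case:

	#include <linux/slab.h>
	#include <linux/sizes.h>

	static struct kmem_cache *l2_cache;

	/* backing pages come from ZONE_DMA32 because of the cache flag */
	l2_cache = kmem_cache_create("armv7s_l2", SZ_1K, SZ_1K,
				     SLAB_CACHE_DMA32, NULL);
	table = kmem_cache_alloc(l2_cache, GFP_KERNEL);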
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 17:29:31, Pingfan Liu wrote:
> On Wed, Dec 5, 2018 at 5:21 PM Michal Hocko  wrote:
> >
> > On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> > > On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko  wrote:
> > > >
> > > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > > > During my test on some AMD machine, with kexec -l nr_cpus=x 
> > > > > > > option, the
> > > > > > > kernel failed to bootup, because some node's data struct can not 
> > > > > > > be allocated,
> > > > > > > e.g, on x86, initialized by 
> > > > > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > > > > device->numa_node info is used as preferred_nid param for
> > > > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > > > This patch tries to fix the issue by falling back to the first 
> > > > > > > online node,
> > > > > > > when encountering such corner case.
> > > > > >
> > > > > > We have seen similar issues already and the bug was usually that the
> > > > > > zonelists were not initialized yet or the node is completely bogus.
> > > > > > Zonelists should be initialized by build_all_zonelists quite early
> > > > > > so I am wondering whether the latter is the case. What is the
> > > > > > actual node number the device is associated with?
> > > > > >
> > > > > The device's node num is 2. And in my case, I used the nr_cpus
> > > > > param; init_cpu_to_node() normally initializes all the possible
> > > > > nodes. It is hard for me to figure out how, without this param,
> > > > > the zonelists would be accessed before the page allocator works.
> > > >
> > > > I believe we should focus on this. Why does the node have no zonelist
> > > > even though all zonelists should be initialized already? Maybe this is
> > > > nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > > > Or maybe the device is associated with a non-existing node with that
> > > > setup. A full dmesg might help us here.
> > > >
> > > After getting the machine again, I got the following without the nr_cpus
> > > option
> > > [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> > > [root@dell-per7425-03 node]# ls
> > > has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> > > node4  node5  node6  node7  online  possible  power  uevent
> > > [root@dell-per7425-03 node]# cat has_cpu
> > > 0-7
> > > [root@dell-per7425-03 node]# cat has_memory
> > > 1,5
> > > [root@dell-per7425-03 node]# cat online
> > > 0-7
> > > [root@dell-per7425-03 node]# cat possible
> > > 0-7
> > > And lscpu shows the following numa-cpu info:
> > > NUMA node0 CPU(s): 0,8,16,24
> > > NUMA node1 CPU(s): 2,10,18,26
> > > NUMA node2 CPU(s): 4,12,20,28
> > > NUMA node3 CPU(s): 6,14,22,30
> > > NUMA node4 CPU(s): 1,9,17,25
> > > NUMA node5 CPU(s): 3,11,19,27
> > > NUMA node6 CPU(s): 5,13,21,29
> > > NUMA node7 CPU(s): 7,15,23,31
> > >
> > > For the full panic message (I masked some hostname info with xx),
> > > please see the attachment.
> > > In short, it seems to be a problem with nr_cpus; without this
> > > option, the kernel boots up correctly.
> >
> > Yep.
> > [0.007418] Early memory node ranges
> > [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > [0.007420]   node   1: [mem 0x0009-0x0009]
> > [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> >
> > There is clearly no node2. Where did the driver get the node2 from?
> Since using nr_cpus=4, node2 is not instantiated by the x86
> initialization code.

Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko  wrote:
> >
> > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> > > >
> > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > During my test on some AMD machine, with the kexec -l nr_cpus=x option,
> > > > > the kernel failed to boot up, because some node's data struct cannot be
> > > > > allocated, e.g., on x86, initialized by
> > > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > > device->numa_node info is used as preferred_nid param for
> > > > > __alloc_pages_nodemask(), which causes a NULL reference
> > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > This patch tries to fix the issue by falling back to the first online
> > > > > node when encountering such a corner case.
> > > >
> > > > We have seen similar issues already and the bug was usually that the
> > > > zonelists were not initialized yet or the node is completely bogus.
> > > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > > am wondering whether the latter is the case. What is the actual node
> > > > number the device is associated with?
> > > >
> > > The device's node num is 2. And in my case, I used the nr_cpus param.
> > > Due to init_cpu_to_node() initializing all the possible nodes, it is hard
> > > for me to figure out how, without this param, zonelists are accessed
> > > before the page allocator works.
> >
> > I believe we should focus on this. Why does the node have no zonelist
> > even though all zonelists should be initialized already? Maybe this is
> > a nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > Or maybe the device is associated to a non-existing node with that
> > setup. A full dmesg might help us here.
> >
> After acquiring the machine again, I got the following without the nr_cpus option
> [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> [root@dell-per7425-03 node]# ls
> has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> node4  node5  node6  node7  online  possible  power  uevent
> [root@dell-per7425-03 node]# cat has_cpu
> 0-7
> [root@dell-per7425-03 node]# cat has_memory
> 1,5
> [root@dell-per7425-03 node]# cat online
> 0-7
> [root@dell-per7425-03 node]# cat possible
> 0-7
> And lscpu shows the following numa-cpu info:
> NUMA node0 CPU(s): 0,8,16,24
> NUMA node1 CPU(s): 2,10,18,26
> NUMA node2 CPU(s): 4,12,20,28
> NUMA node3 CPU(s): 6,14,22,30
> NUMA node4 CPU(s): 1,9,17,25
> NUMA node5 CPU(s): 3,11,19,27
> NUMA node6 CPU(s): 5,13,21,29
> NUMA node7 CPU(s): 7,15,23,31
> 
> For the full panic message (I masked some hostname info with xx),
> please see the attachment.
> In short, it seems to be a problem with nr_cpus; without this
> option, the kernel boots up correctly.

Yep.
[0.007418] Early memory node ranges
[0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]

There is clearly no node2. Where did the driver get the node2 from?
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Michal Hocko
On Tue 04-12-18 16:47:23, David Rientjes wrote:
> On Tue, 4 Dec 2018, Mel Gorman wrote:
> 
> > What should also be kept in mind is that we should avoid conflating
> > locality preferences with THP preferences which is separate from THP
> > allocation latencies. The whole __GFP_THISNODE approach is pushing too
> > hard on locality versus huge pages when MADV_HUGEPAGE or always-defrag
> > are used which is very unfortunate given that MADV_HUGEPAGE in itself says
> > nothing about locality -- that is the business of other madvise flags or
> > a specific policy.
> 
> We currently lack those other madvise modes or mempolicies: mbind() is not 
> a viable alternative because we do not want to oom kill when local memory 
> is depleted, we want to fallback to remote memory.

Yes, there was a clear agreement that there is no suitable mempolicy
right now and there were proposals to introduce MPOL_NODE_RECLAIM to
introduce that behavior. This would be an improvement regardless of THP
because global node-reclaim policy was simply a disaster we had to turn
off by default and the global semantic was a reason people just gave up
using it completely.

[...]

> Sure, but not at the cost of regressing real-world workloads; what is 
> being asked for here is legitimate and worthy of an extension, but since 
> the long-standing behavior has been to use __GFP_THISNODE and people 
> depend on that for NUMA locality,

Well, your patch has altered the semantic and has introduced a subtle
and _undocumented_ NUMA policy into MADV_HUGEPAGE. All that without any
workload numbers. It would be preferable to have a simulator of those
real world workloads of course but even getting some more detailed
metric - e.g. without the patch we have X THP utilization and the
runtime characteristics Y, and with it X1 and Y1 - would help.

> can we not fix the swap storm and look 
> to extending the API to include workloads that span multiple nodes?

Yes, we definitely want to address swap storms. No question about that.
But our established approach for NUMA policy has been to fallback to
other nodes and everybody focused on NUMA latency should use NUMA API to
achieve that. Not vice versa.

As I've said in the other thread, I am OK with restoring __GFP_THISNODE for
now but we should really have a very good plan for further steps. And
that involves an agreed NUMA behavior. I haven't seen any widespread
agreement on that yet though.

[...]
> > I would also re-emphasise that a major problem with addressing this
> > problem is that we do not have a general reproducible test case for
> > David's scenario where as we do have reproduction cases for the others.
> > They're not related to KVM but that doesn't matter because it's enough
> > to have a memory hog try allocating more memory than fits on a single node.
> > 
> 
> It's trivial to reproduce this issue: fragment all local memory that 
> compaction cannot resolve, do posix_memalign() for hugepage aligned 
> memory, and measure the access latency.  To fragment all local memory, you 
> can simply insert a kernel module and allocate high-order memory (just do 
> kmem_cache_alloc_node() or get_page() to pin it so compaction fails or 
> punch holes in the file as you did above).  You can do this for all memory 
> rather than the local node to measure the even more severe allocation 
> latency regression that not setting __GFP_THISNODE introduces.

Sure, but can we get some numbers from a real workload rather than an
artificial worst case? The utilization issue Mel pointed out before and
here again is a real concern IMHO. We definitely need a better
picture to make an educated decision.
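
For illustration, the measurement described in the quoted paragraph boils
down to something like this minimal userspace sketch (sizes are arbitrary
and the fragmentation step via a kernel module is omitted):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define ALIGN	(2UL << 20)	/* 2 MB, the x86-64 hugepage size */
#define SZ	(512 * ALIGN)	/* 1 GB */

int main(void)
{
	void *p;
	struct timespec t0, t1;

	if (posix_memalign(&p, ALIGN, SZ))
		return 1;
	madvise(p, SZ, MADV_HUGEPAGE);	/* ask for THP backing */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(p, 1, SZ);		/* fault the memory in */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("fault+touch: %.3f s\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}
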
-- 
Michal Hocko
SUSE Labs


Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

2018-12-05 Thread Michal Hocko
On Tue 04-12-18 14:04:10, David Rientjes wrote:
> On Tue, 4 Dec 2018, Vlastimil Babka wrote:
> 
> > So, AFAIK, the situation is:
> > 
> > - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The
> > intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking
> > both as it seemed to make sense).
> 
> Yes, both are based on the preference to fault local thp and fallback to 
> local pages before allocating remotely because it does not cause the 
> performance regression introduced by not setting __GFP_THISNODE.
> 
> > - The resulting node-reclaim-like behavior regressed Andrea's KVM
> > workloads, but reverting it (only for madvised or non-default
> > defrag=always THP by commit ac5b2c18911f) would regress David's
> > workloads starting with 4.20 to pre-4.1 levels.
> > 
> 
> Almost, but the defrag=always case had the subtle difference of also 
> setting __GFP_NORETRY whereas MADV_HUGEPAGE did not.  This was different 
> than the comment in __alloc_pages_slowpath() that expected thp fault 
> allocations to be caught by checking __GFP_NORETRY.
> 
> > If the decision is that it's too late to revert a 4.1 regression for one
> > kind of workload in 4.20 because it causes regression for another
> > workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
> > and don't rush a different fix (patch 2) to 4.20. It's not a big
> > difference if a 4.1 regression is fixed in 4.20 or 4.21?
> > 
> 
> The revert is certainly needed to prevent the regression, yes, but I 
> anticipate that Andrea will report back that patch 2 at least improves the 
> situation for the problem that he was addressing, specifically that it is 
> pointless to thrash any node or reclaim unnecessarily when compaction has 
> already failed.  This is what setting __GFP_NORETRY for all thp fault 
> allocations fixes.

Yes but earlier numbers from Mel and repeated again [1] simply show
that the swap storms are only handled in favor of an absolute drop of
THP success rate.
 
> > Because there might be other unexpected consequences of patch 2 that
> > testing won't be able to catch in the remaining 4.20 rc's. And I'm not
> > even sure if it will fix Andrea's workloads. While it should prevent
> > node-reclaim-like thrashing, it will still mean that KVM (or anyone)
> > won't be able to allocate THP's remotely, even if the local node is
> > exhausted of both huge and base pages.
> > 
> 
> Patch 2 does nothing with respect to the remote allocation policy; it 
> simply prevents reclaim (and potentially thrashing).  Patch 1 sets 
> __GFP_THISNODE to prevent the remote allocation.

Yes, this is understood. So we are getting the worst of both. We have a
numa locality side effect of MADV_HUGEPAGE and we have a poor THP
utilization. So how come this is an improvement? Especially when the
reported regression hasn't been demonstrated on a real or repeatable
workload but rather on a very vague, presumably worst case behavior where
the access penalty is absolutely prevailing.

[1] http://lkml.kernel.org/r/20181204104558.gv23...@techsingularity.net
-- 
Michal Hocko
SUSE Labs


Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 14:25:54, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> 
> > > This fixes a 13.9% of remote memory access regression and 40% remote
> > > memory allocation regression on Haswell when the local node is fragmented
> > > for hugepage sized pages and memory is being faulted with either the thp
> > > defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> > > 
> > > The usecase that initially identified this issue were binaries that mremap
> > > their .text segment to be backed by transparent hugepages on startup.
> > > They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().
> > 
> > Do you have something you can share with so that other people can play
> > and try to reproduce?
> > 
> 
> This is a single MADV_HUGEPAGE usecase, there is nothing special about it.  
> It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and 
> faulted the memory with a fragmented local node and then measured the 
> remote access latency to the remote hugepage that occurs without setting 
> __GFP_THISNODE.  You can also measure the remote allocation latency by 
> fragmenting the entire system and then faulting.
> 
> (Remapping the text segment only involves parsing /proc/self/exe, mmap, 
> madvise, memcpy, and mremap.)

How does this reflect your real workload and the regressions in it? It
certainly shows the worst case behavior where the access penalty is
prevalent while there are no other metrics which might be interesting or
even important, e.g. page table savings or the TLB pressure in general
when THPs fail too eagerly.
As Andrea mentioned there are really valid cases where the remote latency
pays off.

Have you actually seen the advertised regression in the real workload
and do you have any means to simulate that workload?

> > > This requires a full revert and partial revert of commits merged during
> > > the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
> > > __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> > > amounts of swap activity on the local zone when faulting hugepages by
> > > falling back to remote memory.  This remote allocation causes the access
> > > regression and, if fragmented, the allocation regression.
> > 
> > Have you tried to measure any of the workloads Mel and Andrea have
> > pointed out during the previous review discussion? In other words what
> > is the impact on the THP success rate and allocation latencies for other
> > usecases?
> 
> It isn't a property of the workload, it's a property of the how fragmented 
> both local and remote memory is.  In Andrea's case, I believe he has 
> stated that memory compaction has failed locally and the resulting reclaim 
> activity ends up looping and causing it the thrash the local node whereas 
> 75% of remote memory is free and not fragmented.  So we have local 
> fragmentation and reclaim is very expensive to enable compaction to 
> succeed, if it ever does succeed[*], and mostly free remote memory.
> 
> If remote memory is also fragmented, Andrea's case will run into a much 
> more severe swap storm as a result of not setting __GFP_THISNODE.  The 
> premise of the entire change is that his remote memory is mostly free so 
> fallback results in a quick allocation.  For balanced nodes, that's not 
> going to be the case.  The fix to prevent the heavy reclaim activity is to 
> set __GFP_NORETRY as the page allocator suspects, which patch 2 here does.

You are not answering my question and I take it as you haven't actually
tested this in a variety of workloads and base your assumptions on an
artificial worst case scenario. Please correct me if I am wrong.

-- 
Michal Hocko
SUSE Labs


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 13:56:30, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> 
> > > This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > > MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> > > consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
> > > 
> > > By not setting __GFP_THISNODE, applications can allocate remote hugepages
> > > when the local node is fragmented or low on memory when either the thp
> > > defrag setting is "always" or the vma has been madvised with
> > > MADV_HUGEPAGE.
> > > 
> > > Remote access to hugepages often has much higher latency than local pages
> > > of the native page size.  On Haswell, ac5b2c18911f was shown to have a
> > > 13.9% access regression after this commit for binaries that remap their
> > > text segment to be backed by transparent hugepages.
> > > 
> > > The intent of ac5b2c18911f is to address an issue where a local node is
> > > low on memory or fragmented such that a hugepage cannot be allocated.  In
> > > every scenario where this was described as a fix, there is abundant and
> > > unfragmented remote memory available to allocate from, even with a greater
> > > access latency.
> > > 
> > > If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> > > also measured on Haswell to have a 40% regression in allocation latency.
> > > 
> > > Restore __GFP_THISNODE for thp allocations.
> > > 
> > > Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE 
> > > mappings")
> > > Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
> > > alloc_hugepage_direct_gfpmask")
> > 
> > At minimum do not remove the cleanup part which consolidates the gfp
> > handling to a single place. There is no real reason to have the
> > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> > 
> 
> The __GFP_THISNODE usage is still confined to 
> alloc_hugepage_direct_gfpmask() for the thp fault path, we no longer set 
> it in alloc_pages_vma() as done before the cleanup.

Why should new_page be any different?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 09:31:11, Linus Torvalds wrote:
> On Tue, Dec 4, 2018 at 1:58 AM Michal Hocko  wrote:
> >
> > AFAIU both suspend and hibernation require the system to enter quiescent
> > state with no task potentially interfering with suspended devices. And
> > in this particular case those de-thread-ed threads will certainly not
> > interfere so silencing the lockdep sounds like a reasonable workaround.
> 
> I still think it would be better to simply not freeze killed user processes.
> 
> We already  have things like
> 
> if (test_tsk_thread_flag(p, TIF_MEMDIE))
> return false;
> 
> exactly because we do not want to freeze processes that are about to
> die due to being killed. Very similar situation: we don't want to
> freeze those processes, because doing so would halt them from freeing
> the resources that may be needed for suspend or hibernate.

Yeah, we do not freeze oom victims but that is really far from trivial
to achieve. In order to do so we postpone the quiescent state until we know
all the TIF_MEMDIE threads are gone. See oom_killer_disable. In short it
is painful code.

> How about something like we set PF_NOFREEZE when we set PF_EXITING? At
> that point we've pretty much turned into a kernel thread, no?

Hmm, that doesn't sound like a bad idea but I am not sure it will
help because those threads we are waiting for might be blocked way before
they reach PF_EXITING. I would expect that threads that actually reach
that state usually die shortly. Unless I am missing something we are
talking about basically two cases here. Those threads are frozen while
de_thread waits for them because zap_other_threads cannot wake them from
the fridge or they are blocked in uninterruptible sleep somewhere in the
kernel. The latter is the fundamental problem. The former could be hacked
around (check PF_FROZEN and use an uninterruptible wake), I guess, but I
haven't thought that through yet.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 10:33:10, Ingo Molnar wrote:
> 
> * Michal Hocko  wrote:
> 
> > I dunno. I do not use hibernation. I am a heavy user of the suspend 
> > though. I s2ram all the time. And I have certainly experienced cases 
> > where suspend has failed and I only found out later when I've picked
> > up my laptop from my heated-up bag. Nothing fatal has resulted from that
> > but this is certainly annoying.
> 
> Hm, so I (mistakenly) thought freezing was mostly limited to hibernation 
> and to a few weird cases when in flight DMA must not be suspended - but 
> I'm wrong and in practice we always freeze tasks during s2ram, right?

Yup.

> And indeed:
> 
>  config SUSPEND_FREEZER
> 	bool "Enable freezer for suspend to RAM/standby" \
> 		if ARCH_WANTS_FREEZER_CONTROL || BROKEN
> 	depends on SUSPEND
> 	default y
> 
> which is essentially always enabled on x86.
> 
> TIL ...
> 
> s2ram is obviously a huge deal.
> 
> Just a newbie question: any chance to not do any freezing at all on 
> modern laptops when doing s2ram, or at least only warn if it fails and 
> try to suspend?

AFAIU both suspend and hibernation require the system to enter quiescent
state with no task potentially interfering with suspended devices. And
in this particular case those de-thread-ed threads will certainly not
interfere so silencing the lockdep sounds like a reasonable workaround.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 09:11:05, Naoya Horiguchi wrote:
> On Tue, Dec 04, 2018 at 09:48:26AM +0100, Michal Hocko wrote:
> > On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote:
> > > On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> > > > From: Michal Hocko 
> > > > 
> > > > We have received a bug report that an injected MCE about faulty memory
> > > > prevents memory offline to succeed. The underlying reason is that the
> > > > HWPoison page has an elevated reference count and the migration keeps
> > > > failing. There are two problems with that. First of all it is dubious
> > > > to migrate the poisoned page because we know that accessing that memory
> > > > can fail. Secondly it doesn't make any sense to migrate a
> > > > potentially broken content and preserve the memory corruption over to a
> > > > new location.
> > > > 
> > > > Oscar has found out that it is the elevated reference count from
> > > > memory_failure that is confusing the offlining path. HWPoisoned pages
> > > > are isolated from the LRU list but __offline_pages might still try to
> > > > migrate them if there are any preceding migrateable pages in the pfn
> > > > range. Such a migration would fail due to the reference count but
> > > > the migration code would put it back on the LRU list. This is quite
> > > > wrong in itself but it would also make scan_movable_pages stumble over
> > > > it again without any way out.
> > > > 
> > > > This means that the hotremove with hwpoisoned pages has never really
> > > > worked (without luck). HWPoisoning really needs a larger surgery
> > > > but an immediate and backportable fix is to skip over these pages during
> > > > offlining. Even if they are still mapped for some reason then
> > > > try_to_unmap should turn those mappings into hwpoison ptes and cause
> > > > SIGBUS on access. Nobody should be really touching the content of the
> > > > page so it should be safe to ignore them even when there is a pending
> > > > reference count.
> > > > 
> > > > Debugged-by: Oscar Salvador 
> > > > Cc: stable
> > > > Signed-off-by: Michal Hocko 
> > > > ---
> > > > Hi,
> > > > I am sending this as an RFC now because I am not fully sure I see all
> > > > the consequences myself yet. This has passed a testing by Oscar but I
> > > > would highly appreciate a review from Naoya about my assumptions about
> > > > hwpoisoning. E.g. it is not entirely clear to me whether there is a
> > > > potential case where the page might be still mapped.
> > > 
> > > One potential case is ksm page, for which we give up unmapping and leave
> > > it unmapped. Other than that I don't have any idea, but any new type of
> > > page would be potentially categorized to this class.
> > 
> > Could you be more specific why hwpoison code gives up on ksm pages while
> > we can safely unmap here?
> 
> Actually no big reason. Ksm pages never dominate memory, so we simply didn't
> have strong motivation to save the pages.

OK, so the unmapping is safe. I will drop a comment. Does this look good
to you?
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 08c576d5a633..ef5d42759aa2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1370,7 +1370,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		/*
 		 * HWPoison pages have elevated reference counts so the migration would
 		 * fail on them. It also doesn't make any sense to migrate them in the
-		 * first place. Still try to unmap such a page in case it is still mapped.
+		 * first place. Still try to unmap such a page in case it is still mapped
+		 * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
+		 * the unmap as the catch all safety net).
 		 */
 		if (PageHWPoison(page)) {
 			if (page_mapped(page))
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 10:02:28, Ingo Molnar wrote:
> 
> * Michal Hocko  wrote:
> 
> > > Do we actually have reports of this happening for people outside
> > > Android?
> > 
> > Not that I am aware of.
> 
> I'd say outside of Android 99% of the use of hibernation is the fail-safe 
> that distributions offer on laptops with very low battery levels: the 
> emergency hibernation when there's almost no power left anymore.
> 
> Do these hibernation failure messages typically make it to persistent 
logs before the system loses power?

I dunno. I do not use hibernation. I am a heavy user of the suspend
though. I s2ram all the time. And I have certainly experienced cases
where suspend has failed and I only found out later when I've picked up
my laptop from my heated-up bag. Nothing fatal has resulted from that but
this is certainly annoying.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> >
> > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > During my test on some AMD machine, with the kexec -l nr_cpus=x option, the
> > > > kernel failed to boot up, because some node's data struct cannot be
> > > > allocated, e.g., on x86, initialized by
> > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > device->numa_node info is used as preferred_nid param for
> > > > __alloc_pages_nodemask(), which causes a NULL reference
> > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > This patch tries to fix the issue by falling back to the first online
> > > > node when encountering such a corner case.
> >
> > We have seen similar issues already and the bug was usually that the
> > zonelists were not initialized yet or the node is completely bogus.
> > Zonelists should be initialized by build_all_zonelists quite early so I
> > am wondering whether the latter is the case. What is the actual node
> > number the device is associated with?
> >
> The device's node num is 2. And in my case, I used the nr_cpus param.
> Due to init_cpu_to_node() initializing all the possible nodes, it is hard
> for me to figure out how, without this param, zonelists are accessed
> before the page allocator works.

I believe we should focus on this. Why does the node have no zonelist
even though all zonelists should be initialized already? Maybe this is
a nr_cpus peculiarity and we do not initialize all the existing numa nodes.
Or maybe the device is associated to a non-existing node with that
setup. A full dmesg might help us here.

> > Your patch is not correct btw, because we want to fall back to nodes in
> > distance order rather than to the first online node.
> > --
> What about this:
> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
> +
>  /*
>   * We get the zone list from the current node and the gfp_mask.
>   * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
> @@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
>   */
>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  {
> +   if (unlikely(!node_online(nid))) {
> +   nodemask_t used_mask;
> +   nodes_complement(used_mask, node_online_map);
> +   nid = find_next_best_node(nid, &used_mask);
> +   }
> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }
> 
> I just finished compiling, but have not tested it yet, since the machine is
> not on hand yet. It needs some time to get it again.

This is clearly a no-go. nodemask_t can be giant and you cannot have it
on the stack for allocation paths which might be called from a deep
stack already. Also this is called from the allocator hot paths and each
branch counts.
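
To put a number on "giant": a back-of-the-envelope userspace model,
assuming CONFIG_NODES_SHIFT=10 (i.e. a MAXSMP-style config with 1024
possible nodes):

#include <stdio.h>

/* model of the kernel's nodemask_t for CONFIG_NODES_SHIFT=10 */
#define MAX_NUMNODES (1 << 10)

typedef struct {
	unsigned long bits[MAX_NUMNODES / (8 * sizeof(unsigned long))];
} nodemask_t;

int main(void)
{
	/* 128 bytes of stack per frame, on every hot-path allocation */
	printf("sizeof(nodemask_t) = %zu bytes\n", sizeof(nodemask_t));
	return 0;
}
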

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 07:21:16, Naoya Horiguchi wrote:
> On Mon, Dec 03, 2018 at 11:03:09AM +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > We have received a bug report that an injected MCE about faulty memory
> > prevents memory offline to succeed. The underlying reason is that the
> > HWPoison page has an elevated reference count and the migration keeps
> > failing. There are two problems with that. First of all it is dubious
> > to migrate the poisoned page because we know that accessing that memory
> > can fail. Secondly it doesn't make any sense to migrate a
> > potentially broken content and preserve the memory corruption over to a
> > new location.
> > 
> > Oscar has found out that it is the elevated reference count from
> > memory_failure that is confusing the offlining path. HWPoisoned pages
> > are isolated from the LRU list but __offline_pages might still try to
> > migrate them if there are any preceding migrateable pages in the pfn
> > range. Such a migration would fail due to the reference count but
> > the migration code would put it back on the LRU list. This is quite
> > wrong in itself but it would also make scan_movable_pages stumble over
> > it again without any way out.
> > 
> > This means that the hotremove with hwpoisoned pages has never really
> > worked (without luck). HWPoisoning really needs a larger surgery
> > but an immediate and backportable fix is to skip over these pages during
> > offlining. Even if they are still mapped for some reason then
> > try_to_unmap should turn those mappings into hwpoison ptes and cause
> > SIGBUS on access. Nobody should be really touching the content of the
> > page so it should be safe to ignore them even when there is a pending
> > reference count.
> > 
> > Debugged-by: Oscar Salvador 
> > Cc: stable
> > Signed-off-by: Michal Hocko 
> > ---
> > Hi,
> > I am sending this as an RFC now because I am not fully sure I see all
> > the consequences myself yet. This has passed a testing by Oscar but I
> > would highly appreciate a review from Naoya about my assumptions about
> > hwpoisoning. E.g. it is not entirely clear to me whether there is a
> > potential case where the page might be still mapped.
> 
> One potential case is ksm page, for which we give up unmapping and leave
> it unmapped. Other than that I don't have any idea, but any new type of
> page would be potentially categorized to this class.

Could you be more specific why hwpoison code gives up on ksm pages while
we can safely unmap here?

[...]
> 
> I think this looks OK (no better idea.)
> 
> Reviewed-by: Naoya Horiguchi 

Thanks!

> I wondered why I didn't find this for long, and found that my testing only
> covered the case where PageHWPoison is the first page of memory block.
> scan_movable_pages() considers PageHWPoison as non-movable, so 
> do_migrate_range()
> started with pfn after the PageHWPoison and never tried to migrate it
> (so effectively ignored every PageHWPoison as the above code does.)

Yeah, it seems that the hotremove worked only by chance in the presence of
hwpoison pages so far. The specific usecase which triggered this patch
is a heavily memory-utilized system with an in-memory database IIRC. So it
is quite likely that hwpoison pages are punched into otherwise used
memory.

Thanks for the review Naoya!

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-04 Thread Michal Hocko
On Mon 03-12-18 13:53:21, David Rientjes wrote:
> On Mon, 3 Dec 2018, Michal Hocko wrote:
> 
> > > I think extending functionality so thp can be allocated remotely if truly 
> > > desired is worthwhile
> > 
> > This is a complete NUMA policy antipattern that we have for all other
> > user memory allocations. So far you have to be explicit for your numa
> > requirements. You are trying to conflate NUMA api with MADV and that is
> > just conflating two orthogonal things and that is just wrong.
> > 
> 
> No, the page allocator change for both my patch and __GFP_COMPACT_ONLY has 
> nothing to do with any madvise() mode.  It has to do with where thp 
> allocations are preferred.  Yes, this is different than other memory 
> allocations where it doesn't cause a 13.9% access latency regression for 
> the lifetime of a binary for users who back their text with hugepages.  
> MADV_HUGEPAGE still has its purpose to try synchronous memory compaction 
> at fault time under all thp defrag modes other than "never".  The specific 
> problem being reported here, and that both my patch and __GFP_COMPACT_ONLY 
> address, is the pointless reclaim activity that does not assist in making 
> compaction more successful.

You do not address my concern though. Sure there are reclaim related
issues. Nobody is questioning that. But that is only half of the
problem.

The thing I am really after here is that the reintroduction of
__GFP_THISNODE, which you are pushing for, will conflate the madvise mode
(resp. defrag=always) with a numa placement policy because the allocation
doesn't fall back to a remote node.

And that is a fundamental problem and the antipattern I am talking
about. Look at it this way. All normal allocations are utilizing all the
available memory even though they might hit a remote latency penalty. If
you do care about NUMA placement you have an API to enforce a specific
placement. What is so different about THP that it should behave differently? Do
we really want to later invent an API to actually allow utilizing all
the memory? There are certainly usecases (that triggered the discussion
previously) that do not mind the remote latency because all other
benefits simply outweigh it.

That being said what should users who want to use all the memory do to
use as many THPs as possible?
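
For comparison, the explicit placement API referred to above looks roughly
like this from userspace (a minimal sketch; values are illustrative, error
handling is omitted, and it links with -lnuma):

#include <numaif.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MB */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL << 0;	/* prefer node 0 */

	/* MPOL_PREFERRED still falls back to other nodes when node 0 is
	 * depleted -- placement is requested via the NUMA API ... */
	mbind(p, len, MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask), 0);
	/* ... while the THP preference stays a separate, orthogonal hint. */
	madvise(p, len, MADV_HUGEPAGE);
	return 0;
}
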
-- 
Michal Hocko
SUSE Labs


Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 15:50:18, David Rientjes wrote:
> This fixes a 13.9% of remote memory access regression and 40% remote
> memory allocation regression on Haswell when the local node is fragmented
> for hugepage sized pages and memory is being faulted with either the thp
> defrag setting of "always" or has been madvised with MADV_HUGEPAGE.
> 
> The usecase that initially identified this issue were binaries that mremap
> their .text segment to be backed by transparent hugepages on startup.
> They do mmap(), madvise(MADV_HUGEPAGE), memcpy(), and mremap().

Do you have something you can share with so that other people can play
and try to reproduce?

> This requires a full revert and partial revert of commits merged during
> the 4.20 rc cycle.  The full revert, of ac5b2c18911f ("mm: thp: relax
> __GFP_THISNODE for MADV_HUGEPAGE mappings"), was anticipated to fix large
> amounts of swap activity on the local zone when faulting hugepages by
> falling back to remote memory.  This remote allocation causes the access
> regression and, if fragmented, the allocation regression.

Have you tried to measure any of the workloads Mel and Andrea have
pointed out during the previous review discussion? In other words what
is the impact on the THP success rate and allocation latencies for other
usecases?
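
The quoted sequence, applied to an ordinary buffer rather than a live
.text segment, would look roughly like this sketch (the final mremap()
over the running text is what makes the real usecase tricky and is only
indicated in a comment):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16UL << 21;	/* 32 MB, a 2 MB multiple */
	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	madvise(dst, len, MADV_HUGEPAGE);	/* request THP backing */
	memset(src, 0xcc, len);			/* stand-in for .text */
	memcpy(dst, src, len);			/* faults in the THP copy */
	/* mremap(dst, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, src) would
	 * then move the THP-backed copy over the original range. */
	munmap(src, len);
	munmap(dst, len);
	return 0;
}
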
-- 
Michal Hocko
SUSE Labs


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-03 Thread Michal Hocko
>  	/* Synchronous compaction if madvised, otherwise kick kcompactd */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -					      __GFP_KSWAPD_RECLAIM | this_node);
> +		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +						  __GFP_KSWAPD_RECLAIM);
> +
> +	/* Only do synchronous compaction if madvised */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -					      this_node);
> -	return GFP_TRANSHUGE_LIGHT | this_node;
> +		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
> +
> +	return gfp_mask;
>  }
>  
>  /* Caller must hold page table lock. */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1116,8 +1116,9 @@ static struct page *new_page(struct page *page, unsigned long start)
>  	} else if (PageTransHuge(page)) {
>  		struct page *thp;
>  
> -		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
> -				address, numa_node_id());
> +		thp = alloc_pages_vma(GFP_TRANSHUGE | __GFP_THISNODE,
> +				HPAGE_PMD_ORDER, vma, address,
> +				numa_node_id());
>  		if (!thp)
>  			return NULL;
>  		prep_transhuge_page(thp);
> @@ -1662,7 +1663,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Michal Hocko
On Tue 04-12-18 10:40:29, Xunlei Pang wrote:
> On 2018/12/4 AM 1:22, Michal Hocko wrote:
> > On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
> >> On 2018/12/3 7:56 PM, Michal Hocko wrote:
> >>> On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
> >>>> There may be cgroup memory overcommitment; it will become
> >>>> even more common in the future.
> >>>>
> >>>> Let's enable kswapd to reclaim low-protected memory in case
> >>>> of memory pressure, to mitigate the global direct reclaim
> >>>> pressures which could cause jitters to the response time of
> >>>> latency-sensitive groups.
> >>>
> >>> Please be more descriptive about the problem you are trying to handle
> >>> here. I haven't actually read the patch but let me emphasise that the
> >>> low limit protection is an important isolation tool. And allowing kswapd to
> >>> reclaim protected memcgs is going to break the semantic as it has been
> >>> introduced and designed.
> >>
> >> We have two types of memcgs: online groups (important business)
> >> and offline groups (unimportant business). Online groups are
> >> all configured with MAX low protection, while offline groups
> >> are not protected at all (with the default low of 0).
> >>
> >> When offline groups are overcommitted, the global memory pressure
> >> suffers. This will cause the memory allocations from online groups
> >> to constantly go to the slow global direct reclaim in order to reclaim
> >> online groups' page caches, as kswapd is not able to reclaim low-protected
> >> memory. low is not a hard limit; it's reasonable for it to be reclaimed by
> >> kswapd if there's no other reclaimable memory.
> > 
> > I am sorry I still do not follow. What role do offline cgroups play?
> > Those are certainly not low mem protected because mem_cgroup_css_offline
> > will reset them to 0.
> > 
> 
> Oh, I meant "offline groups" to be "offline-business groups"; the memcgs
> referred to here are all in the "online" state from the kernel's perspective.

What is an offline-business group? Please try to explain the actual problem
in much more detail and do not let us guess.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-03 Thread Michal Hocko
On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> During my test on some AMD machine, with the kexec -l nr_cpus=x option, the
> kernel failed to boot up, because some node's data struct cannot be allocated,
> e.g., on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
> device->numa_node info is used as preferred_nid param for
> __alloc_pages_nodemask(), which causes a NULL reference
>   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> This patch tries to fix the issue by falling back to the first online node
> when encountering such a corner case.

We have seen similar issues already and the bug was usually that the
zonelists were not initialized yet or the node is completely bogus.
Zonelists should be initialized by build_all_zonelists quite early so I
am wondering whether the latter is the case. What is the actual node
number the device is associated with?

Your patch is not correct btw, because we want to fall back to nodes in
distance order rather than to the first online node.
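
Concretely, "fallback in distance order" means picking the node with the
smallest SLIT distance from the requested one. A userspace model with a
made-up two-socket distance table, where only nodes 1 and 5 have memory
(matching the has_memory mask seen in this thread):

#include <limits.h>
#include <stdio.h>

#define NR_NODES 8

/* made-up SLIT-style distances: 10 local, 16 same socket, 32 remote */
static int node_distance(int a, int b)
{
	if (a == b)
		return 10;
	return (a / 4 == b / 4) ? 16 : 32;
}

static const int has_memory[NR_NODES] = { 0, 1, 0, 0, 0, 1, 0, 0 };

static int nearest_memory_node(int nid)
{
	int n, best = -1, best_dist = INT_MAX;

	for (n = 0; n < NR_NODES; n++)
		if (has_memory[n] && node_distance(nid, n) < best_dist) {
			best_dist = node_distance(nid, n);
			best = n;
		}
	return best;
}

int main(void)
{
	/* memoryless node 2 falls back to node 1 (16) over node 5 (32) */
	printf("node 2 -> node %d\n", nearest_memory_node(2));
	return 0;
}
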
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 12:39:34, David Rientjes wrote:
> On Mon, 3 Dec 2018, Michal Hocko wrote:
> 
> > I have merely said that a better THP locality needs more work and during
> > the review discussion I have even volunteered to work on that. There
> > are other reclaim related fixes under work right now. All I am saying
> > is that MADV_HUGEPAGE having numa locality implications cannot satisfy
> > all the usecases and it is particularly KVM that suffers from it.
> 
> I think extending functionality so thp can be allocated remotely if truly 
> desired is worthwhile

This is a complete NUMA policy antipattern that we have for all other
user memory allocations. So far you have to be explicit for your numa
requirements. You are trying to conflate NUMA api with MADV and that is
just conflating two orthogonal things and that is just wrong.

Let's put the __GFP_THISNODE issue aside. I do not remember you
confirming that the __GFP_COMPACT_ONLY patch is OK for you (sorry, it might have
got lost in the email storm from back then) but if that is the only
agreeable solution for now then I can live with that. __GFP_NORETRY hack
was shown to not work properly by Mel AFAIR. Again if I misremember then
I am sorry and I can live with that. But conflating MADV_HUGEPAGE with
an implicit numa placement policy and/or adding an opt-in for remote
NUMA placing is completely backwards and a broken API which will likely
bite us later. I sincerely hope we are not going to repeat mistakes
from the past.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 10:45:35, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko  wrote:
> >
> > I do not get it. 5265047ac301 which this patch effectively reverts has
> > regressed kvm workloads. People started to notice only later because
> > they were not running on kernels with that commit until later. We have
> > 4.4 based kernels reports. What do you propose to do for those people?
> 
> We have at least two patches that others claim to fix things.
> 
> You dismissed them and said "can't be done".

You are misinterpreting my words. I haven't dismissed anything. I do
recognize both usecases under discussion.

I have merely said that a better THP locality needs more work and during
the review discussion I have even volunteered to work on that. There
are other reclaim related fixes under work right now. All I am saying
is that MADV_HUGEPAGE having numa locality implications cannot satisfy
all the usecases and it is particularly KVM that suffers from it.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 10:19:55, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko  wrote:
> >
> > The thing is that there is no universal win here. There are two
> > different types of workloads and we cannot satisfy both.
> 
> Ok, if that's the case, then I'll just revert the commit.
> 
> Michal, our rules are very simple: we don't generate regressions. It's
> better to have old reliable behavior than to start creating *new*
> problems.

I do not get it. 5265047ac301 which this patch effectively reverts has
regressed kvm workloads. People started to notice only later because
they were not running on kernels with that commit until later. We have
4.4 based kernels reports. What do you propose to do for those people?
Let me remind that it was David who introduced 5265047ac301, presumably
because his workload benefits from it.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 10:01:18, Linus Torvalds wrote:
> On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds
>  wrote:
> >
> > On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
> > >
> > > In general, memory allocation fairness among processes should be a good
> > > thing.  So I think the report should have been a "performance
> > > improvement" instead of "performance regression".
> >
> > Hey, when you put it that way...
> >
> > Let's ignore this issue for now, and see if it shows up in some real
> > workload and people complain.
> 
> Well, David Rientjes points out that it *does* cause real problems in
> real workloads, so it's not just this benchmark run that shows the
> issue.

The thing is that there is no universal win here. There are two
different types of workloads and we cannot satisfy both. This has been
discussed at length during the review process. David's workload makes
some assumptions about the MADV_HUGEPAGE numa placement. There are other
workloads like KVM setups which do not really require that and those are
the ones which regressed.

The prevalent consensus was that a NUMA placement is not really implied
by MADV_HUGEPAGE because a) this has never been documented or intended
behavior and b) it is not a universal win (basically the same as
node/zone_reclaim which used to be on by default on some NUMA setups).

Reverting the patch would regress another class of workloads. As we
cannot satisfy both I believe we should make the API clear and in favor
of a more relaxed workloads. Those with special requirements should have
a proper API to reflect that (this is our general NUMA policy pattern
already).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 23:20:31, Xunlei Pang wrote:
> On 2018/12/3 7:56 PM, Michal Hocko wrote:
> > On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
> >> There may be cgroup memory overcommitment; it will become
> >> even more common in the future.
> >>
> >> Let's enable kswapd to reclaim low-protected memory in case
> >> of memory pressure, to mitigate the global direct reclaim
> >> pressures which could cause jitters to the response time of
> >> latency-sensitive groups.
> > 
> > Please be more descriptive about the problem you are trying to handle
> > here. I haven't actually read the patch but let me emphasise that the
> > low limit protection is an important isolation tool. And allowing kswapd to
> > reclaim protected memcgs is going to break the semantic as it has been
> > introduced and designed.
> 
> We have two types of memcgs: online groups (important business)
> and offline groups (unimportant business). Online groups are
> all configured with MAX low protection, while offline groups
> are not protected at all (with the default low of 0).
> 
> When offline groups are overcommitted, the global memory pressure
> suffers. This will cause the memory allocations from online groups
> to constantly go to the slow global direct reclaim in order to reclaim
> online groups' page caches, as kswapd is not able to reclaim low-protected
> memory. low is not a hard limit; it's reasonable for it to be reclaimed by
> kswapd if there's no other reclaimable memory.

I am sorry I still do not follow. What role do offline cgroups play?
Those are certainly not low mem protected because mem_cgroup_css_offline
will reset them to 0.
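
(For context, the "MAX low protection" mentioned above corresponds to
writing "max" into the cgroup v2 interface file; a minimal sketch with an
example cgroup path:)

#include <stdio.h>

int main(void)
{
	/* example path; "max" requests the strongest memory.low protection */
	FILE *f = fopen("/sys/fs/cgroup/online-group/memory.low", "w");

	if (!f)
		return 1;
	fputs("max\n", f);
	return fclose(f) ? 1 : 0;
}
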

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 09:06:18, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 6:17 AM Michal Hocko  wrote:
> >
> > This argument just doesn't make any sense. Rare bugs are maybe even more
> > annoying because you do not expect them to happen.
> 
> Absolutely.
> 
> And valid lockdep complaints are a real issue too.

No questions about that. But in this case there is no real deadlock so
the splat is disputable. And we have a "do not use" freezable_schedule_unsafe
which shouldn't trigger the splat yet it would make the suspend issue go
away AFAIU. Not a great fix, and we should aim for something better long
term, but it is good enough as a poor man's stop-gap fix.

> So I don't think there's any question that this should be reverted,
> the only question is whether I take the revert directly or get it from
> the PM tree.
> 
> It does sound like the de_thread() behavior needs more work. Maybe,
> for example, we need to make it clear that zapped threads are *not*
> frozen, and instead the freezer will wait for them to exit?

I was actually proposing something like that during the review
discussion. Basically set PF_NOFREEZE on all the threads and wake them up.
But that is not so simple because they might be on any path when they
sleep (this is the biggest design flaw of the freezer IMHO).

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 15:14:59, Pavel Machek wrote:
> On Mon 2018-12-03 14:53:51, Michal Hocko wrote:
> > On Mon 03-12-18 14:10:06, Pavel Machek wrote:
> > > On Mon 2018-12-03 13:38:57, Michal Hocko wrote:
> > > > On Mon 03-12-18 13:31:49, Oleg Nesterov wrote:
> > > > > On 12/03, Michal Hocko wrote:
> > > > > >
> > > > > > Now, I wouldn't mind to revert this because the code is really old 
> > > > > > and
> > > > > > we haven't seen many bug reports about failing suspend yet. But 
> > > > > > what is
> > > > > > the actual plan to make this work properly?
> > > > > 
> > > > > I don't see a simple solution...
> > > > > 
> > > > > But we need to fix exec/de_thread anyway, then we can probably 
> > > > > reconsider
> > > > > this patch.
> > > > 
> > > > My concern is that the de_thread fix might be too disruptive for stable
> > > > kernels while we might want to have a simple enough fix for the
> > > > suspend issue in the meantime. That was actually the primary reason I've
> > > > acked the hack even though I didn't like it.
> > > 
> > > Do we care about failing sleep in stable? Does someone hit the issue 
> > > there?
> > > 
> > > This sounds like issue only Android is hitting, and they run very
> > > heavily patched kernels, far away from mainline or stable.
> > 
> > But the underlying issue is the same and independent of their patches
> > AFAIU. And is this really a common problem to care about in stable? I
> > dunno to be honest but it sounds annoying for sure. Failing suspend is
> > something that doesn't make your day when you are in a hurry, and you
> > only find out later when your laptop heats up your bag ;)
> 
> In general, yes. In practice, if it happens 1 in 100 suspends, you
> don't care that much (but Android cares).

This argument just doesn't make any sense. Rare bugs are maybe even more
annoying because you do not expect them to happen. But I would be more
interested to see whether there is any downside. Is there any actual
risk to silence the lockup detector that you can see?

> Do we actually have reports of this happening for people outside
> Android?

Not that I am aware of.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 14:10:06, Pavel Machek wrote:
> On Mon 2018-12-03 13:38:57, Michal Hocko wrote:
> > On Mon 03-12-18 13:31:49, Oleg Nesterov wrote:
> > > On 12/03, Michal Hocko wrote:
> > > >
> > > > Now, I wouldn't mind to revert this because the code is really old and
> > > > we haven't seen many bug reports about failing suspend yet. But what is
> > > > the actual plan to make this work properly?
> > > 
> > > I don't see a simple solution...
> > > 
> > > But we need to fix exec/de_thread anyway, then we can probably reconsider
> > > this patch.
> > 
> > My concern is that the de_thread fix might be too disruptive for stable
> > kernels while we might want to have a simple enough fix for the
> > suspend issue in the meantime. That was actually the primary reason I've
> > acked the hack even though I didn't like it.
> 
> Do we care about failing sleep in stable? Does someone hit the issue there?
> 
> This sounds like issue only Android is hitting, and they run very
> heavily patched kernels, far away from mainline or stable.

But the underlying issue is the same and independent of their patches
AFAIU. And is this really a common problem to care about in stable? I
dunno to be honest but it sounds annoying for sure. Failing suspend is
something that doesn't make your day when you are in a hurry, and you
only find out later when your laptop heats up your bag ;)

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 13:31:49, Oleg Nesterov wrote:
> On 12/03, Michal Hocko wrote:
> >
> > Now, I wouldn't mind to revert this because the code is really old and
> > we haven't seen many bug reports about failing suspend yet. But what is
> > the actual plan to make this work properly?
> 
> I don't see a simple solution...
> 
> But we need to fix exec/de_thread anyway, then we can probably reconsider
> this patch.

My concern is that the de_thread fix might be too disruptive for stable
kernels while we might want to have a simple enough fix for the
suspend issue in the meantime. That was actually the primary reason I've
acked the hack even though I didn't like it.

So can we find a way to shut the lockdep up when this is not really a
deadlock?  Or maybe this really is one and then we need a real fix for
stable as well.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 3/3] mm/memcg: Avoid reclaiming below hard protection

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 16:01:19, Xunlei Pang wrote:
> When memcgs get reclaimed after their usage exceeds min, some
> usage below the min may also be reclaimed in the current
> implementation; the amount is considerably large during kswapd
> reclaim according to my ftrace results.

And here again. Describe the setup and the behavior please?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/3] mm/vmscan: Enable kswapd to reclaim low-protected memory

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 16:01:18, Xunlei Pang wrote:
> There may be cgroup memory overcommitment; it will become
> even more common in the future.
> 
> Let's enable kswapd to reclaim low-protected memory in case
> of memory pressure, to mitigate the global direct reclaim
> pressures which could cause jitters to the response time of
> latency-sensitive groups.

Please be more descriptive about the problem you are trying to handle
here. I haven't actually read the patch but let me emphasise that the
low limit protection is an important isolation tool. And allowing kswapd to
reclaim protected memcgs is going to break the semantic as it has been
introduced and designed.

> 
> Signed-off-by: Xunlei Pang 
> ---
>  mm/vmscan.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 62ac0c488624..3d412eb91f73 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3531,6 +3531,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  
>  	count_vm_event(PAGEOUTRUN);
>  
> +retry:
>  	do {
>  		unsigned long nr_reclaimed = sc.nr_reclaimed;
>  		bool raise_priority = true;
> @@ -3622,6 +3623,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		sc.priority--;
>  	} while (sc.priority >= 1);
>  
> +	if (!sc.nr_reclaimed && sc.memcg_low_skipped) {
> +		sc.priority = DEF_PRIORITY;
> +		sc.memcg_low_reclaim = 1;
> +		sc.memcg_low_skipped = 0;
> +		goto retry;
> +	}
> +
>  	if (!sc.nr_reclaimed)
>  		pgdat->kswapd_failures++;
>  
> -- 
> 2.13.5 (Apple Git-94)
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] mm/memcg: Fix min/low usage in propagate_protected_usage()

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 16:01:17, Xunlei Pang wrote:
> When usage exceeds min, the propagated min usage should be min rather than 0.
> The same applies to low.

Why? What is the actual problem?

> Signed-off-by: Xunlei Pang 
> ---
>  mm/page_counter.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index de31470655f6..75d53f15f040 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
>  		return;
>  
>  	if (c->min || atomic_long_read(&c->min_usage)) {
> -		if (usage <= c->min)
> -			protected = usage;
> -		else
> -			protected = 0;
> -
> +		protected = min(usage, c->min);
>  		old_protected = atomic_long_xchg(&c->min_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
> @@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
>  	}
>  
>  	if (c->low || atomic_long_read(&c->low_usage)) {
> -		if (usage <= c->low)
> -			protected = usage;
> -		else
> -			protected = 0;
> -
> +		protected = min(usage, c->low);
>  		old_protected = atomic_long_xchg(&c->low_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
> -- 
> 2.13.5 (Apple Git-94)
> 
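
The semantic change the diff makes can be seen with a tiny standalone
model (a sketch; the values are arbitrary):

#include <stdio.h>

static unsigned long old_propagated(unsigned long usage, unsigned long min)
{
	return usage <= min ? usage : 0;
}

static unsigned long new_propagated(unsigned long usage, unsigned long min)
{
	return usage < min ? usage : min;	/* min(usage, min) */
}

int main(void)
{
	/* min = 100 pages, usage = 150 pages */
	printf("old: %lu, new: %lu\n",
	       old_propagated(150, 100), new_propagated(150, 100));
	/* prints "old: 0, new: 100": once usage exceeds min, the old code
	 * stops propagating any protected usage while the patched code
	 * keeps propagating the full min */
	return 0;
}
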

-- 
Michal Hocko
SUSE Labs


[RFC PATCH] hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-03 Thread Michal Hocko
From: Michal Hocko 

We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding. The underlying reason is that
the HWPoison page has an elevated reference count and the migration
keeps failing. There are two problems with that. First of all it is
dubious to migrate the poisoned page because we know that accessing
that memory may fail. Secondly it doesn't make any sense to migrate
potentially broken content and preserve the memory corruption over to a
new location.

Oscar has found out that it is the elevated reference count from
memory_failure that is confusing the offlining path. HWPoisoned pages
are isolated from the LRU list but __offline_pages might still try to
migrate them if there is any preceding migrateable pages in the pfn
range. Such a migration would fail due to the reference count but
the migration code would put it back on the LRU list. This is quite
wrong in itself but it would also make scan_movable_pages stumble over
it again without any way out.

This means that the hotremove with hwpoisoned pages has never really
worked (without luck). HWPoisoning really needs a larger surgery
but an immediate and backportable fix is to skip over these pages during
offlining. Even if they are still mapped for some reason then
try_to_unmap should turn those mappings into hwpoison ptes and cause
SIGBUS on access. Nobody should be really touching the content of the
page so it should be safe to ignore them even when there is a pending
reference count.

Debugged-by: Oscar Salvador 
Cc: stable
Signed-off-by: Michal Hocko 
---
Hi,
I am sending this as an RFC now because I am not fully sure I see all
the consequences myself yet. This has passed a testing by Oscar but I
would highly appreciate a review from Naoya about my assumptions about
hwpoisoning. E.g. it is not entirely clear to me whether there is a
potential case where the page might be still mapped. I have put
try_to_unmap just to be sure. It would be really great if I could drop
that part because then it is not really clear which of the TTU flags to
use to cover all potential cases.

I have marked the patch for stable but I have no idea how far back it
should go. Probably everything that already has hotremove and hwpoison
code.

Thanks in advance!

 mm/memory_hotplug.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c6c42a7425e5..08c576d5a633 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include <linux/rmap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1366,6 +1367,17 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			pfn = page_to_pfn(compound_head(page))
 				+ hpage_nr_pages(page) - 1;
 
+		/*
+		 * HWPoison pages have elevated reference counts so the migration would
+		 * fail on them. It also doesn't make any sense to migrate them in the
+		 * first place. Still try to unmap such a page in case it is still mapped.
+		 */
+		if (PageHWPoison(page)) {
+			if (page_mapped(page))
+				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+			continue;
+		}
+
 		if (!get_page_unless_zero(page))
 			continue;
/*
-- 
2.19.1



Re: [PATCH v2] memblock: Annotate memblock_is_reserved() with __init_memblock.

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 04:00:08, Yueyi Li wrote:
> Found warning:
> 
> WARNING: EXPORT symbol "gsi_write_channel_scratch" [vmlinux] version 
> generation failed, symbol will not be versioned.
> WARNING: vmlinux.o(.text+0x1e0a0): Section mismatch in reference from the 
> function valid_phys_addr_range() to the function 
> .init.text:memblock_is_reserved()
> The function valid_phys_addr_range() references
> the function __init memblock_is_reserved().
> This is often because valid_phys_addr_range lacks a __init
> annotation or the annotation of memblock_is_reserved is wrong.
> 
> Use __init_memblock instead of __init.

Yes, it really doesn't make much sense for this one to stand out from
all the other helpers.

> Signed-off-by: liyueyi 

Acked-by: Michal Hocko 

> ---
> 
>  Changes v2: correct typo in 'warning'.
> 
>  mm/memblock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 9a2d5ae..81ae63c 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1727,7 +1727,7 @@ static int __init_memblock memblock_search(struct memblock_type *type, phys_addr
>   return -1;
>  }
>  
> -bool __init memblock_is_reserved(phys_addr_t addr)
> +bool __init_memblock memblock_is_reserved(phys_addr_t addr)
>  {
>   return memblock_search(&memblock.reserved, addr) != -1;
>  }
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs
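
[Editor's note: the section mismatch arises because __init code is discarded
after boot while valid_phys_addr_range() can run at any time. __init_memblock
only discards the function on architectures that opt in; its definition in
include/linux/memblock.h at the time was approximately:]

#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
#define __init_memblock		__meminit
#define __initdata_memblock	__meminitdata
#else
#define __init_memblock
#define __initdata_memblock
#endif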


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 08:47:00, Ingo Molnar wrote:
[...]
> I reviewed the ->cred_guard_mutex code, and the mutex is held across all 
> of exec() - and we always did this.

Yes, this is something that has been pointed out during the review. Oleg
has argued that making this path freezable is really hard and that we
should be changing de_thread to sleep without cred_guard_mutex long term
anyway (http://lkml.kernel.org/r/20181114143705.gb13...@redhat.com).

Failing suspend seems like a real problem while the lockdep one doesn't
really reflect any real deadlock, right? So while the patch is not
perfect it shouldn't make the situation much worse. Lockdep splat is
certainly annoying but is it any worse than a suspend failing?

Now, I wouldn't mind to revert this because the code is really old and
we haven't seen many bug reports about failing suspend yet. But what is
the actual plan to make this work properly? Use
freezable_schedule_unsafe instead? Freezer code has some
fundamental design issues which are quite hard to get over.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: page_mapped: don't assume compound page is huge or THP

2018-11-30 Thread Michal Hocko
On Fri 30-11-18 15:36:51, Kirill A. Shutemov wrote:
> On Fri, Nov 30, 2018 at 01:18:51PM +0100, Michal Hocko wrote:
> > On Fri 30-11-18 13:06:57, Jan Stancek wrote:
> > > LTP proc01 testcase has been observed to rarely trigger crashes
> > > on arm64:
> > > page_mapped+0x78/0xb4
> > > stable_page_flags+0x27c/0x338
> > > kpageflags_read+0xfc/0x164
> > > proc_reg_read+0x7c/0xb8
> > > __vfs_read+0x58/0x178
> > > vfs_read+0x90/0x14c
> > > SyS_read+0x60/0xc0
> > > 
> > > Issue is that page_mapped() assumes that if compound page is not
> > > huge, then it must be THP. But if this is 'normal' compound page
> > > (COMPOUND_PAGE_DTOR), then following loop can keep running
> > > (for HPAGE_PMD_NR iterations) until it tries to read from memory
> > > that isn't mapped and triggers a panic:
> > > for (i = 0; i < hpage_nr_pages(page); i++) {
> > > if (atomic_read(&page[i]._mapcount) >= 0)
> > > return true;
> > >   }
> > > 
> > > I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
> > > with a custom kernel module [1] which:
> > > - allocates compound page (PAGEC) of order 1
> > > - allocates 2 normal pages (COPY), which are initialized to 0xff
> > >   (to satisfy _mapcount >= 0)
> > > - 2 PAGEC page structs are copied to address of first COPY page
> > > - second page of COPY is marked as not present
> > > - call to page_mapped(COPY) now triggers fault on access to 2nd
> > >   COPY page at offset 0x30 (_mapcount)
> > > 
> > > [1] 
> > > https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c
> > > 
> > > Fix the loop to iterate for "1 << compound_order" pages.
> > 
> > This is much less magic than the previous version. It is still not clear
> > to me how higher order pages other than THP get mapped to page tables
> > though. So more detailed information about the source would be really
> > welcome. Once we know that we can add a Fixes tag and also mark the
> > patch for stable because that sounds like stable material.
> 
> IIRC, sound subsystem can produce custom mapped compound pages.

Do I assume correctly that consecutive ptes simply point to subpages?

> The bug dates back to e1534ae95004.

Thanks for the pointer. I thought this was a new and creative usage of
the pte->page mapping but it really looks like your commit just changed
the underlying semantic.
 
> > > Debugged-by: Laszlo Ersek 
> > > Suggested-by: "Kirill A. Shutemov" 
> > > Signed-off-by: Jan Stancek 
> > 
> > The patch looks sensible to me
> > Acked-by: Michal Hocko 
> 
> Acked-by: Kirill A. Shutemov 
> 
> -- 
>  Kirill A. Shutemov

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: page_mapped: don't assume compound page is huge or THP

2018-11-30 Thread Michal Hocko
On Fri 30-11-18 13:06:57, Jan Stancek wrote:
> LTP proc01 testcase has been observed to rarely trigger crashes
> on arm64:
> page_mapped+0x78/0xb4
> stable_page_flags+0x27c/0x338
> kpageflags_read+0xfc/0x164
> proc_reg_read+0x7c/0xb8
> __vfs_read+0x58/0x178
> vfs_read+0x90/0x14c
> SyS_read+0x60/0xc0
> 
> Issue is that page_mapped() assumes that if compound page is not
> huge, then it must be THP. But if this is 'normal' compound page
> (COMPOUND_PAGE_DTOR), then following loop can keep running
> (for HPAGE_PMD_NR iterations) until it tries to read from memory
> that isn't mapped and triggers a panic:
> for (i = 0; i < hpage_nr_pages(page); i++) {
> if (atomic_read(&page[i]._mapcount) >= 0)
> return true;
>   }
> 
> I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
> with a custom kernel module [1] which:
> - allocates compound page (PAGEC) of order 1
> - allocates 2 normal pages (COPY), which are initialized to 0xff
>   (to satisfy _mapcount >= 0)
> - 2 PAGEC page structs are copied to address of first COPY page
> - second page of COPY is marked as not present
> - call to page_mapped(COPY) now triggers fault on access to 2nd
>   COPY page at offset 0x30 (_mapcount)
> 
> [1] 
> https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c
> 
> Fix the loop to iterate for "1 << compound_order" pages.

This is much less magic than the previous version. It is still not clear
to me how higher order pages other than THP get mapped to page tables
though. So more detailed information about the source would be really
welcome. Once we know that we can add a Fixes tag and also mark the
patch for stable because that sounds like stable material.

> Debugged-by: Laszlo Ersek 
> Suggested-by: "Kirill A. Shutemov" 
> Signed-off-by: Jan Stancek 

The patch looks sensible to me
Acked-by: Michal Hocko 

Thanks!

> ---
>  mm/util.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Changes in v2:
> - change the loop instead so we also check the mapcount of subpages
> 
> diff --git a/mm/util.c b/mm/util.c
> index 8bf08b5b5760..5c9c7359ee8a 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -478,7 +478,7 @@ bool page_mapped(struct page *page)
>   return true;
>   if (PageHuge(page))
>   return false;
> - for (i = 0; i < hpage_nr_pages(page); i++) {
> + for (i = 0; i < (1 << compound_order(page)); i++) {
>   if (atomic_read([i]._mapcount) >= 0)
>   return true;
>   }
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs
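
[Editor's note: why hpage_nr_pages() over-iterates for small compound pages.
The v4.20-era helper (see include/linux/huge_mm.h) looked roughly like this:]

static inline int hpage_nr_pages(struct page *page)
{
	if (unlikely(PageTransHuge(page)))
		return HPAGE_PMD_NR;	/* 512 with 4K pages and 2M PMDs */
	return 1;
}

[PageTransHuge() only checks PageHead(), so an order-1 compound page backed
by just two struct pages still makes the page_mapped() loop run HPAGE_PMD_NR
iterations and walk off the end of its page array, while
"1 << compound_order(page)" yields the actual number of subpages.]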


Re: [PATCH] mm: remove pte_lock_deinit()

2018-11-29 Thread Michal Hocko
On Wed 28-11-18 16:55:25, Yu Zhao wrote:
> Pagetable page doesn't touch page->mapping or have any used field
> that overlaps with it. No need to clear mapping in dtor. In fact,
> doing so might mask problems that otherwise would be detected by
> bad_page().

Yes, the layout of the structure has changed since Hugh introduced the
pte lock split.

> Signed-off-by: Yu Zhao 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h | 11 ++---------
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..7c8f4fc9244e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1900,13 +1900,6 @@ static inline bool ptlock_init(struct page *page)
>   return true;
>  }
>  
> -/* Reset page->mapping so free_pages_check won't complain. */
> -static inline void pte_lock_deinit(struct page *page)
> -{
> - page->mapping = NULL;
> - ptlock_free(page);
> -}
> -
>  #else /* !USE_SPLIT_PTE_PTLOCKS */
>  /*
>   * We use mm->page_table_lock to guard all pagetable pages of the mm.
> @@ -1917,7 +1910,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
>  }
>  static inline void ptlock_cache_init(void) {}
>  static inline bool ptlock_init(struct page *page) { return true; }
> -static inline void pte_lock_deinit(struct page *page) {}
> +static inline void ptlock_free(struct page *page) {}
>  #endif /* USE_SPLIT_PTE_PTLOCKS */
>  
>  static inline void pgtable_init(void)
> @@ -1937,7 +1930,7 @@ static inline bool pgtable_page_ctor(struct page *page)
>  
>  static inline void pgtable_page_dtor(struct page *page)
>  {
> - pte_lock_deinit(page);
> + ptlock_free(page);
>   __ClearPageTable(page);
>   dec_zone_page_state(page, NR_PAGETABLE);
>  }
> -- 
> 2.20.0.rc1.387.gf8505762e3-goog
> 

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 14:50:05, Linus Torvalds wrote:
> On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli  wrote:
> >
> > This difference can only happen with defrag=always, and that's not the
> > current upstream default.
> 
> Ok, thanks. That makes it a bit less critical.
> 
> > That MADV_HUGEPAGE causes fights with NUMA balancing is not great
> > indeed, qemu needs NUMA locality too, but then the badness caused by
> > __GFP_THISNODE was a larger regression in the worst case for qemu.
> [...]
> > So the short term alternative again would be the alternate patch that
> > does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.
> 
> Sounds like we should probably do this. Particularly since Vlastimil
> pointed out that we'd otherwise have issues with the back-port for 4.4
> where that "defrag=always" was the default.
> 
> The patch doesn't look horrible, and it directly addresses this
> particular issue.
> 
> Is there some reason we wouldn't want to do it?

We have discussed it previously and the biggest concern was that it
introduces a new GFP flag with a very weird and one-off semantic.
Anytime we have done that in the past it basically kicked back because
people have started to use such a flag and any further changes were
really hard to do. So I would really prefer some more systematic
solution. And I believe we can do that here. MADV_HUGEPAGE (resp. THP
always enabled) has gained a local memory policy with the patch which
got effectively reverted. I do believe that conflating "I want THP" with
"I want them local" is just wrong from the API point of view. There are
different classes of usecases which obviously disagree on the later.

So I believe that a long term solution should introduce a
MPOL_NODE_RECLAIM kind of policy. It would effectively reclaim local
nodes (within NODE_RECLAIM distance) before falling to other nodes.

Apart from that we need a less disruptive reclaim driven by compaction
and Mel is already working on that AFAIK.
-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 19:17:27, Michal Hocko wrote:
> On Tue 27-11-18 09:08:50, Linus Torvalds wrote:
> > On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
> >  wrote:
> > >
> > > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > > MADV_HUGEPAGE mappings")
> > 
> > Well, that's certainly noticeable and not good.
> > 
> > Andrea, I suspect it might be causing fights with auto numa migration..
> > 
> > Lots more system time, but also look at this:
> > 
> > >1122389 ±  9% +17.2%1315380 ±  4%  proc-vmstat.numa_hit
> > > 214722 ±  5% +21.6% 261076 ±  3%  
> > > proc-vmstat.numa_huge_pte_updates
> > >1108142 ±  9% +17.4%1300857 ±  4%  proc-vmstat.numa_local
> > > 145368 ± 48% +63.1% 237050 ± 17%  proc-vmstat.numa_miss
> > > 159615 ± 44% +57.6% 251573 ± 16%  proc-vmstat.numa_other
> > > 185.50 ± 81%   +8278.6%  15542 ± 40%  
> > > proc-vmstat.numa_pages_migrated
> > 
> > Should the commit be reverted? Or perhaps at least modified?
> 
> Well, the commit is trying to revert to the behavior before
> 5265047ac301 because there are real usecases that suffered from that
> change and bug reports as a result of that.
> 
> will-it-scale is certainly worth considering but it is an artificial
> testcase. A higher NUMA miss rate is an expected side effect of the
> patch because the fallback to a different NUMA node is more likely. The
> __GFP_THISNODE side effect is basically introducing node-reclaim
> behavior for THPages. Another thing is that there is no good behavior
> for everybody. Reclaim locally vs. THP on a remote node is hard to
> tell by default. We have discussed that at length and there were some
> conclusions. One of them is that we need a numa policy to tell whether
> an expensive locality is preferred over remote allocation.  Also we
> definitely need a better pro-active defragmentation to allow larger
> pages on a local node. This is a work in progress and this patch is a
> stop gap fix.

Btw. the associated discussion is 
http://lkml.kernel.org/r/20180925120326.24392-1-mho...@kernel.org

-- 
Michal Hocko
SUSE Labs


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 09:08:50, Linus Torvalds wrote:
> On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
>  wrote:
> >
> > FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> > to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > MADV_HUGEPAGE mappings")
> 
> Well, that's certainly noticeable and not good.
> 
> Andrea, I suspect it might be causing fights with auto numa migration..
> 
> Lots more system time, but also look at this:
> 
> >1122389 ±  9% +17.2%1315380 ±  4%  proc-vmstat.numa_hit
> > 214722 ±  5% +21.6% 261076 ±  3%  
> > proc-vmstat.numa_huge_pte_updates
> >1108142 ±  9% +17.4%1300857 ±  4%  proc-vmstat.numa_local
> > 145368 ± 48% +63.1% 237050 ± 17%  proc-vmstat.numa_miss
> > 159615 ± 44% +57.6% 251573 ± 16%  proc-vmstat.numa_other
> > 185.50 ± 81%   +8278.6%  15542 ± 40%  
> > proc-vmstat.numa_pages_migrated
> 
> Should the commit be reverted? Or perhaps at least modified?

Well, the commit is trying to revert to the behavior before
5265047ac301 because there are real usecases that suffered from that
change and bug reports as a result of that.

will-it-scale is certainly worth considering but it is an artificial
testcase. A higher NUMA miss rate is an expected side effect of the
patch because the fallback to a different NUMA node is more likely. The
__GFP_THISNODE side effect is basically introducing node-reclaim
behavior for THPages. Another thing is that there is no good behavior
for everybody. Reclaim locally vs. THP on a remote node is hard to
tell by default. We have discussed that at length and there were some
conclusions. One of them is that we need a numa policy to tell whether
a expensive localility is preferred over remote allocation.  Also we
definitely need a better pro-active defragmentation to allow larger
pages on a local node. This is a work in progress and this patch is a
stop gap fix.

-- 
Michal Hocko
SUSE Labs


Re: [PATCHi v2] mm: put_and_wait_on_page_locked() while page is migrated

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 16:49:47, Christopher Lameter wrote:
> On Tue, 27 Nov 2018, Mike Rapoport wrote:
> 
> > >  * @page: The page to wait for.
> > >  *
> > >  * The caller should hold a reference on @page.  They expect the page to
> > >  * become unlocked relatively soon, but do not wish to hold up migration
> > >  * (for example) by holding the reference while waiting for the page to
> > >  * come unlocked.  After this function returns, the caller should not
> > >  * dereference @page.
> > >  */
> >
> > How about:
> >
> > They expect the page to become unlocked relatively soon, but they can wait
> > for the page to come unlocked without holding the reference, to allow
> > other users of the @page (for example migration) to continue.
> 
> All of this seems a bit strange and it seems unnecessary? Maybe we need a
> better explanation?
> 
> A process has no refcount on a page struct and is waiting for it to become
> unlocked? Why? Should it not simply ignore that page and continue? It
> cannot possibly do anything with the page since it does not hold a
> refcount.

So do you suggest busy waiting on the page under migration?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: warn only once if page table misaccounting is detected

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 15:36:38, Heiko Carstens wrote:
> On Tue, Nov 27, 2018 at 02:19:16PM +0100, Michal Hocko wrote:
> > On Tue 27-11-18 09:36:03, Heiko Carstens wrote:
> > > Use pr_alert_once() instead of pr_alert() if page table misaccounting
> > > has been detected.
> > > 
> > > If this happens once it is very likely that there will be numerous
> > > other occurrences as well, which would flood dmesg and the console with
> > > hardly any added information. Therefore print the warning only once.
> > 
> > Have you actually experienced a flood of these messages? Is one per mm
> > message really that much?
> 
> Yes, I did. Since in this case all compat processes caused the message
> to appear, I saw thousands of these messages.

This means something went colossally wrong and seeing an avalanche of
messages might be actually helpful because you can at least see the
pattern. I wonder whether the underlying issue would be obvious from a
single instance.

Maybe we want ratelimit instead?
 
> > If yes, why do rss counters not exhibit the same problem?
> 
> No rss counter messages appeared. Or do you suggest that the other
> pr_alert() within check_mm() should also be changed?

Whatever we go with (and I do not have a strong opinion here) we should
be consistent I believe.

-- 
Michal Hocko
SUSE Labs
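
[Editor's note: the ratelimit alternative mentioned above would look roughly
like the sketch below; pr_alert_ratelimited() is the stock
printk_ratelimited() wrapper for the alert level.]

	if (mm_pgtables_bytes(mm))
		pr_alert_ratelimited("BUG: non-zero pgtables_bytes on freeing mm: %ld\n",
				     mm_pgtables_bytes(mm));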


Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 07:50:08, William Kucharski wrote:
> 
> 
> > On Nov 27, 2018, at 6:17 AM, Michal Hocko  wrote:
> > 
> > This is only about the process wide flag to disable THP. I do not see
> > how this can be alignment related. I suspect you wanted to ask in the
> > smaps patch?
> 
> No, answered below.
> 
> > 
> >> I'm having to deal with both these issues in the text page THP
> >> prototype I've been working on for some time now.
> > 
> > Could you be more specific about the issue and how the alignment comes
> > into the game? The only thing I can think of is to not report VMAs
> > smaller than the THP as eligible. Is this what you are looking for?
> 
> Basically, if the faulting VA is one that cannot be mapped with a THP
> due to alignment or size constraints, it may be "eligible" for THP
> mapping but ultimately can't be.
> 
> I was just double checking that this was meant to be more of a check done
> before code elsewhere performs additional checks and does the actual THP
> mapping, not an all-encompassing go/no go check for THP mapping.

I am still not sure I follow you completely here. This just reports
per-task eligibility. The system wide eligibility is reported via sysfs
and the per vma eligibility is reported via /proc/<pid>/smaps.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: warn only once if page table misaccounting is detected

2018-11-27 Thread Michal Hocko
On Tue 27-11-18 09:36:03, Heiko Carstens wrote:
> Use pr_alert_once() instead of pr_alert() if page table misaccounting
> has been detected.
> 
> If this happens once it is very likely that there will be numerous
> other occurrences as well, which would flood dmesg and the console with
> hardly any added information. Therefore print the warning only once.

Have you actually experienced a flood of these messages? Is one per mm
message really that much? If yes, why do rss counters not exhibit the
same problem?

> Cc: Kirill A. Shutemov 
> Cc: Martin Schwidefsky 
> Signed-off-by: Heiko Carstens 
> ---
>  kernel/fork.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 07cddff89c7b..c887e9eba89f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -647,8 +647,8 @@ static void check_mm(struct mm_struct *mm)
>   }
>  
>   if (mm_pgtables_bytes(mm))
> - pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n",
> - mm_pgtables_bytes(mm));
> + pr_alert_once("BUG: non-zero pgtables_bytes on freeing mm: %ld\n",
> +   mm_pgtables_bytes(mm));
>  
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>   VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
> -- 
> 2.16.4

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc

2018-11-27 Thread Michal Hocko
On Mon 26-11-18 17:33:32, William Kucharski wrote:
> 
> 
> This determines whether the page can theoretically be THP-mapped, but
> is the intention to also check for proper alignment and/or preexisting
> PAGESIZE page cache mappings for the address range?

This is only about the process wide flag to disable THP. I do not see
how this can be alignment related. I suspect you wanted to ask in the
smaps patch?

> I'm having to deal with both these issues in the text page THP
> prototype I've been working on for some time now.

Could you be more specific about the issue and how the alignment comes
into the game? The only thing I can think of is to not report VMAs
smaller than the THP as eligible. Is this what you are looking for?
-- 
Michal Hocko
SUSE Labs


Re: [PATCHi v2] mm: put_and_wait_on_page_locked() while page is migrated

2018-11-26 Thread Michal Hocko
On Mon 26-11-18 11:27:07, Hugh Dickins wrote:
[...]
> @@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
>   wake_up_page_bit(page, bit);
>  }
>  
> +/*
> + * A choice of three behaviors for wait_on_page_bit_common():
> + */
> +enum behavior {
> + EXCLUSIVE,  /* Hold ref to page and take the bit when woken, like
> +  * __lock_page() waiting on then setting PG_locked.
> +  */
> + SHARED, /* Hold ref to page and check the bit when woken, like
> +  * wait_on_page_writeback() waiting on PG_writeback.
> +  */
> + DROP,   /* Drop ref to page before wait, no check when woken,
> +  * like put_and_wait_on_page_locked() on PG_locked.
> +  */
> +};

I like this. It makes the semantics much more clear.

Thanks!
-- 
Michal Hocko
SUSE Labs
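
[Editor's note: the DROP behavior backs the new put_and_wait_on_page_locked()
helper. Its intended call pattern in __migration_entry_wait(), simplified
from the posted series, is roughly:]

	page = migration_entry_to_page(entry);
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	/*
	 * The reference is dropped before sleeping, so the waiter no longer
	 * blocks migration of the very page it is waiting for.
	 */
	put_and_wait_on_page_locked(page);
	return;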


Re: [RFC PATCH 1/5] mm: print more information about mapping in __dump_page

2018-11-25 Thread Michal Hocko
On Fri 23-11-18 16:04:04, Andrew Morton wrote:
> On Wed,  7 Nov 2018 11:18:26 +0100 Michal Hocko  wrote:
> 
> > From: Michal Hocko 
> > 
> > __dump_page prints the mapping pointer but that is quite unhelpful
> > for many reports because the pointer itself only helps to distinguish
> > anon/ksm mappings from other ones (because of lowest bits
> > set). Sometimes it would be much more helpful to know what kind of
> > mapping it actually is, and if we know this is a file mapping then also
> > try to resolve the dentry name.
> > 
> > ...
> >
> > --- a/mm/debug.c
> > +++ b/mm/debug.c
> >
> > ...
> >
> > @@ -70,6 +71,18 @@ void __dump_page(struct page *page, const char *reason)
> > if (PageCompound(page))
> > pr_cont(" compound_mapcount: %d", compound_mapcount(page));
> > pr_cont("\n");
> > +   if (PageAnon(page))
> > +   pr_emerg("anon ");
> > +   else if (PageKsm(page))
> > +   pr_emerg("ksm ");
> > +   else if (mapping) {
> > +   pr_emerg("%ps ", mapping->a_ops);
> > +   if (mapping->host->i_dentry.first) {
> > +   struct dentry *dentry;
> > +   dentry = container_of(mapping->host->i_dentry.first, struct dentry, d_u.d_alias);
> > +   pr_emerg("name:\"%*s\" ", dentry->d_name.len, dentry->d_name.name);
> > +   }
> > +   }
> 
> There has to be a better way of printing the filename.  It is so often
> needed.
> 
> The (poorly named and gleefully undocumented)
> take_dentry_name_snapshot() looks promising.  However it's unclear that
> __dump_page() is always called from contexts where
> take_dentry_name_snapshot() and release_dentry_name_snapshot() can be
> safely called.  Probably it's OK, but how to guarantee it?

http://lkml.kernel.org/r/20181125080834.gb12...@dhcp22.suse.cz as
suggested by Tetsuo?
-- 
Michal Hocko
SUSE Labs
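
[Editor's note: the take_dentry_name_snapshot() variant Andrew suggests would
look roughly like the sketch below; the locking-context caveat he raises
still applies, since __dump_page() can be called from almost anywhere.]

	struct name_snapshot name;

	take_dentry_name_snapshot(&name, dentry);
	pr_emerg("name:\"%s\" ", name.name);
	release_dentry_name_snapshot(&name);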


Re: [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 16:07:06, Vlastimil Babka wrote:
> On 11/20/18 11:35 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > Userspace falls short when trying to find out whether a specific memory
> > range is eligible for THP. There are usecases that would like to know
> > that
> > http://lkml.kernel.org/r/alpine.deb.2.21.1809251248450.50...@chino.kir.corp.google.com
> > : This is used to identify heap mappings that should be able to fault thp
> > : but do not, and they normally point to a low-on-memory or fragmentation
> > : issue.
> > 
> > The only way to deduce this now is to query for hg resp. nh flags and
> > confront the state with the global setting. Except that there is
> > also PR_SET_THP_DISABLE that might change the picture. So the final
> > logic is not trivial. Moreover the eligibility of the vma depends on
> > the type of VMA as well. In the past we have supported only anonymous
> > memory VMAs but things have changed and shmem based vmas are supported
> > as well these days and the query logic gets even more complicated
> > because the eligibility depends on the mount option and another global
> > configuration knob.
> > 
> > Simplify the current state and report the THP eligibility in
> > /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled
> > for this purpose. The original implementation of this function assumes
> > that the caller knows that the vma itself is supported for THP so make
> > the core checks into __transparent_hugepage_enabled and use it for
> > existing callers. __show_smap just uses the new transparent_hugepage_enabled
> > which also checks the vma support status (please note that this one has
> > to be out of line due to include dependency issues).
> > 
> > Signed-off-by: Michal Hocko 
> 
> Not thrilled by this,

Any specific concern?

> but kernel is always better suited to report this,
> than userspace piecing it together from multiple sources, relying on
> possibly outdated knowledge of kernel implementation details...

yep.

> Acked-by: Vlastimil Babka 

Thanks!

> A nitpick:
> 
> > ---
> >  Documentation/filesystems/proc.txt |  3 +++
> >  fs/proc/task_mmu.c |  2 ++
> >  include/linux/huge_mm.h| 13 ++++++++++++-
> >  mm/huge_memory.c   | 12 +++++++++++-
> >  mm/memory.c|  4 ++--
> >  5 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index b1fda309f067..06562bab509a 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -425,6 +425,7 @@ SwapPss:   0 kB
> >  KernelPageSize:    4 kB
> >  MMUPageSize:   4 kB
> >  Locked:0 kB
> > +THPeligible:   0
> 
> I would use THP_Eligible. There are already fields with underscore in smaps.

I do not feel strongly. I will wait for more comments and see whether
there is some consensus.

-- 
Michal Hocko
SUSE Labs
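
[Editor's note: the shape of the split described above, reconstructed from
the discussion (a sketch; see the posted patch for the exact code).
__transparent_hugepage_enabled() keeps the original global/per-vma flag
checks, while the new out-of-line wrapper adds the vma-type check so smaps
can call it for any VMA:]

bool transparent_hugepage_enabled(struct vm_area_struct *vma)
{
	if (vma_is_anonymous(vma))
		return __transparent_hugepage_enabled(vma);

	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
		return __transparent_hugepage_enabled(vma);

	return false;
}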


Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 13:51:57, David Hildenbrand wrote:
> On 23.11.18 13:42, Michal Hocko wrote:
> > On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
[...]
> >> It is not memory that the system can use.
> > 
> > same as bootmem ;)
> 
> Fair enough, just saying that it represents a change :)
> 
> (but people also already complained if their VM has XGB but they don't
> see actual XGB as total memory e.g. due to the crash kernel size)

I can imagine. I have seen many "where's my memory dude" questions... We
have so many unaccounted usages that it is simply impossible to see the
full picture of where the memory is consumed. The current implementation
would account memmaps in unreclaimable slabs but you still do not know
how much was spent for it...
 
> >> I also guess that if there is a strong opinion on this, we could create
> >> a counter, something like NR_VMEMMAP_PAGES, and show it under 
> >> /proc/meminfo.
> > 
> > Do we really have to? Isn't the number quite obvious from the size of
> > the hotplugged memory?
> 
> At least the size of the vmemmaps cannot be reliably calculated from "MemTotal".
> But maybe based on something else. (there, it is indeed obvious)

Everybody knows the struct page size obviously :p and the rest is a
simple exercise. But more seriously, I see what you are saying. We do
not have a good counter now and the patch doesn't improve that. But I
guess this is a separate discussion.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/4] mm, memory_hotplug: provide a more generic restrictions for memory hotplug

2018-11-23 Thread Michal Hocko
[Cc Alexander - email thread starts 
http://lkml.kernel.org/r/20181116101222.16581-1-osalva...@suse.com]

On Fri 16-11-18 11:12:20, Oscar Salvador wrote:
> From: Michal Hocko 
> 
> arch_add_memory, __add_pages take a want_memblock which controls whether
> the newly added memory should get the sysfs memblock user API (e.g.
> ZONE_DEVICE users do not want/need this interface). Some callers even
> want to control where we allocate the memmap from by configuring
> altmap. This is currently done in quite an ugly way by searching for
> altmap down in memory hotplug (to_vmem_altmap). It should be the caller
> that provides the altmap down the call chain.
> 
> Add a more generic hotplug context for arch_add_memory and __add_pages.
> struct mhp_restrictions contains flags which contains additional
> features to be enabled by the memory hotplug (MHP_MEMBLOCK_API
> currently) and altmap for alternative memmap allocator.

One note here as well. In retrospect, the API I have come up with here
is quite hackish. Considering the recent discussion about
special needs ZONE_DEVICE has for both initialization and struct page
allocations with Alexander Duyck I believe we wanted a more abstracted
API with allocator and constructor callbacks. This would allow different
usecases to fine tune their needs without specialcasing deep in the core
hotplug code paths.
-- 
Michal Hocko
SUSE Labs
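
[Editor's note: the hotplug context described above, as introduced by the
series (slightly simplified):]

struct mhp_restrictions {
	unsigned long flags;		/* MHP_MEMBLOCK_API and friends */
	struct vmem_altmap *altmap;	/* alternative allocator for memmap */
};

[Callers of arch_add_memory()/__add_pages() fill this in instead of passing a
bare want_memblock boolean and having the altmap looked up deep inside the
hotplug code.]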


Re: [RFC PATCH 0/4] mm, memory_hotplug: allocate memmap from hotadded memory

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
> > 1. How are we going to present such memory to the system statistics?
> > 
> > In my opinion, this vmemmap memory should
> > a) still account to total memory
> > b) show up as allocated
> > 
> > So just like before.
> 
> No, it does not show up under total memory, nor as allocated memory.
> This memory is not for use for anything but for creating the pagetables
> for the memmap array for the section/s.

I haven't read through your patches yet but wanted to clarify a few
points here.

This should essentially follow the bootmem allocated memory pattern. So
it is present and accounted to spanned pages but it is not managed.

> It is not memory that the system can use.

same as bootmem ;)
 
> I also guess that if there is a strong opinion on this, we could create
> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.

Do we really have to? Isn't the number quite obvious from the size of
the hotplugged memory?

> 
> > 2. Is this optional, in other words, can a device driver decide to not
> > to it like that?
> 
> Right now, it is a per-arch setup.
> For example, x86_64/powerpc/arm64 will do it unconditionally.
> 
> If we want to make this a per device-driver thing, I guess that we could
> allow passing a flag to add_memory()->add_memory_resource(), and there
> unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.

I believe we will need to make this opt-in. There are some usecases
which hotplug expensive (per size) memory and it would be too wasteful
to use it for struct pages. I haven't bothered to address
that with my previous patches because I just wanted to make the damn
thing work first.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v15 2/2] Add oom victim's memcg to the oom context information

2018-11-22 Thread Michal Hocko
On Wed 21-11-18 19:29:59, ufo19890...@gmail.com wrote:
> From: yuzhoujian 
> 
> The current oom report doesn't display the victim's memcg context during
> the global OOM situation. While this information is not strictly needed, it
> can be really helpful for containerized environments to locate which
> container has lost a process. Now that we have a single line for the oom
> context, we can trivially add both the oom memcg (this can be either
> global_oom or a specific memcg which hits its hard limit) and task_memcg
> which is the victim's memcg.
> 
> Below is the single line output in the oom report after this patch.
> - global oom context information:
> oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
> - memcg oom context information:
> oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
> 
> Signed-off-by: yuzhoujian 

I thought I have acked this one already.
Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v15 1/2] Reorganize the oom report in dump_header

2018-11-22 Thread Michal Hocko
On Wed 21-11-18 19:29:58, ufo19890...@gmail.com wrote:
> From: yuzhoujian 
> 
> The OOM report contains several sections. The first one is the allocation
> context that has triggered the OOM. Then we have the cpuset context
> followed by the stack trace of the OOM path. The third one is the OOM
> memory information, followed by the current memory state of all system
> tasks. At last, we show the oom-eligible tasks and the information
> about the chosen oom victim.
> 
> One thing that makes parsing more awkward than necessary is that we do
> not have a single and easily parsable line about the oom context. This
> patch is reorganizing the oom report to
> 1) who invoked oom and what was the allocation request
> [  515.902945] tuned invoked oom-killer: 
> gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> 
> 2) OOM stack trace
> [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
> [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 
> 11/14/2016
> [  515.906821] Call Trace:
> [  515.908062]  dump_stack+0x5a/0x73
> [  515.909311]  dump_header+0x55/0x28c
> [  515.914260]  oom_kill_process+0x2d8/0x300
> [  515.916708]  out_of_memory+0x145/0x4a0
> [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
> [  515.919157]  __alloc_pages_nodemask+0x277/0x290
> [  515.920367]  filemap_fault+0x3d0/0x6c0
> [  515.921529]  ? filemap_map_pages+0x2b8/0x420
> [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
> [  515.923884]  __do_fault+0x20/0x80
> [  515.925032]  __handle_mm_fault+0xbc0/0xe80
> [  515.926195]  handle_mm_fault+0xfa/0x210
> [  515.927357]  __do_page_fault+0x233/0x4c0
> [  515.928506]  do_page_fault+0x32/0x140
> [  515.929646]  ? page_fault+0x8/0x30
> [  515.930770]  page_fault+0x1e/0x30
> 
> 3) OOM memory information
> [  515.958093] Mem-Info:
> [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
>  active_file:4402672 inactive_file:483963 isolated_file:1344
>  unevictable:0 dirty:4886753 writeback:0 unstable:0
>  slab_reclaimable:148442 slab_unreclaimable:18741
>  mapped:1347 shmem:1347 pagetables:58669 bounce:0
>  free:88663 free_pcp:0 free_cma:0
> ...
> 
> 4) current memory state of all system tasks
> [  516.079544] [744] 0   744 9211 1345   114688   82  
>0 systemd-journal
> [  516.082034] [787] 0   787317640   143360   92  
>0 lvmetad
> [  516.084465] [792] 0   792109301   110592  208  
>-1000 systemd-udevd
> [  516.086865] [   1199] 0  1199138660   131072  112  
>-1000 auditd
> [  516.089190] [   1222] 0  1222319901   110592  157  
>0 smartd
> [  516.091477] [   1225] 0  1225 4864   8581920   43  
>0 irqbalance
> [  516.093712] [   1226] 0  1226526120   258048  426  
>0 abrtd
> [  516.112128] [   1280] 0  1280   109774   55   299008  400  
>0 NetworkManager
> [  516.113998] [   1295] 0  129528817   3769632   24  
>0 ksmtuned
> [  516.144596] [  10718] 0 10718  2622484  1721372 15998976   267219  
>0 panic
> [  516.145792] [  10719] 0 10719  2622484  1164767  981811253576  
>0 panic
> [  516.146977] [  10720] 0 10720  2622484  1174361  990412853709  
>0 panic
> [  516.148163] [  10721] 0 10721  2622484  1209070 1019494454824  
>0 panic
> [  516.149329] [  10722] 0 10722  2622484  1745799 1477427291138  
>0 panic
> 
> 5) oom context (constraints and the chosen victim).
> oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
> 
> An admin can easily get the full oom context at a single line which
> makes parsing much easier.
> 
> Signed-off-by: yuzhoujian 

Looks good, finally
Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm, fault_around: do not take a reference to a locked page

2018-11-22 Thread Michal Hocko
On Wed 21-11-18 18:27:11, Hugh Dickins wrote:
> On Wed, 21 Nov 2018, Michal Hocko wrote:
> > On Tue 20-11-18 17:47:21, Hugh Dickins wrote:
> > > On Tue, 20 Nov 2018, Michal Hocko wrote:
> > > 
> > > > From: Michal Hocko 
> > > > 
> > > > filemap_map_pages takes a speculative reference to each page in the
> > > > range before it tries to lock that page. While this is correct it
> > > > also can influence page migration which will bail out when seeing
> > > > an elevated reference count. The faultaround code would bail on
> > > > seeing a locked page so we can pro-actively check the PageLocked
> > > > bit before page_cache_get_speculative and prevent pointless
> > > > reference count churn.
> > > > 
> > > > Cc: "Kirill A. Shutemov" 
> > > > Suggested-by: Jan Kara 
> > > > Signed-off-by: Michal Hocko 
> > > 
> > > Acked-by: Hugh Dickins 
> > 
> > Thanks!
> > 
> > > though I think this patch is more useful to avoid the atomic ops,
> > > and unnecessary dirtying of the cacheline, than to avoid the very
> > > transient elevation of refcount, which will not affect page migration
> > > very much.
> > 
> > Are you sure it would really be transient? In other words is it possible
> > that the fault around can block migration repeatedly under a refault-heavy
> > workload? I just couldn't convince myself, to be honest.
> 
> I don't deny that it is possible: I expect that, using fork() (which does
> not copy the ptes in a shared file vma), you can construct a test case
> where each child faults one or another page near a page of no interest,
> and that page of no interest is a target of migration perpetually
> frustrated by filemap_map_pages()'s briefly raised refcount.

The other issue I am debugging, which very likely has the same
underlying cause in the end, has shown
[  883.930477] rac1 kernel: page:ffffea2084bf5cc0 count:1889 mapcount:1887 mapping:ffff8833c82c9ad8 index:0x6b
[  883.930485] rac1 kernel: ext4_da_aops [ext4]
[  883.930497] rac1 kernel: name:"libc-2.22.so"
[  883.931241] rac1 kernel: do_migrate_range done ret=23

pattern. After we have disabled the faultaround the failure has moved to
a different page but libc hasn't shown up again. This might be a matter
of (bad) luck and timing. But we thought that it is not too unlikely for
faultaround to kick in on such a shared page.

> But I suggest that's a third-order effect: well worth fixing because
> it's easily and uncontroversially dealt with, as you have; but not of
> great importance.
> 
> The first-order effect is migration conspiring to defeat itself: that's
> what my put_and_wait_on_page_locked() patch, in other thread, is about.

yes. That is obviously a much more effective fix.

> The second order effect is when a page that is really wanted is waited
> on - the target of a fault, for which page refcount is raised maybe
> long before it finally gets into the page table (whereupon it becomes
> visible to try_to_unmap(), and its mapcount matches refcount so that
> migration can fully account for the page).  One class of that can be
> well dealt with by using put_and_wait_on_page_locked_killable() in
> lock_page_or_retry(), but I was keeping that as a future instalment.
> 
> But I shouldn't denigrate the transient case by referring so lightly
> to migrate_pages()' 10 attempts: each of those failed attempts can
> be very expensive, unmapping and TLB flushing (including IPIs) and
> remapping. It may well be that 2 or 3 would be a more cost-effective
> number of attempts, at least when the page is mapped.

If you want some update to the comment in this function or to the
changelog, I am open of course. Right now I have
+* Check for a locked page first, as a speculative
+* reference may adversely influence page migration.
as suggested by William.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps

2018-11-21 Thread Michal Hocko
On Wed 21-11-18 18:54:28, Mike Rapoport wrote:
> On Tue, Nov 20, 2018 at 11:35:13AM +0100, Michal Hocko wrote:
[...]
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 12a5e6e693b6..b1fda309f067 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
> > 
> >  Note that there is no guarantee that every flag and associated mnemonic 
> > will
> >  be present in all further kernel releases. Things get changed, the flags 
> > may
> > -be vanished or the reverse -- new added.
> > +be vanished or the reverse -- new added. Interpretatation of their meaning
> > +might change in future as well. So each consumnent of these flags have to
> 
>    consumer? has

fixed. Thanks!

-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-21 Thread Michal Hocko
On Mon 19-11-18 21:44:41, Hugh Dickins wrote:
[...]
> [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> 
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configured, but also to protect against
> reuse of the struct page going to someone who then holds the page locked
> indefinitely, when the waiter can reasonably expect timely unlocking.

I would add the following for the "problem statement". Feel free to
reuse per your preference:
"
An elevated reference count, however, stands in the way of migration and
forces it to fail with a bad timing. This is especially a problem for
memory offlining which retries for ever (or until the operation is
terminated from userspace) because a heavy refault workload can trigger
essentially an endless loop of migration failures. Therefore
__migration_entry_wait is essentially harmful for the event it is waiting
for.
"

> But in fact, so long as wait_on_page_bit_common() does the put_page(),
> and is careful not to rely on struct page contents thereafter, there is
> no need to hold a reference to the page while waiting on it.  That does
> mean that this case cannot go back through the loop: but that's fine for
> the page migration case, and even if used more widely, is limited by the
> "Stop walking if it's locked" optimization in wake_page_function().

I would appreciate it if this were more explicit about the existence of
the elevated-ref-count problem, but it reduces it to a tiny time window
compared to the whole time the waiter is blocked. So a great
improvement.

> Add interface put_and_wait_on_page_locked() to do this, using negative
> value of the lock arg to wait_on_page_bit_common() to implement it.
> No interruptible or killable variant needed yet, but they might follow:
> I have a vague notion that reporting -EINTR should take precedence over
> return from wait_on_page_bit_common() without knowing the page state,
> so arrange it accordingly - but that may be nothing but pedantic.
> 
> shrink_page_list()'s __ClearPageLocked(): that was a surprise!

and I can imagine a bad one. Do we really have to be so clever here?
The unlock_page went away in the name of performance (a978d6f521063)
and I would argue that this is a slow path where this is just not worth
it.

> this
> survived a lot of testing before that showed up.  It does raise the
> question: should is_page_cache_freeable() and __remove_mapping() now
> treat a PG_waiters page as if an extra reference were held?  Perhaps,
> but I don't think it matters much, since shrink_page_list() already
> had to win its trylock_page(), so waiters are not very common there: I
> noticed no difference when trying the bigger change, and it's surely not
> needed while put_and_wait_on_page_locked() is only for page migration.
> 
> Signed-off-by: Hugh Dickins 

The patch looks good to me - quite ugly but it doesn't make the existing
code much worse.

With the problem described Vlastimil fixed, feel free to add
Acked-by: Michal Hocko 

And thanks for a prompt patch. This is something I've been chasing for
quite some time. __migration_entry_wait came to my radar only recently
because this is an extremely volatile area.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4.4 131/160] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

2018-11-21 Thread Michal Hocko
On Tue 20-11-18 15:53:10, David Rientjes wrote:
> On Tue, 20 Nov 2018, Michal Hocko wrote:
> 
> > On Mon 19-11-18 14:16:24, David Rientjes wrote:
> > > On Mon, 19 Nov 2018, Greg Kroah-Hartman wrote:
> > > 
> > > > 4.4-stable review patch.  If anyone has any objections, please let me 
> > > > know.
> > > > 
> > > 
> > > As I noted when this patch was originally proposed and when I nacked 
> > > it[*] 
> > > because it causes a 13.9% increase in remote memory access latency and up 
> > > to 40% increase in remote memory allocation latency on much of our 
> > > software stack that uses MADV_HUGEPAGE after mremapping the text segment 
> > > to memory backed by hugepages, I don't think this is stable material.
> > 
> > There was a wider consensus that this is the most minimal fix for users
> > who see a regression introduced by 5265047ac301 ("mm, thp: really
> > limit transparent hugepage allocation to local node"). As it has been
> > discussed extensively there is no universal win but we should always opt
> > for the safer side, which this patch accomplishes. The changelog explains
> > them at length along with numbers. I am not happy that your
> > particular workload is suffering but this area certainly requires much
> > more changes to satisfy wider range of users.
> > 
> > > The 4.4 kernel is almost three years old and this changes the NUMA 
> > > locality of any user of MADV_HUGEPAGE.
> > 
> > Yes, and we have seen bug reports because we adopted this older kernel
> > only now.
> 
> I think the responsible thing to do would be to allow users to remain on
> their stable kernel that they know works, whether that's 4.4 or any of the 
> others this is proposed for, and downgrade from any current kernel release 
> that causes their workloads to have such severe regressions once they try 
> a kernel with this commit.

But we do know that there are people affected on 4.4 kernel. Besides
that we can revert in the stable tree as soon as we see bug reports on
new stable tree releases.

Really, there is no single proper behavior. It was a mistake to merge
5265047ac301. Since then we are in an unfortunate situation that some
workload might have started to depend on the new behavior.

But rather than repeating the previous long discussion I would call for
a new one which actually deals with fallouts. AFAIR there is a patch
series to reduce the fragmentation issues by Mel with a zero feedback so
far. I also think we should start discussing a new memory policy to
establish the semantic you are after.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm, fault_around: do not take a reference to a locked page

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 17:47:21, Hugh Dickins wrote:
> On Tue, 20 Nov 2018, Michal Hocko wrote:
> 
> > From: Michal Hocko 
> > 
> > filemap_map_pages takes a speculative reference to each page in the
> > range before it tries to lock that page. While this is correct it
> > also can influence page migration which will bail out when seeing
> > an elevated reference count. The faultaround code would bail on
> > seeing a locked page so we can pro-actively check the PageLocked
> > bit before page_cache_get_speculative and prevent pointless
> > reference count churn.
> > 
> > Cc: "Kirill A. Shutemov" 
> > Suggested-by: Jan Kara 
> > Signed-off-by: Michal Hocko 
> 
> Acked-by: Hugh Dickins 

Thanks!

> though I think this patch is more useful to avoid the atomic ops,
> and unnecessary dirtying of the cacheline, than to avoid the very
> transient elevation of refcount, which will not affect page migration
> very much.

Are you sure it would really be transient? In other words is it possible
that the fault around can block migration repeatedly under a refault-heavy
workload? I just couldn't convince myself, to be honest.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm, fault_around: do not take a reference to a locked page

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 21:51:39, William Kucharski wrote:
> 
> 
> > On Nov 20, 2018, at 7:12 AM, Michal Hocko  wrote:
> > 
> > +   /*
> > +* Check the locked pages before taking a reference to not
> > +* go in the way of migration.
> > +*/
> 
> Could you make this a tiny bit more explanative, something like:
> 
> + /*
> +  * Check for a locked page first, as a speculative
> +  * reference may adversely influence page migration.
> +  */

sure

> 
> Reviewed-by: William Kucharski 

Thanks!

-- 
Michal Hocko
SUSE Labs
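
[Editor's note: with William's comment wording folded in, the discussed check
in filemap_map_pages() ends up roughly as below (a sketch of the relevant
lines only; the posted patch operates on the compound head):]

		/*
		 * Check for a locked page first, as a speculative
		 * reference may adversely influence page migration.
		 */
		if (PageLocked(page))
			goto next;
		if (!page_cache_get_speculative(page))
			goto next;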


Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 10:32:07, Dan Williams wrote:
> On Tue, Nov 20, 2018 at 2:35 AM Michal Hocko  wrote:
> >
> > From: Michal Hocko 
> >
> > Even though vma flags exported via /proc/<pid>/smaps are explicitly
> > documented to be not guaranteed for future compatibility the warning
> > doesn't go far enough because it doesn't mention semantic changes to
> > those flags. And they are important as well because these flags are
> > a deep implementation internal to the MM code and the semantic might
> > change at any time.
> >
> > Let's consider two recent examples:
> > http://lkml.kernel.org/r/20181002100531.gc4...@quack2.suse.cz
> > : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > : mean time certain customer of ours started poking into /proc/<pid>/smaps
> > : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > : flags, the application just fails to start complaining that DAX support is
> > : missing in the kernel.
> >
> > http://lkml.kernel.org/r/alpine.deb.2.21.1809241054050.224...@chino.kir.corp.google.com
> > : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > : introduced a regression in that userspace cannot always determine the set
> > : of vmas where thp is ineligible.
> > : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> > : to determine if a vma is eligible to be backed by hugepages.
> > : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> > : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> > : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> > : flag and "nh" is not emitted.
> > : This causes smaps parsing libraries to assume a vma is eligible for thp
> > : and ends up puzzling the user on why its memory is not backed by thp.
> >
> > In both cases userspace was relying on a semantic of a specific VMA
> > flag. The primary reason why that happened is a lack of a proper
> > interface. While this has been worked on and it will be fixed properly,
> > it seems that our wording could see some refinement and be more vocal
> > about the semantic aspect of these flags as well.
> >
> > Cc: Jan Kara 
> > Cc: Dan Williams 
> > Cc: David Rientjes 
> > Signed-off-by: Michal Hocko 
> > ---
> >  Documentation/filesystems/proc.txt | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 12a5e6e693b6..b1fda309f067 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -496,7 +496,9 @@ flags associated with the particular virtual memory 
> > area in two letter encoded
> >
> >  Note that there is no guarantee that every flag and associated mnemonic 
> > will
> >  be present in all further kernel releases. Things get changed, the flags 
> > may
> > -be vanished or the reverse -- new added.
> > +be vanished or the reverse -- new added. Interpretation of their meaning
> > +might change in future as well. So each consumer of these flags has to
> > +follow each specific kernel version for the exact semantics.
> 
> Can we start to claw some of this back? Perhaps with a config option
> to hide the flags to put applications on notice?

I would love to. My knowledge of CRIU is very minimal, but my
understanding is that this is the primary consumer of those flags. And
checkpointing is so close to the specific kernel version that I assume
that this abuse is somehow justified. We can hide it behind
CONFIG_CHECKPOINT_RESTORE but is it going to help? I presume that many
distro kernels will have the config enabled.
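
Just to illustrate the idea (an untested sketch; show_smap_vma_flags is
the real function in fs/proc/task_mmu.c, but the gating itself is purely
hypothetical, not something I am proposing here):

 static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 {
+	/* hypothetical: emit the raw VMA flags only on CR-enabled kernels */
+	if (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE))
+		return;
+
 	...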

> I recall that when I
> introduced CONFIG_IO_STRICT_DEVMEM it caused enough regressions that
> distros did not enable it, but now a few years out I'm finding that it
> is enabled in more places.
> 
> In any event,
> 
> Acked-by: Dan Williams 

Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc//smaps

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 16:01:47, David Rientjes wrote:
> On Tue, 20 Nov 2018, Jan Kara wrote:
> 
> > > Even though vma flags exported via /proc//smaps are explicitly
> > > documented to be not guaranteed for future compatibility the warning
> > > doesn't go far enough because it doesn't mention semantic changes to
> > > those flags. And they are important as well because these flags are
> > > a deep implementation internal to the MM code and the semantic might
> > > change at any time.
> > > 
> > > Let's consider two recent examples:
> > > http://lkml.kernel.org/r/20181002100531.gc4...@quack2.suse.cz
> > > : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" 
> > > has
> > > : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in 
> > > the
> > > : mean time certain customer of ours started poking into /proc//smaps
> > > : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > > : flags, the application just fails to start complaining that DAX support 
> > > is
> > > : missing in the kernel.
> > > 
> > > http://lkml.kernel.org/r/alpine.deb.2.21.1809241054050.224...@chino.kir.corp.google.com
> > > : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > > : introduced a regression in that userspace cannot always determine the 
> > > set
> > > : of vmas where thp is ineligible.
> > > : Userspace relies on the "nh" flag being emitted as part of 
> > > /proc/pid/smaps
> > > : to determine if a vma is eligible to be backed by hugepages.
> > > : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> > > : be disabled and emit "nh" as a flag for the corresponding vmas as part 
> > > of
> > > : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> > > : flag and "nh" is not emitted.
> > > : This causes smaps parsing libraries to assume a vma is eligible for thp
> > > : and ends up puzzling the user on why its memory is not backed by thp.
> > > 
> > > In both cases userspace was relying on a semantic of a specific VMA
> > > flag. The primary reason why that happened is a lack of a proper
> > > interface. While this has been worked on and it will be fixed properly,
> > > it seems that our wording could see some refinement and be more vocal
> > > about the semantic aspect of these flags as well.
> > > 
> > > Cc: Jan Kara 
> > > Cc: Dan Williams 
> > > Cc: David Rientjes 
> > > Signed-off-by: Michal Hocko 
> > 
> > Honestly, it just shows that no amount of documentation is going to stop
> > userspace from abusing API that's exposing too much if there's no better
> > alternative. But this is a good clarification regardless. So feel free to
> > add:
> > 
> > Acked-by: Jan Kara 
> > 
> 
> I'm not sure what is expected of a userspace developer who finds they have 
> a single way to determine if something is enabled/disabled.  Should they 
> refer to the documentation and see that the flag may be unstable so they 
> write a kernel patch and have it merged upstream before using it?  What to 
> do when they don't control the kernel version they are running on?

Well, I would treat it as any standard feature request. Ask for the
feature upstream and work with the community to come up with a reasonable
and stable API.

> Anyway, mentioning that the vm flags here only have meaning depending on 
> the kernel version seems like a worthwhile addition:
> 
> Acked-by: David Rientjes 

Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/3] mm, memory_hotplug: deobfuscate migration part of offlining

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 16:13:35, Oscar Salvador wrote:
> 
> > Signed-off-by: Michal Hocko 
> [...]
> > +   do {
> > +   for (pfn = start_pfn; pfn;)
> > +   {
> > +   /* start memory hot removal */
> 
> Should we change that comment? I mean, this is not really the hot-
> removal stage.
> 
> Maybe "start memory migration" suits better? or memory offlining?

I will just drop it. It doesn't really explain anything.
[...]
> 
> This indeed looks much nicer and it is easier to follow.
> With the changes commented by David:
> 
> Reviewed-by: Oscar Salvador 

Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/3] mm, memory_hotplug: try to migrate full section worth of pages

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 15:51:32, Oscar Salvador wrote:
> On Tue, 2018-11-20 at 14:43 +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > do_migrate_range has been limiting the number of pages to migrate to
> > 256
> > for some reason which is not documented. 
> 
> When looking back at old memory-hotplug commits one feels pretty sad
> about the brevity of the changelogs.

Well, things evolve and we've become much more careful about changelogs
over time. It still takes quite a lot of time to push back on changelogs
even these days though. People still keep forgetting that "what" is not
as important as "why" because the former is usually quite easy to
understand from reading the diff. The intention behind it is usually what
gets forgotten after years. I guess people realize this much more after
a few git blame excavation tours.

> > Signed-off-by: Michal Hocko 
> 
> Reviewed-by: Oscar Salvador 

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/3] mm, memory_hotplug: deobfuscate migration part of offlining

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 15:26:43, David Hildenbrand wrote:
[...]
> > +   do {
> > +   for (pfn = start_pfn; pfn;)
> > +   {
> 
> { on a new line looks weird.
> 
> > +   /* start memory hot removal */
> > +   ret = -EINTR;
> 
> I think we can move that into the "if (signal_pending(current))"
> 
> (if my eyes are not wrong, this will not be touched otherwise)

Better?

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9cd161db3061..6bc3aee30f5e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1592,11 +1592,10 @@ static int __ref __offline_pages(unsigned long 
start_pfn,
}
 
do {
-   for (pfn = start_pfn; pfn;)
-   {
+   for (pfn = start_pfn; pfn;) {
/* start memory hot removal */
-   ret = -EINTR;
if (signal_pending(current)) {
+   ret = -EINTR;
reason = "signal backoff";
    goto failed_removal_isolated;
}
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm, fault_around: do not take a reference to a locked page

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 17:17:00, Kirill A. Shutemov wrote:
> On Tue, Nov 20, 2018 at 03:12:07PM +0100, Michal Hocko wrote:
> > On Tue 20-11-18 17:07:15, Kirill A. Shutemov wrote:
> > > On Tue, Nov 20, 2018 at 02:43:23PM +0100, Michal Hocko wrote:
> > > > From: Michal Hocko 
> > > > 
> > > > filemap_map_pages takes a speculative reference to each page in the
> > > > range before it tries to lock that page. While this is correct it
> > > > also can influence page migration which will bail out when seeing
> > > > an elevated reference count. The faultaround code would bail on
> > > > seeing a locked page so we can pro-actively check the PageLocked
> > > > bit before page_cache_get_speculative and prevent from pointless
> > > > reference count churn.
> > > 
> > > Looks fine to me.
> > 
> > Thanks for the review.
> > 
> > > But please drop a line of comment in the code. As is it might be confusing
> > > for a reader.
> > 
> > This?
> 
> Yep.
> 
> Acked-by: Kirill A. Shutemov 

Cool, thanks! I will wait some more time for review feedback for other
patches and then repost with this folded in.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/3] mm, memory_hotplug: try to migrate full section worth of pages

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 15:18:41, David Hildenbrand wrote:
[...]
> (we could also check for pending signals inside that function if really
> required)

do_migrate_pages is not the proper layer to check signals, because the
loop only isolates pages and that is not expensive. The most expensive
part is deeper down in the migration core. We wait for the page lock or
writeback and that can take a long time. None of those waits is killable;
making them killable is a larger surgery, but something that we should
consider should there be any need to address this.
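
Should there ever be a need, the direction would be something like the
following in __unmap_and_move() (an untested sketch; lock_page() is what
the code does today, the killable variant is the hypothetical part):

	if (!trylock_page(page)) {
		if (!force || mode == MIGRATE_ASYNC)
			goto out;
		/* sketch: bail out on a fatal signal instead of sleeping
		 * in an unkillable state */
		if (lock_page_killable(page)) {
			rc = -EINTR;
			goto out;
		}
	}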

> Reviewed-by: David Hildenbrand 

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 3/3] mm, fault_around: do not take a reference to a locked page

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 17:07:15, Kirill A. Shutemov wrote:
> On Tue, Nov 20, 2018 at 02:43:23PM +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > filemap_map_pages takes a speculative reference to each page in the
> > range before it tries to lock that page. While this is correct it
> > also can influence page migration which will bail out when seeing
> > an elevated reference count. The faultaround code would bail on
> > seeing a locked page so we can pro-actively check the PageLocked
> > bit before page_cache_get_speculative and prevent from pointless
> > reference count churn.
> 
> Looks fine to me.

Thanks for the review.

> But please drop a line of comment in the code. As is it might be confusing
> for a reader.

This?

diff --git a/mm/filemap.c b/mm/filemap.c
index c76d6a251770..7c4e439a2e85 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2554,6 +2554,10 @@ void filemap_map_pages(struct vm_fault *vmf,
 
head = compound_head(page);
 
+   /*
+* Check the locked pages before taking a reference to not
+* go in the way of migration.
+*/
if (PageLocked(head))
goto next;
    if (!page_cache_get_speculative(head))
-- 
Michal Hocko
SUSE Labs


Re: Memory hotplug softlock issue

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 21:58:03, Baoquan He wrote:
> Hi,
> 
> On 11/20/18 at 02:38pm, Vlastimil Babka wrote:
> > On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > > 
> > > We have all assumed that it is essential to hold a page reference while
> > > waiting on a page lock: partly to guarantee that there is still a struct
> > > page when MEMORY_HOTREMOVE is configured, but also to protect against
> > > reuse of the struct page going to someone who then holds the page locked
> > > indefinitely, when the waiter can reasonably expect timely unlocking.
> > > 
> > > But in fact, so long as wait_on_page_bit_common() does the put_page(),
> > > and is careful not to rely on struct page contents thereafter, there is
> > > no need to hold a reference to the page while waiting on it.  That does
> > 
> > So there's still a moment where refcount is elevated, but hopefully
> > short enough, right? Let's see if it survives Baoquan's stress testing.
> 
> Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on
> that machine, after many times of hot removing/adding, the endless
> looping during migrating is not seen any more. The test result for
> Hugh's patch is positive. I even suggested Ping increase the memory
> pressure to "stress -m 250", and it still succeeded to offline and remove.
> 
> So I think this patch works to solve the issue. Thanks a lot for your
> help, all of you. 

This is great news! Thanks for your swift feedback. I will go and try
to review Hugh's patch soon.
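
If I read Hugh's patch correctly, the migration wait path then boils down
to (a sketch from my reading, not the patch itself):

-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);

in __migration_entry_wait(), so the waiter no longer pins the page that
migration is trying to free up.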

> Hugh, will you post a formal patch in a separate thread?
> 
> Meanwhile we found that sometimes onlining pages may not add back all memory
> blocks on one memory board; hot removing/adding them will then cause a
> kernel panic. I will investigate further and collect information, see if
> it's a kernel issue or udev issue.

It would be great to get a report in a new email thread.
> 
> Thanks
> Baoquan
> 
> > 
> > > mean that this case cannot go back through the loop: but that's fine for
> > > the page migration case, and even if used more widely, is limited by the
> > > "Stop walking if it's locked" optimization in wake_page_function().
> > > 
> > > Add interface put_and_wait_on_page_locked() to do this, using negative
> > > value of the lock arg to wait_on_page_bit_common() to implement it.
> > > No interruptible or killable variant needed yet, but they might follow:
> > > I have a vague notion that reporting -EINTR should take precedence over
> > > return from wait_on_page_bit_common() without knowing the page state,
> > > so arrange it accordingly - but that may be nothing but pedantic.
> > > 
> > > shrink_page_list()'s __ClearPageLocked(): that was a surprise! this
> > > survived a lot of testing before that showed up.  It does raise the
> > > question: should is_page_cache_freeable() and __remove_mapping() now
> > > treat a PG_waiters page as if an extra reference were held?  Perhaps,
> > > but I don't think it matters much, since shrink_page_list() already
> > > had to win its trylock_page(), so waiters are not very common there: I
> > > noticed no difference when trying the bigger change, and it's surely not
> > > needed while put_and_wait_on_page_locked() is only for page migration.
> > > 
> > > Signed-off-by: Hugh Dickins 
> > > ---
> > 
> > ...
> > 
> > > @@ -1100,6 +,17 @@ static inline int 
> > > wait_on_page_bit_common(wait_queue_head_t *q,
> > >   ret = -EINTR;
> > >   break;
> > >   }
> > > +
> > > + if (lock < 0) {
> > > + /*
> > > +  * We can no longer safely access page->flags:
> > 
> > Hmm...
> > 
> > > +  * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> > > +  * there is a risk of waiting forever on a page reused
> > > +  * for something that keeps it locked indefinitely.
> > > +  * But best check for -EINTR above before breaking.
> > > +  */
> > > + break;
> > > + }
> > >   }
> > >  
> > >   finish_wait(q, wait);
> > 
> > ... the code continues by:
> > 
> > if (thrashing) {
> > if (!PageSwapBacked(page))
> > 
> > So maybe we should not set 'thrashing' true when lock < 0?
> > 
> > Thanks!
> > Vlastimil

-- 
Michal Hocko
SUSE Labs


[RFC PATCH 1/3] mm, memory_hotplug: try to migrate full section worth of pages

2018-11-20 Thread Michal Hocko
From: Michal Hocko 

do_migrate_range has been limiting the number of pages to migrate to 256
for some reason which is not documented. Even if the limit made some
sense back then when it was introduced it doesn't really serve a good
purpose these days. If the range contains huge pages then we break out
of the loop too early and go through another round of LRU and pcp cache
draining and scan_movable_pages, which is quite suboptimal.

The only reason to limit the number of pages I can think of is to reduce
the potential time to react on the fatal signal. But even then the
number of pages is a questionable metric because even a single page
migration might block in a non-killable state (e.g. __unmap_and_move).

Remove the limit and offline the full requested range (this is one
memblock worth of pages with the current code). Should we ever get a
report that offlining takes too long to react on fatal signal then we
should rather fix the core migration to use killable waits and bailout
on a signal.

Signed-off-by: Michal Hocko 
---
 mm/memory_hotplug.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c82193db4be6..6263c8cd4491 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1339,18 +1339,16 @@ static struct page *new_node_page(struct page *page, 
unsigned long private)
	return new_page_nodemask(page, nid, &nmask);
 }
 
-#define NR_OFFLINE_AT_ONCE_PAGES   (256)
 static int
 do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 {
unsigned long pfn;
struct page *page;
-   int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
int not_managed = 0;
int ret = 0;
LIST_HEAD(source);
 
-   for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
+   for (pfn = start_pfn; pfn < end_pfn; pfn++) {
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
@@ -1362,8 +1360,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long 
end_pfn)
ret = -EBUSY;
break;
}
-		if (isolate_huge_page(page, &source))
-			move_pages -= 1 << compound_order(head);
+		isolate_huge_page(page, &source);
continue;
} else if (PageTransHuge(page))
pfn = page_to_pfn(compound_head(page))
@@ -1382,7 +1379,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long 
end_pfn)
if (!ret) { /* Success */
put_page(page);
			list_add_tail(&page->lru, &source);
-   move_pages--;
if (!__PageMovable(page))
inc_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
-- 
2.19.1



[RFC PATCH 0/3] few memory offlining enhancements

2018-11-20 Thread Michal Hocko
I have been chasing memory offlining not making progress recently. On
the way I have noticed a few weird decisions in the code. The migration
itself is restricted without a reasonable justification and the retry
loop around the migration is quite messy. This is addressed by patch 1
and patch 2.

Patch 3 targets the faultaround code which has been a hot
candidate for the initial issue reported upstream [1] and that I am
debugging internally. It turned out not to be the main contributor
in the end but I believe we should address it regardless. See the patch
description for more details.

[1] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv




[RFC PATCH 2/3] mm, memory_hotplug: deobfuscate migration part of offlining

2018-11-20 Thread Michal Hocko
From: Michal Hocko 

Memory migration might fail during offlining and we keep retrying in
that case. This is currently obfuscated by a goto retry loop. The code
is hard to follow and as a result it is even suboptimal because each
retry round scans the full range from start_pfn even though we have
successfully scanned/migrated the [start_pfn, pfn] range already. This
is all only because check_pages_isolated failure has to rescan the full
range again.

De-obfuscate the migration retry loop by promoting it to a real for
loop. In fact remove the goto altogether by making it a proper double
loop (yeah, gotos are nasty in this specific case). In the end we
will get slightly more optimal code which is more readable.

Signed-off-by: Michal Hocko 
---
 mm/memory_hotplug.c | 60 +++--
 1 file changed, 31 insertions(+), 29 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6263c8cd4491..9cd161db3061 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1591,38 +1591,40 @@ static int __ref __offline_pages(unsigned long 
start_pfn,
goto failed_removal_isolated;
}
 
-   pfn = start_pfn;
-repeat:
-   /* start memory hot removal */
-   ret = -EINTR;
-   if (signal_pending(current)) {
-   reason = "signal backoff";
-   goto failed_removal_isolated;
-   }
+   do {
+   for (pfn = start_pfn; pfn;)
+   {
+   /* start memory hot removal */
+   ret = -EINTR;
+   if (signal_pending(current)) {
+   reason = "signal backoff";
+   goto failed_removal_isolated;
+   }
 
-   cond_resched();
-   lru_add_drain_all();
-   drain_all_pages(zone);
+   cond_resched();
+   lru_add_drain_all();
+   drain_all_pages(zone);
 
-   pfn = scan_movable_pages(start_pfn, end_pfn);
-   if (pfn) { /* We have movable pages */
-   ret = do_migrate_range(pfn, end_pfn);
-   goto repeat;
-   }
+   pfn = scan_movable_pages(pfn, end_pfn);
+   if (pfn) {
+   /* TODO fatal migration failures should bail 
out */
+   do_migrate_range(pfn, end_pfn);
+   }
+   }
+
+   /*
+* dissolve free hugepages in the memory block before doing 
offlining
+* actually in order to make hugetlbfs's object counting 
consistent.
+*/
+   ret = dissolve_free_huge_pages(start_pfn, end_pfn);
+   if (ret) {
+   reason = "failure to dissolve huge pages";
+   goto failed_removal_isolated;
+   }
+   /* check again */
+   offlined_pages = check_pages_isolated(start_pfn, end_pfn);
+   } while (offlined_pages < 0);
 
-   /*
-* dissolve free hugepages in the memory block before doing offlining
-* actually in order to make hugetlbfs's object counting consistent.
-*/
-   ret = dissolve_free_huge_pages(start_pfn, end_pfn);
-   if (ret) {
-   reason = "failure to dissolve huge pages";
-   goto failed_removal_isolated;
-   }
-   /* check again */
-   offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-   if (offlined_pages < 0)
-   goto repeat;
pr_info("Offlined Pages %ld\n", offlined_pages);
/* Ok, all of our target is isolated.
   We cannot do rollback at this point. */
-- 
2.19.1



[RFC PATCH 3/3] mm, fault_around: do not take a reference to a locked page

2018-11-20 Thread Michal Hocko
From: Michal Hocko 

filemap_map_pages takes a speculative reference to each page in the
range before it tries to lock that page. While this is correct it
also can influence page migration which will bail out when seeing
an elevated reference count. The faultaround code would bail on
seeing a locked page so we can pro-actively check the PageLocked
bit before page_cache_get_speculative and prevent from pointless
reference count churn.
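
To make the intended ordering explicit (an illustrative pseudo-flow of
the fastpath only; the actual one-line change is in the diff below):

	/* opportunistic: skip pages somebody already holds locked */
	if (PageLocked(head))
		goto next;	/* no reference count bump at all */
	if (!page_cache_get_speculative(head))
		goto next;	/* the page was freed under us */
	/* everything else is re-checked under the reference as before */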

Cc: "Kirill A. Shutemov" 
Suggested-by: Jan Kara 
Signed-off-by: Michal Hocko 
---
 mm/filemap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..c76d6a251770 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2553,6 +2553,9 @@ void filemap_map_pages(struct vm_fault *vmf,
goto next;
 
head = compound_head(page);
+
+   if (PageLocked(head))
+   goto next;
if (!page_cache_get_speculative(head))
goto next;
 
-- 
2.19.1



Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc

2018-11-20 Thread Michal Hocko
Damn, David somehow didn't make it to the CC list. Sorry about that.

On Tue 20-11-18 11:35:15, Michal Hocko wrote:
> From: Michal Hocko 
> 
> David Rientjes has reported that 1860033237d4 ("mm: make
> PR_SET_THP_DISABLE immediately active") has changed the way
> we report THPable VMAs to the userspace. Their monitoring tool is
> triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
> an insufficient THP usage as a memory fragmentation resp. memory
> pressure issue.
> 
> Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
> flag and that got exposed to the userspace via /proc//smaps file.
> This implementation had its downsides as explained in the commit message
> but it is true that the userspace doesn't have any means to query for
> the process wide THP enabled/disabled status.
> 
> PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense
> to export it in the process wide context rather than per-vma. Introduce
> a new field to /proc//status which exports this status.  If
> PR_SET_THP_DISABLE is used then it reports false, the same as when THP is
> not compiled in. It doesn't consider the global THP status because we
> already export that information via sysfs.
> 
> Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> Signed-off-by: Michal Hocko 
> ---
>  Documentation/filesystems/proc.txt |  3 +++
>  fs/proc/array.c| 10 ++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 06562bab509a..7995e9322889 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -182,6 +182,7 @@ For example, to get the status information of a process, 
> all you have to do is
>VmSwap:0 kB
>HugetlbPages:  0 kB
>CoreDumping:0
> +  THP_enabled: 1
>Threads:1
>SigQ:   0/28578
>SigPnd: 
> @@ -256,6 +257,8 @@ Table 1-2: Contents of the status files (as of 4.8)
>   HugetlbPagessize of hugetlb memory portions
>   CoreDumping process's memory is currently being dumped
>   (killing the process may lead to a corrupted 
> core)
> + THP_enabled  process is allowed to use THP (returns 0 when
> +  PR_SET_THP_DISABLE is set on the process)
>   Threads number of threads
>   SigQnumber of signals queued/max. number for queue
>   SigPnd  bitmap of pending signals for the thread
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 0ceb3b6b37e7..9d428d5a0ac8 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -392,6 +392,15 @@ static inline void task_core_dumping(struct seq_file *m, 
> struct mm_struct *mm)
>   seq_putc(m, '\n');
>  }
>  
> +static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
> +{
> + bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
> +
> + if (thp_enabled)
> + thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
> + seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
> +}
> +
>  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>   struct pid *pid, struct task_struct *task)
>  {
> @@ -406,6 +415,7 @@ int proc_pid_status(struct seq_file *m, struct 
> pid_namespace *ns,
>   if (mm) {
>   task_mem(m, mm);
>   task_core_dumping(m, mm);
> + task_thp_status(m, mm);
>   mmput(mm);
>   }
>   task_sig(m, task);
> -- 
> 2.19.1

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma

2018-11-20 Thread Michal Hocko
Damn, David somehow didn't make it to the CC list. Sorry about that.

On Tue 20-11-18 11:35:14, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Userspace falls short when trying to find out whether a specific memory
> range is eligible for THP. There are usecases that would like to know
> that
> http://lkml.kernel.org/r/alpine.deb.2.21.1809251248450.50...@chino.kir.corp.google.com
> : This is used to identify heap mappings that should be able to fault thp
> : but do not, and they normally point to a low-on-memory or fragmentation
> : issue.
> 
> The only way to deduce this now is to query for hg resp. nh flags and
> confront the state with the global setting. Except that there is
> also PR_SET_THP_DISABLE that might change the picture. So the final
> logic is not trivial. Moreover the eligibility of the vma depends on
> the type of VMA as well. In the past we have supported only anonymous
> memory VMAs but things have changed and shmem based vmas are supported
> as well these days and the query logic gets even more complicated
> because the eligibility depends on the mount option and another global
> configuration knob.
> 
> Simplify the current state and report the THP eligibility in
> /proc//smaps for each existing vma. Reuse transparent_hugepage_enabled
> for this purpose. The original implementation of this function assumes
> that the caller knows that the vma itself is supported for THP so make
> the core checks into __transparent_hugepage_enabled and use it for
> existing callers. __show_smap just uses the new transparent_hugepage_enabled
> which also checks the vma support status (please note that this one has
> to be out of line due to include dependency issues).
> 
> Signed-off-by: Michal Hocko 
> ---
>  Documentation/filesystems/proc.txt |  3 +++
>  fs/proc/task_mmu.c |  2 ++
>  include/linux/huge_mm.h| 13 -
>  mm/huge_memory.c   | 12 +++-
>  mm/memory.c|  4 ++--
>  5 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index b1fda309f067..06562bab509a 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -425,6 +425,7 @@ SwapPss:   0 kB
>  KernelPageSize:4 kB
>  MMUPageSize:   4 kB
>  Locked:0 kB
> +THPeligible:   0
>  VmFlags: rd ex mr mw me dw
>  
>  the first of these lines shows the same information as is displayed for the
> @@ -462,6 +463,8 @@ replaced by copy-on-write) part of the underlying shmem 
> object out on swap.
>  "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
>  does not take into account swapped out page of underlying shmem objects.
>  "Locked" indicates whether the mapping is locked in memory or not.
> +"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
> +true, 0 otherwise.
>  
>  "VmFlags" field deserves a separate description. This member represents the 
> kernel
>  flags associated with the particular virtual memory area in two letter 
> encoded
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 47c3764c469b..c9f160eb9fbc 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -790,6 +790,8 @@ static int show_smap(struct seq_file *m, void *v)
>  
> 	__show_smap(m, &mss);
>  
> + seq_printf(m, "THPeligible:%d\n", 
> transparent_hugepage_enabled(vma));
> +
>   if (arch_pkeys_enabled())
>   seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
>   show_smap_vma_flags(m, vma);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4663ee96cf59..381e872bfde0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct 
> *vma);
>  
>  extern unsigned long transparent_hugepage_flags;
>  
> -static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
> +/*
> + * to be used on vmas which are known to support THP.
> + * Use transparent_hugepage_enabled otherwise
> + */
> +static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
>   if (vma->vm_flags & VM_NOHUGEPAGE)
>   return false;
> @@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct 
> vm_area_struct *vma)
>   return false;
>  }
>  
> +bool transparent_hugepage_enabled(struct vm_area_struct *vma);
> +
>  #define transparent_hugepage_use_zero_page() \
>

Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc//smaps

2018-11-20 Thread Michal Hocko
On Tue 20-11-18 11:51:35, Jan Kara wrote:
> Honestly, it just shows that no amount of documentation is going to stop
> userspace from abusing API that's exposing too much if there's no better
> alternative.

Yeah, I agree. And we should never expose such a low level stuff in the
first place. But, well, this ship has already sailed...

> But this is a good clarification regardless. So feel free to
> add:
> 
> Acked-by: Jan Kara 

Thanks!
-- 
Michal Hocko
SUSE Labs


[RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc

2018-11-20 Thread Michal Hocko
From: Michal Hocko 

David Rientjes has reported that 1860033237d4 ("mm: make
PR_SET_THP_DISABLE immediately active") has changed the way
we report THPable VMAs to the userspace. Their monitoring tool is
triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
an insufficient THP usage as a memory fragmentation resp. memory
pressure issue.

Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
flag and that got exposed to the userspace via /proc//smaps file.
This implementation had its downsides as explained in the commit message
but it is true that the userspace doesn't have any means to query for
the process wide THP enabled/disabled status.

PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense
to export it in the process wide context rather than per-vma. Introduce
a new field to /proc//status which exports this status.  If
PR_SET_THP_DISABLE is used then it reports false, the same as when THP is
not compiled in. It doesn't consider the global THP status because we
already export that information via sysfs.
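
A trivial way to observe the new field from userspace (an illustrative
test program only, not part of the patch):

#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

static void print_thp_enabled(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "THP_enabled:", 12))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	print_thp_enabled();			/* THP_enabled:	1 */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
	print_thp_enabled();			/* THP_enabled:	0 */
	return 0;
}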

Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
Signed-off-by: Michal Hocko 
---
 Documentation/filesystems/proc.txt |  3 +++
 fs/proc/array.c| 10 ++
 2 files changed, 13 insertions(+)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 06562bab509a..7995e9322889 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -182,6 +182,7 @@ For example, to get the status information of a process, 
all you have to do is
   VmSwap:0 kB
   HugetlbPages:  0 kB
   CoreDumping:0
+  THP_enabled:   1
   Threads:1
   SigQ:   0/28578
   SigPnd: 
@@ -256,6 +257,8 @@ Table 1-2: Contents of the status files (as of 4.8)
  HugetlbPagessize of hugetlb memory portions
  CoreDumping process's memory is currently being dumped
  (killing the process may lead to a corrupted core)
+ THP_enabled		process is allowed to use THP (returns 0 when
+			PR_SET_THP_DISABLE is set on the process)
  Threads number of threads
  SigQnumber of signals queued/max. number for queue
  SigPnd  bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 0ceb3b6b37e7..9d428d5a0ac8 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -392,6 +392,15 @@ static inline void task_core_dumping(struct seq_file *m, 
struct mm_struct *mm)
seq_putc(m, '\n');
 }
 
+static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
+{
+   bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
+
+   if (thp_enabled)
+   thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+   seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
 {
@@ -406,6 +415,7 @@ int proc_pid_status(struct seq_file *m, struct 
pid_namespace *ns,
if (mm) {
task_mem(m, mm);
task_core_dumping(m, mm);
+   task_thp_status(m, mm);
mmput(mm);
}
task_sig(m, task);
-- 
2.19.1



[RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc//smaps

2018-11-20 Thread Michal Hocko
From: Michal Hocko 

Even though vma flags exported via /proc//smaps are explicitly
documented to be not guaranteed for future compatibility the warning
doesn't go far enough because it doesn't mention semantic changes to
those flags. And they are important as well because these flags are
a deep implementation internal to the MM code and the semantic might
change at any time.

Let's consider two recent examples:
http://lkml.kernel.org/r/20181002100531.gc4...@quack2.suse.cz
: commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
: removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
: mean time certain customer of ours started poking into /proc//smaps
: and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
: flags, the application just fails to start complaining that DAX support is
: missing in the kernel.

http://lkml.kernel.org/r/alpine.deb.2.21.1809241054050.224...@chino.kir.corp.google.com
: Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
: introduced a regression in that userspace cannot always determine the set
: of vmas where thp is ineligible.
: Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
: to determine if a vma is eligible to be backed by hugepages.
: Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
: be disabled and emit "nh" as a flag for the corresponding vmas as part of
: /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
: flag and "nh" is not emitted.
: This causes smaps parsing libraries to assume a vma is eligible for thp
: and ends up puzzling the user on why its memory is not backed by thp.

In both cases userspace was relying on a semantic of a specific VMA
flag. The primary reason why that happened is a lack of a proper
interface. While this has been worked on and it will be fixed properly,
it seems that our wording could see some refinement and be more vocal
about the semantic aspect of these flags as well.

Cc: Jan Kara 
Cc: Dan Williams 
Cc: David Rientjes 
Signed-off-by: Michal Hocko 
---
 Documentation/filesystems/proc.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 12a5e6e693b6..b1fda309f067 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in 
two letter encoded
 
 Note that there is no guarantee that every flag and associated mnemonic will
 be present in all further kernel releases. Things get changed, the flags may
-be vanished or the reverse -- new added.
+be vanished or the reverse -- new added. Interpretation of their meaning
+might change in future as well. So each consumer of these flags has to
+follow each specific kernel version for the exact semantics.
 
 This file is only present if the CONFIG_MMU kernel configuration option is
 enabled.
-- 
2.19.1



[RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma

2018-11-20 Thread Michal Hocko
From: Michal Hocko 

Userspace falls short when trying to find out whether a specific memory
range is eligible for THP. There are usecases that would like to know
that
http://lkml.kernel.org/r/alpine.deb.2.21.1809251248450.50...@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.

The only way to deduce this now is to query for hg resp. nh flags and
confront the state with the global setting. Except that there is
also PR_SET_THP_DISABLE that might change the picture. So the final
logic is not trivial. Moreover the eligibility of the vma depends on
the type of VMA as well. In the past we have supported only anonymous
memory VMAs but things have changed and shmem based vmas are supported
as well these days and the query logic gets even more complicated
because the eligibility depends on the mount option and another global
configuration knob.

Simplify the current state and report the THP eligibility in
/proc//smaps for each existing vma. Reuse transparent_hugepage_enabled
for this purpose. The original implementation of this function assumes
that the caller knows that the vma itself is supported for THP so make
the core checks into __transparent_hugepage_enabled and use it for
existing callers. __show_smap just uses the new transparent_hugepage_enabled
which also checks the vma support status (please note that this one has
to be out of line due to include dependency issues).
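
For userspace, checking the eligibility per mapping then becomes a simple
scan of smaps (an illustrative program only; it assumes the THPeligible
output format introduced by this patch):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	int v, eligible = 0, total = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "THPeligible:%d", &v) == 1) {
			total++;
			eligible += v;
		}
	}
	fclose(f);
	printf("%d of %d mappings are THP eligible\n", eligible, total);
	return 0;
}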

Signed-off-by: Michal Hocko 
---
 Documentation/filesystems/proc.txt |  3 +++
 fs/proc/task_mmu.c |  2 ++
 include/linux/huge_mm.h| 13 -
 mm/huge_memory.c   | 12 +++-
 mm/memory.c|  4 ++--
 5 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index b1fda309f067..06562bab509a 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -425,6 +425,7 @@ SwapPss:   0 kB
 KernelPageSize:4 kB
 MMUPageSize:   4 kB
 Locked:0 kB
+THPeligible:   0
 VmFlags: rd ex mr mw me dw
 
 the first of these lines shows the same information as is displayed for the
@@ -462,6 +463,8 @@ replaced by copy-on-write) part of the underlying shmem 
object out on swap.
 "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
 does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
+"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
+true, 0 otherwise.
 
 "VmFlags" field deserves a separate description. This member represents the 
kernel
 flags associated with the particular virtual memory area in two letter encoded
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 47c3764c469b..c9f160eb9fbc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -790,6 +790,8 @@ static int show_smap(struct seq_file *m, void *v)
 
	__show_smap(m, &mss);
 
+   seq_printf(m, "THPeligible:%d\n", 
transparent_hugepage_enabled(vma));
+
if (arch_pkeys_enabled())
seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
show_smap_vma_flags(m, vma);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4663ee96cf59..381e872bfde0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct 
*vma);
 
 extern unsigned long transparent_hugepage_flags;
 
-static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+/*
+ * to be used on vmas which are known to support THP.
+ * Use transparent_hugepage_enabled otherwise
+ */
+static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
if (vma->vm_flags & VM_NOHUGEPAGE)
return false;
@@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct 
vm_area_struct *vma)
return false;
 }
 
+bool transparent_hugepage_enabled(struct vm_area_struct *vma);
+
 #define transparent_hugepage_use_zero_page()				\
	(transparent_hugepage_flags &					\
	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
+bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+	if (vma_is_anonymous(vma))
+		return __transparent_hugepage_enabled(vma);
+	if (shmem_mapping(vma->vm_file->f_mapping) && shmem_huge_enabled(vma))
+		return __transparent_hugepage_enabled(vma);
+
+	return false;
+}
+
 static struct page *get_huge_zero_page(void)
 {
	struct page *zero_page;
@@ -1303,7 +1313,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
	get_page(page);
	spin_unlock(vmf->ptl);
 alloc:
-	if (transparent_hugepage_enabled(vma) &&
+	if (__transparent_hugepage_enabled(vma) &&
  
