Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-27 Thread Vlastimil Babka
On 11/2/23 16:46, Paolo Bonzini wrote:
> On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson  wrote:
>> Actually, looking at this again, there's not actually a hard dependency on THP.
>> A THP-enabled kernel _probably_ gives a higher probability of using hugepages,
>> but mostly because THP selects COMPACTION, and I suppose because using THP for
>> other allocations reduces overall fragmentation.
> 
> Yes, that's why I didn't even bother enabling it unless THP is
> enabled, but it makes even more sense to just try.
> 
>> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
>> we should do the below (I verified KVM can create hugepages with THP=n).  We'll
>> need another capability, but (a) we probably should have that anyways and (b) it
>> provides a cleaner path to adding PUD-sized hugepage support in the future.
> 
> I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
> should be a generic kernel API and in fact the sizes are available in
> a not-so-friendly format in /sys/kernel/mm/hugepages.
> 
> We should just add /sys/kernel/mm/hugepages/sizes that contains
> "2097152 1073741824" on x86 (only the former if 1G pages are not
> supported).
> 
> Plus: is this the best API if we need something else for 1G pages?
> 
> Let's drop *this* patch and proceed incrementally. (Again, this is
> what I want to do with this final review: identify places that are
> still sticky, and don't let them block the rest).
> 
> Coincidentally we have an open spot next week at Plumbers. Let's
> extend Fuad's section to cover more guestmem work.
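(For reference, the "not-so-friendly format" refers to the per-size
hugepages-<size>kB directories; a small illustrative userspace sketch that
extracts the supported sizes from them:)

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dir = opendir("/sys/kernel/mm/hugepages");
        struct dirent *de;
        unsigned long kb;

        if (!dir)
            return 1;
        /* directory names look like "hugepages-2048kB" */
        while ((de = readdir(dir)) != NULL) {
            if (sscanf(de->d_name, "hugepages-%lukB", &kb) == 1)
                printf("%lu\n", kb * 1024); /* e.g. 2097152, 1073741824 */
        }
        closedir(dir);
        return 0;
    }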

Hi,

was there any outcome wrt this one? Based on my experience with THPs it
would be best if userspace didn't have to opt in, nor care about the
supported size. If the given size is unaligned, provide a mix of large pages
up to an aligned size and fall back to base pages for the rest, which should
be better than -EINVAL on creation (is that possible with the current
implementation? I'd hope so). A way to opt out from huge pages could be
useful, although there's always the risk of some initial troubles resulting
in various online sources cargo-cult recommending to opt out forever.
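As a rough illustration of that fallback (hypothetical helper, not the actual
guest_memfd code): try a PMD-sized folio only at naturally aligned indexes,
and otherwise, or on allocation failure, use a base page.

    /* Sketch only: "large pages where aligned, base pages for the rest". */
    static struct folio *gmem_alloc_folio(gfp_t gfp, pgoff_t index)
    {
        const unsigned int huge_order = PMD_SHIFT - PAGE_SHIFT;
        struct folio *folio = NULL;

        if (IS_ALIGNED(index, 1UL << huge_order))
            folio = folio_alloc(gfp | __GFP_NORETRY | __GFP_NOWARN,
                                huge_order);

        if (!folio)
            folio = folio_alloc(gfp, 0);

        return folio;
    }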

Vlastimil


Re: [powerpc] Lockups seen during/just after boot (bisected)

2023-11-24 Thread Vlastimil Babka
On 11/23/23 15:35, Chengming Zhou wrote:
> On 2023/11/23 19:27, Sachin Sant wrote:
>> While booting recent -next kernel on IBM Power server, I have observed 
>> lockups
>> either during boot or just after.
>> 
>> [ 3631.015775] watchdog: CPU 3 self-detected hard LOCKUP @ __update_freelist_slow+0x74/0x90
> 
> Sorry, the bug can be fixed by this patch from Vlastimil Babka:
> 
> https://lore.kernel.org/all/83ff4b9e-94f1-8b35-1233-3dd414ea4...@suse.cz/

The current -next should be fixed, the fix was folded to the preparatory
commit, which is now:

8a399e2f6003 ("slub: Keep track of whether slub is on the per-node partial list")

> Thanks!



[PATCH gmem FIXUP v2] mm, compaction: make testing mapping_unmovable() safe

2023-09-08 Thread Vlastimil Babka
As Kirill pointed out, mapping can be removed under us due to
truncation. Test it under folio lock as already done for the async
compaction / dirty folio case. To prevent locking every folio with
mapping to do the test, do it only for unevictable folios, as we can
expect the unmovable mapping folios are also unevictable. To enforce
that expectation, make mapping_set_unmovable() also set AS_UNEVICTABLE.

Also incorporate comment update suggested by Matthew.

Fixes: 3424873596ce ("mm: Add AS_UNMOVABLE to mark mapping as completely unmovable")
Signed-off-by: Vlastimil Babka 
---
v2: mapping_set_unmovable() sets also AS_UNEVICTABLE, as Sean suggested.

 include/linux/pagemap.h |  6 ++++++
 mm/compaction.c         | 49 ++++++++++++++++++++++++++++++++-----------------
 virt/kvm/guest_mem.c    |  2 +-
 3 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 931d2f1da7d5..4070c59e6f25 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -276,6 +276,12 @@ static inline int mapping_use_writeback_tags(struct address_space *mapping)
 
 static inline void mapping_set_unmovable(struct address_space *mapping)
 {
+   /*
+* It's expected unmovable mappings are also unevictable. Compaction
+* migrate scanner (isolate_migratepages_block()) relies on this to
+* reduce page locking.
+*/
+   set_bit(AS_UNEVICTABLE, &mapping->flags);
    set_bit(AS_UNMOVABLE, &mapping->flags);
 }
 
diff --git a/mm/compaction.c b/mm/compaction.c
index a3d2b132df52..e0e439b105b5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -862,6 +862,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
+   bool is_dirty, is_unevictable;
 
if (skip_on_failure && low_pfn >= next_skip_pfn) {
/*
@@ -1047,10 +1048,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
goto isolate_fail_put;
 
-   /* The mapping truly isn't movable. */
-   if (mapping && mapping_unmovable(mapping))
-   goto isolate_fail_put;
-
/*
 * Only allow to migrate anonymous pages in GFP_NOFS context
 * because those do not depend on fs locks.
@@ -1062,8 +1059,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!folio_test_lru(folio))
goto isolate_fail_put;
 
+   is_unevictable = folio_test_unevictable(folio);
+
/* Compaction might skip unevictable pages but CMA takes them */
-   if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+   if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
goto isolate_fail_put;
 
/*
@@ -1075,26 +1074,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
goto isolate_fail_put;
 
-   if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-   bool migrate_dirty;
+   is_dirty = folio_test_dirty(folio);
+
+   if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty)
+   || (mapping && is_unevictable)) {
+   bool migrate_dirty = true;
+   bool is_unmovable;
 
/*
-* Only pages without mappings or that have a
-* ->migrate_folio callback are possible to migrate
-* without blocking. However, we can be racing with
-* truncation so it's necessary to lock the page
-* to stabilise the mapping as truncation holds
-* the page lock until after the page is removed
-* from the page cache.
+* Only folios without mappings or that have
+* a ->migrate_folio callback are possible to migrate
+* without blocking.
+*
+* Folios from unmovable mappings are not migratable.
+*
+* However, we can be racing with truncation, which can
+* free the mapping that we need to check. Truncation
+* holds the folio lock until after the folio is removed
+* from the page so holding it ourselves is sufficient.

Re: [PATCH gmem FIXUP] mm, compaction: make testing mapping_unmovable() safe

2023-09-06 Thread Vlastimil Babka
On 9/6/23 01:56, Sean Christopherson wrote:
> On Fri, Sep 01, 2023, Vlastimil Babka wrote:
>> As Kirill pointed out, mapping can be removed under us due to
>> truncation. Test it under folio lock as already done for the async
>> compaction / dirty folio case. To prevent locking every folio with
>> mapping to do the test, do it only for unevictable folios, as we can
>> expect the unmovable mapping folios are also unevictable - it is the
>> case for guest memfd folios.
> 
> Rather than expect/assume that unmovable mappings are always unevictable, how 
> about
> requiring that?  E.g. either through a VM_WARN_ON in mapping_set_unmovable(), 
> or by
> simply having that helper forcefully set AS_UNEVICTABLE as well.

Yeah I guess we could make the helper do that, with a comment, as gmem is
the only user right now. And if in the future somebody has case where it
makes sense to have unmovable without unevictable, we can discuss what to do
about it then.


Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-09-01 Thread Vlastimil Babka
On 7/25/23 14:51, Matthew Wilcox wrote:
> On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
>> On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
>> > diff --git a/mm/compaction.c b/mm/compaction.c
>> > index dbc9f86b1934..a3d2b132df52 100644
>> > --- a/mm/compaction.c
>> > +++ b/mm/compaction.c
>> > @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> >if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
>> >goto isolate_fail_put;
>> >  
>> > +  /* The mapping truly isn't movable. */
>> > +  if (mapping && mapping_unmovable(mapping))
>> > +  goto isolate_fail_put;
>> > +
>> 
>> I doubt that it is safe to dereference mapping here. I believe the folio
>> can be truncated from under us and the mapping freed with the inode.
>> 
>> The folio has to be locked to dereference mapping safely (given that the
>> mapping is still tied to the folio).
> 
> There's even a comment to that effect later on in the function:
> 
> /*
>  * Only pages without mappings or that have a
>  * ->migrate_folio callback are possible to migrate
>  * without blocking. However, we can be racing with
>  * truncation so it's necessary to lock the page
>  * to stabilise the mapping as truncation holds
>  * the page lock until after the page is removed
>  * from the page cache.
>  */
> 
> (that could be reworded to make it clear how dangerous dereferencing
> ->mapping is without the lock ... and it does need to be changed to say
> "folio lock" instead of "page lock", so ...)
> 
> How does this look?
> 
> /*
>  * Only folios without mappings or that have
>  * a ->migrate_folio callback are possible to
>  * migrate without blocking. However, we can
>  * be racing with truncation, which can free
>  * the mapping.  Truncation holds the folio lock
>  * until after the folio is removed from the page
>  * cache so holding it ourselves is sufficient.
>  */

Incorporated to my attempt at a fix (posted separately per the requested
process):

https://lore.kernel.org/all/20230901082025.20548-2-vba...@suse.cz/


[PATCH gmem FIXUP] mm, compaction: make testing mapping_unmovable() safe

2023-09-01 Thread Vlastimil Babka
As Kirill pointed out, mapping can be removed under us due to
truncation. Test it under folio lock as already done for the async
compaction / dirty folio case. To prevent locking every folio with
mapping to do the test, do it only for unevictable folios, as we can
expect the unmovable mapping folios are also unevictable - it is the
case for guest memfd folios.

Also incorporate comment update suggested by Matthew.

Fixes: 3424873596ce ("mm: Add AS_UNMOVABLE to mark mapping as completely unmovable")
Signed-off-by: Vlastimil Babka 
---
Feel free to squash into 3424873596ce.

 mm/compaction.c | 49 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 17 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a3d2b132df52..e0e439b105b5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -862,6 +862,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
+   bool is_dirty, is_unevictable;
 
if (skip_on_failure && low_pfn >= next_skip_pfn) {
/*
@@ -1047,10 +1048,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
goto isolate_fail_put;
 
-   /* The mapping truly isn't movable. */
-   if (mapping && mapping_unmovable(mapping))
-   goto isolate_fail_put;
-
/*
 * Only allow to migrate anonymous pages in GFP_NOFS context
 * because those do not depend on fs locks.
@@ -1062,8 +1059,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!folio_test_lru(folio))
goto isolate_fail_put;
 
+   is_unevictable = folio_test_unevictable(folio);
+
/* Compaction might skip unevictable pages but CMA takes them */
-   if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+   if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
goto isolate_fail_put;
 
/*
@@ -1075,26 +1074,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
goto isolate_fail_put;
 
-   if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-   bool migrate_dirty;
+   is_dirty = folio_test_dirty(folio);
+
+   if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty)
+   || (mapping && is_unevictable)) {
+   bool migrate_dirty = true;
+   bool is_unmovable;
 
/*
-* Only pages without mappings or that have a
-* ->migrate_folio callback are possible to migrate
-* without blocking. However, we can be racing with
-* truncation so it's necessary to lock the page
-* to stabilise the mapping as truncation holds
-* the page lock until after the page is removed
-* from the page cache.
+* Only folios without mappings or that have
+* a ->migrate_folio callback are possible to migrate
+* without blocking.
+*
+* Folios from unmovable mappings are not migratable.
+*
+* However, we can be racing with truncation, which can
+* free the mapping that we need to check. Truncation
+* holds the folio lock until after the folio is removed
+* from the page so holding it ourselves is sufficient.
+*
+* To avoid this folio locking to inspect every folio
+* with mapping for being unmovable, we assume every
+* such folio is also unevictable, which is a cheaper
+* test. If our assumption goes wrong, it's not a bug,
+* just potentially wasted cycles.
 */
if (!folio_trylock(folio))
goto isolate_fail_put;
 
mapping = folio_mapping(folio);
-   migrate_dirty = !mapping ||
-   mapping->a_ops->migrate_folio;
+   if ((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) {

Re: [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

2023-08-03 Thread Vlastimil Babka
On 7/26/23 13:20, Nikunj A. Dadhania wrote:
> Hi Sean,
> 
> On 7/24/2023 10:30 PM, Sean Christopherson wrote:
>> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
 This is the next iteration of implementing fd-based (instead of vma-based)
 memory for KVM guests.  If you want the full background of why we are doing
 this, please go read the v10 cover letter[1].

 The biggest change from v10 is to implement the backing storage in KVM
 itself, and expose it via a KVM ioctl() instead of a "generic" sycall.
 See link[2] for details on why we pivoted to a KVM-specific approach.

 Key word is "biggest".  Relative to v10, there are many big changes.
 Highlights below (I can't remember everything that got changed at
 this point).

 Tagged RFC as there are a lot of empty changelogs, and a lot of missing
 documentation.  And ideally, we'll have even more tests before merging.
 There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>>
>>> As per our discussion on the PUCK call, here are the memory/NUMA accounting 
>>> related observations that I had while working on SNP guest secure page 
>>> migration:
>>>
>>> * gmem allocations are currently treated as file page allocations
>>>   accounted to the kernel and not to the QEMU process.
>> 
>> We need to level set on terminology: these are all *stats*, not accounting.  
>> That
>> distinction matters because we have wiggle room on stats, e.g. we can 
>> probably get
>> away with just about any definition of how guest_memfd memory impacts stats, 
>> so
>> long as the information that is surfaced to userspace is useful and expected.
>> 
>> But we absolutely need to get accounting correct, specifically the 
>> allocations
>> need to be correctly accounted in memcg.  And unless I'm missing something,
>> nothing in here shows anything related to memcg.
> 
> I tried out memcg after creating a separate cgroup for the qemu process. 
> Guest 
> memory is accounted in memcg.
> 
>   $ egrep -w "file|file_thp|unevictable" memory.stat
>   file 42978775040
>   file_thp 42949672960
>   unevictable 42953588736 
> 
> NUMA allocations are coming from right nodes as set by the numactl.
> 
>   $ egrep -w "file|file_thp|unevictable" memory.numa_stat
>   file N0=0 N1=20480 N2=21489377280 N3=21489377280
>   file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
>   unevictable N0=0 N1=0 N2=21474697216 N3=21478891520
> 
>> 
>>>   Starting an SNP guest with 40G memory with memory interleave between
>>>   Node2 and Node3
>>>
>>>   $ numactl -i 2,3 ./bootg_snp.sh
>>>
>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
>>>
>>>   -> Incorrect process resident memory and shared memory is reported
>> 
>> I don't know that I would call these "incorrect".  Shared memory definitely 
>> is
>> correct, because by definition guest_memfd isn't shared.  RSS is less clear 
>> cut;
>> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up 
>> with
>> scenarios where RSS > VIRT, which will be quite confusing for unaware users 
>> (I'm
>> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
>> memslots).
> 
> I am not sure why will RSS exceed the VIRT, it should be at max 40G (assuming 
> all the
> memory is private)
> 
> As per my experiments with a hack below. MM_FILEPAGES does get accounted to 
> RSS/SHR in top
> 
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>   4339 root      20   0   40.4g  40.1g  40.1g S  76.7  16.0   0:13.83 qemu-system-x86
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f456f3b5049c..5b1f48a2e714 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
>  {
> trace_rss_stat(mm, member);
>  }
> +EXPORT_SYMBOL(mm_trace_rss_stat);
> 
>  /*
>   * Note: this doesn't free the actual pages themselves. That
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index a7e926af4255..e4f268bf9ce2 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
> clear_highpage(folio_page(folio, i));
> }
> 
> +   /* Account only once for the first time */
> +   if (!folio_test_dirty(folio))
> +   add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));

I think this alone would cause "Bad rss-counter" messages when the process
exits, because there's no corresponding decrement when page tables are torn
down. We would probably have to instantiate the page tables (i.e. with
PROT_NONE so userspace can't really do accesses through them) for this to
work properly.

So then it wouldn't technically be "unmapped private memory" 

Re: [RFC PATCH v11 11/29] security: Export security_inode_init_security_anon() for use by KVM

2023-07-31 Thread Vlastimil Babka
On 7/19/23 01:44, Sean Christopherson wrote:
> Signed-off-by: Sean Christopherson 

Process-wise this will probably be frowned upon when done separately, so I'd
fold it into the patch using the export, which seems to be the next one.

> ---
>  security/security.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/security/security.c b/security/security.c
> index b720424ca37d..7fc78f0f3622 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1654,6 +1654,7 @@ int security_inode_init_security_anon(struct inode 
> *inode,
>   return call_int_hook(inode_init_security_anon, 0, inode, name,
>context_inode);
>  }
> +EXPORT_SYMBOL_GPL(security_inode_init_security_anon);
>  
>  #ifdef CONFIG_SECURITY_PATH
>  /**



Re: [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-07-28 Thread Vlastimil Babka
On 7/25/23 14:51, Matthew Wilcox wrote:
> On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
>> On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
>> > diff --git a/mm/compaction.c b/mm/compaction.c
>> > index dbc9f86b1934..a3d2b132df52 100644
>> > --- a/mm/compaction.c
>> > +++ b/mm/compaction.c
>> > @@ -1047,6 +1047,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>> >if (!mapping && (folio_ref_count(folio) - 1) > folio_mapcount(folio))
>> >goto isolate_fail_put;
>> >  
>> > +  /* The mapping truly isn't movable. */
>> > +  if (mapping && mapping_unmovable(mapping))
>> > +  goto isolate_fail_put;
>> > +
>> 
>> I doubt that it is safe to dereference mapping here. I believe the folio
>> can be truncated from under us and the mapping freed with the inode.
>> 
>> The folio has to be locked to dereference mapping safely (given that the
>> mapping is still tied to the folio).
> 
> There's even a comment to that effect later on in the function:

Hmm, well spotted. But it wouldn't be so great if we now had to lock every
inspected page (and not just dirty pages), just to check the AS_ bit.

But I wonder if this is leftover from previous versions. Are the guest pages
even PageLRU currently? (and should they be, given how they can't be swapped
out or anything?) If not, isolate_migratepages_block will skip them anyway.

> 
> /*
>  * Only pages without mappings or that have a
>  * ->migrate_folio callback are possible to migrate
>  * without blocking. However, we can be racing with
>  * truncation so it's necessary to lock the page
>  * to stabilise the mapping as truncation holds
>  * the page lock until after the page is removed
>  * from the page cache.
>  */
> 
> (that could be reworded to make it clear how dangerous dereferencing
> ->mapping is without the lock ... and it does need to be changed to say
> "folio lock" instead of "page lock", so ...)

> How does this look?
> 
> /*
>  * Only folios without mappings or that have
>  * a ->migrate_folio callback are possible to
>  * migrate without blocking. However, we can
>  * be racing with truncation, which can free
>  * the mapping.  Truncation holds the folio lock
>  * until after the folio is removed from the page
>  * cache so holding it ourselves is sufficient.
>  */
> 



Re: [PATCH v4 00/33] Per-VMA locks

2023-07-11 Thread Vlastimil Babka
On 7/11/23 12:35, Leon Romanovsky wrote:
> 
> On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> 
> <...>
> 
>> Laurent Dufour (1):
>>   powerc/mm: try VMA lock-based page fault handling first
> 
> Hi,
> 
> This series and specifically the commit above broke docker over PPC.
> It causes to docker service stuck while trying to activate. Revert of
> this commit allows us to use docker again.

Hi,

there have been follow-up fixes that are part of 6.4.3 stable (also
6.5-rc1). Does that version work for you?

Vlastimil

> [user@ppc-135-3-200-205 ~]# sudo systemctl status docker
> ● docker.service - Docker Application Container Engine
>  Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor 
> preset: disabled)
>  Active: activating (start) since Mon 2023-06-26 14:47:07 IDT; 3h 50min 
> ago
> TriggeredBy: ● docker.socket
>Docs: https://docs.docker.com
>Main PID: 276555 (dockerd)
>  Memory: 44.2M
>  CGroup: /system.slice/docker.service
>  └─ 276555 /usr/bin/dockerd -H fd:// 
> --containerd=/run/containerd/containerd.sock
> 
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.129383166+03:00" level=info msg="Graph migration to 
> content-addressability took 0.00 se>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.129666160+03:00" level=warning msg="Your kernel 
> does not support cgroup cfs period"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.129684117+03:00" level=warning msg="Your kernel 
> does not support cgroup cfs quotas"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.129697085+03:00" level=warning msg="Your kernel 
> does not support cgroup rt period"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.129711513+03:00" level=warning msg="Your kernel 
> does not support cgroup rt runtime"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.129720656+03:00" level=warning msg="Unable to find 
> blkio cgroup in mounts"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.129805617+03:00" level=warning msg="mountpoint for 
> pids not found"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.130199070+03:00" level=info msg="Loading 
> containers: start."
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.132688568+03:00" level=warning msg="Running 
> modprobe bridge br_netfilter failed with me>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: 
> time="2023-06-26T14:47:07.271014050+03:00" level=info msg="Default bridge 
> (docker0) is assigned with an IP addres>
> 
> Python script which we used for bisect:
> 
> import subprocess
> import time
> import sys
> 
> 
> def run_command(cmd):
> print('running:', cmd)
> 
> p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, 
> stderr=subprocess.PIPE)
> 
> try:
> stdout, stderr = p.communicate(timeout=30)
> 
> except subprocess.TimeoutExpired:
> return True
> 
> print(stdout.decode())
> print(stderr.decode())
> print('rc:', p.returncode)
> 
> return False
> 
> 
> def main():
> commands = [
> 'sudo systemctl stop docker',
> 'sudo systemctl status docker',
> 'sudo systemctl is-active docker',
> 'sudo systemctl start docker',
> 'sudo systemctl status docker',
> ]
> 
> for i in range(1000):
> title = f'Try no. {i + 1}'
> print('*' * 50, title, '*' * 50)
> 
> for cmd in commands:
> if run_command(cmd):
> print(f'Reproduced on try no. {i + 1}!')
> print(f'"{cmd}" is stuck!')
> 
> return 1
> 
> print('\n')
> time.sleep(30)
> return 0
> 
> if __name__ == '__main__':
> sys.exit(main())
> 
> Thanks



Re: [PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED

2023-05-24 Thread Vlastimil Babka
On 5/24/23 02:29, David Rientjes wrote:
> On Tue, 23 May 2023, Vlastimil Babka wrote:
> 
>> As discussed at LSF/MM [1] [2] and with no objections raised there,
>> deprecate the SLAB allocator. Rename the user-visible option so that
>> users with CONFIG_SLAB=y get a new prompt with explanation during make
>> oldconfig, while make olddefconfig will just switch to SLUB.
>> 
>> In all defconfigs with CONFIG_SLAB=y remove the line so those also
>> switch to SLUB. Regressions due to the switch should be reported to
>> linux-mm and slab maintainers.
>> 
>> [1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/
>> [2] https://lwn.net/Articles/932201/
>> 
>> Signed-off-by: Vlastimil Babka 
> 
> Acked-by: David Rientjes 

Thanks.

> The Kconfig option says that SLAB will be removed in a few cycles.  I 
> think we should wait until at least the next LTS kernel is forked at the 
> end of the year so that users who upgrade to only the LTS releases can be 
> prompted for this change and surface any concerns.  Slab allocation is a 
> critical subsystem, so I presume this is the safest and most responsible 
> way to do the SLAB deprecation.  Hopefully that timeline works for 
> everybody.

Sure, and in fact looking at predicted release dates [1], if the deprecation
goes into 6.5 then 6.7 ("few" == 2) is already end of January 2024, anyway.

[1] https://hansen.beer/~dave/phb/


Re: [PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED

2023-05-23 Thread Vlastimil Babka
On 5/23/23 11:22, Geert Uytterhoeven wrote:
> Hi Vlastimil,
> 
> Thanks for your patch!
> 
> On Tue, May 23, 2023 at 11:12 AM Vlastimil Babka  wrote:
>> As discussed at LSF/MM [1] [2] and with no objections raised there,
>> deprecate the SLAB allocator. Rename the user-visible option so that
>> users with CONFIG_SLAB=y get a new prompt with explanation during make
>> oldconfig, while make olddefconfig will just switch to SLUB.
>>
>> In all defconfigs with CONFIG_SLAB=y remove the line so those also
>> switch to SLUB. Regressions due to the switch should be reported to
>> linux-mm and slab maintainers.
> 
> Technically, removing these lines from the defconfig files does not
> have any impact, right?

Well, it doesn't, but I thought it's at least a useful heads-up for the
maintainers in case some have specific reasons for keeping SLAB there.

> And it removes one more sync point indicating the last time some
> defconfig files were (not) updated by their maintainers ;-)

Sure, I can exclude yours (and anyone else who'd prefer that), the ack on
the deprecation itself is sufficient.

>> [1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/
>> [2] https://lwn.net/Articles/932201/
>>
>> Signed-off-by: Vlastimil Babka 
> 
>>  arch/m68k/configs/amiga_defconfig   |  1 -
>>  arch/m68k/configs/apollo_defconfig  |  1 -
>>  arch/m68k/configs/atari_defconfig   |  1 -
>>  arch/m68k/configs/bvme6000_defconfig|  1 -
>>  arch/m68k/configs/hp300_defconfig   |  1 -
>>  arch/m68k/configs/mac_defconfig |  1 -
>>  arch/m68k/configs/multi_defconfig   |  1 -
>>  arch/m68k/configs/mvme147_defconfig |  1 -
>>  arch/m68k/configs/mvme16x_defconfig |  1 -
>>  arch/m68k/configs/q40_defconfig |  1 -
>>  arch/m68k/configs/sun3_defconfig|  1 -
>>  arch/m68k/configs/sun3x_defconfig   |  1 -
> 
> Regardless,
> Acked-by: Geert Uytterhoeven 

Thanks!

> Gr{oetje,eeting}s,
> 
> Geert
> 



[PATCH] mm/slab: rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED

2023-05-23 Thread Vlastimil Babka
As discussed at LSF/MM [1] [2] and with no objections raised there,
deprecate the SLAB allocator. Rename the user-visible option so that
users with CONFIG_SLAB=y get a new prompt with explanation during make
oldconfig, while make olddefconfig will just switch to SLUB.

In all defconfigs with CONFIG_SLAB=y remove the line so those also
switch to SLUB. Regressions due to the switch should be reported to
linux-mm and slab maintainers.

[1] https://lore.kernel.org/all/4b9fc9c6-b48c-198f-5f80-811a44737...@suse.cz/
[2] https://lwn.net/Articles/932201/

Signed-off-by: Vlastimil Babka 
---
 arch/arc/configs/axs103_smp_defconfig   |  1 -
 arch/arc/configs/haps_hs_defconfig  |  1 -
 arch/arc/configs/haps_hs_smp_defconfig  |  1 -
 arch/arc/configs/hsdk_defconfig |  1 -
 arch/arc/configs/tb10x_defconfig|  1 -
 arch/arm/configs/am200epdkit_defconfig  |  1 -
 arch/arm/configs/at91_dt_defconfig  |  1 -
 arch/arm/configs/dove_defconfig |  1 -
 arch/arm/configs/ep93xx_defconfig   |  1 -
 arch/arm/configs/imx_v4_v5_defconfig|  1 -
 arch/arm/configs/lpc32xx_defconfig  |  1 -
 arch/arm/configs/mmp2_defconfig |  1 -
 arch/arm/configs/mvebu_v7_defconfig |  1 -
 arch/arm/configs/nhk8815_defconfig  |  1 -
 arch/arm/configs/omap2plus_defconfig|  1 -
 arch/arm/configs/pxa168_defconfig   |  1 -
 arch/arm/configs/pxa3xx_defconfig   |  1 -
 arch/arm/configs/pxa910_defconfig   |  1 -
 arch/arm/configs/realview_defconfig |  1 -
 arch/arm/configs/rpc_defconfig  |  1 -
 arch/arm/configs/sama5_defconfig|  1 -
 arch/arm/configs/sama7_defconfig|  1 -
 arch/arm/configs/shmobile_defconfig |  1 -
 arch/arm/configs/sp7021_defconfig   |  1 -
 arch/arm/configs/tegra_defconfig|  1 -
 arch/arm/configs/versatile_defconfig|  1 -
 arch/m68k/configs/amiga_defconfig   |  1 -
 arch/m68k/configs/apollo_defconfig  |  1 -
 arch/m68k/configs/atari_defconfig   |  1 -
 arch/m68k/configs/bvme6000_defconfig|  1 -
 arch/m68k/configs/hp300_defconfig   |  1 -
 arch/m68k/configs/mac_defconfig |  1 -
 arch/m68k/configs/multi_defconfig   |  1 -
 arch/m68k/configs/mvme147_defconfig |  1 -
 arch/m68k/configs/mvme16x_defconfig |  1 -
 arch/m68k/configs/q40_defconfig |  1 -
 arch/m68k/configs/sun3_defconfig|  1 -
 arch/m68k/configs/sun3x_defconfig   |  1 -
 arch/microblaze/configs/mmu_defconfig   |  1 -
 arch/mips/configs/ar7_defconfig |  1 -
 arch/mips/configs/bcm47xx_defconfig |  1 -
 arch/mips/configs/bigsur_defconfig  |  1 -
 arch/mips/configs/cavium_octeon_defconfig   |  1 -
 arch/mips/configs/ci20_defconfig|  1 -
 arch/mips/configs/cu1000-neo_defconfig  |  1 -
 arch/mips/configs/cu1830-neo_defconfig  |  1 -
 arch/mips/configs/db1xxx_defconfig  |  1 -
 arch/mips/configs/decstation_64_defconfig   |  1 -
 arch/mips/configs/decstation_defconfig  |  1 -
 arch/mips/configs/decstation_r4k_defconfig  |  1 -
 arch/mips/configs/fuloong2e_defconfig   |  1 -
 arch/mips/configs/gpr_defconfig |  1 -
 arch/mips/configs/ip22_defconfig|  1 -
 arch/mips/configs/ip27_defconfig|  1 -
 arch/mips/configs/ip28_defconfig|  1 -
 arch/mips/configs/ip32_defconfig|  1 -
 arch/mips/configs/jazz_defconfig|  1 -
 arch/mips/configs/malta_defconfig   |  1 -
 arch/mips/configs/malta_kvm_defconfig   |  1 -
 arch/mips/configs/malta_qemu_32r6_defconfig |  1 -
 arch/mips/configs/maltaaprp_defconfig   |  1 -
 arch/mips/configs/maltasmvp_defconfig   |  1 -
 arch/mips/configs/maltasmvp_eva_defconfig   |  1 -
 arch/mips/configs/maltaup_defconfig |  1 -
 arch/mips/configs/maltaup_xpa_defconfig |  1 -
 arch/mips/configs/mtx1_defconfig|  1 -
 arch/mips/configs/pic32mzda_defconfig   |  1 -
 arch/mips/configs/qi_lb60_defconfig |  1 -
 arch/mips/configs/rb532_defconfig   |  1 -
 arch/mips/configs/rbtx49xx_defconfig|  1 -
 arch/mips/configs/rm200_defconfig   |  1 -
 arch/mips/configs/rs90_defconfig|  1 -
 arch/mips/configs/sb1250_swarm_defconfig|  1 -
 arch/nios2/configs/10m50_defconfig  |  1 -
 arch/nios2/configs/3c120_defconfig  |  1 -
 arch/parisc/configs/generic-32bit_defconfig |  1 -
 arch/powerpc/configs/40x/klondike_defconfig |  1 -
 arch/powerpc/configs/52xx/pcm030_defconfig  |  1 -
 arch/powerpc/configs/83xx

Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-10 Thread Vlastimil Babka
On 1/9/23 21:53, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore with an atomic variable.
> When a reader takes read lock, it increments the atomic unless the
> value is negative. If that fails read-locking is aborted and mmap_lock
> is used instead.
> When writer takes write lock, it resets atomic value to -1 if the
> current value is 0 (no readers). Since all writers take mmap_lock in
> write mode first, there can be only one writer at a time. If there
> are readers, writer will place itself into a wait queue using new
> mm_struct.vma_writer_wait waitqueue head. The last reader to release
> the vma_lock will signal the writer to wake up.
> vm_lock_seq is also moved into vma_lock and along with atomic_t they
> are nicely packed and consume 8 bytes, bringing the overhead from
> vma_lock from 44 to 16 bytes:
> 
> slabinfo before the changes:
>  ...: ...
> vm_area_struct...152   53    2 : ...
> 
> slabinfo with vma_lock:
>  ...: ...
> rw_semaphore  ...  8  512    1 : ...

I guess the cache is called vma_lock, not rw_semaphore?

> vm_area_struct...160   51    2 : ...
> 
> Assuming 40000 vm_area_structs, memory consumption would be:
> baseline: 6040kB
> vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> Total increase: 556kB
> 
> atomic_t might overflow if there are many competing readers, therefore
> vma_read_trylock() implements an overflow check and if that occurs it
> restores the previous value and exits with a failure to lock.
> 
> Signed-off-by: Suren Baghdasaryan 

This patch is indeed an interesting addition, but I can't help but
think it obsoletes the previous one :) We allocate an extra 8 bytes slab
object for the lock, and the pointer to it is also 8 bytes, and requires an
indirection. The vma_lock cache is not cacheline aligned (otherwise it would
be a major waste), so we have potential false sharing with up to 7 other
vma_lock's.
I'd expect if the vma_lock was placed with the relatively cold fields of
vm_area_struct, it shouldn't cause much cache ping pong when working with
that vma. Even if we don't cache align the vma to save memory (would be 192
bytes instead of 160 when aligned) and place the vma_lock and the cold
fields at the end of the vma, it may be false sharing the cacheline with the
next vma in the slab. But that's a single vma, not up to 7, so it shouldn't
be worse?
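
For reference, the read-trylock / write-lock scheme described in the quoted
changelog can be sketched in plain C11 roughly as follows (a mutex/condvar
pair stands in for the mm_struct.vma_writer_wait waitqueue; names and details
are made up, this is not the kernel code):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <limits.h>

    struct vma_lock_sketch {
        atomic_int count;          /* 0: free, >0: readers, -1: writer */
        pthread_mutex_t wait_lock; /* stand-in for the waitqueue */
        pthread_cond_t wait_cond;
    };

    /* Readers never block; on failure the caller falls back to mmap_lock. */
    static bool vma_read_trylock(struct vma_lock_sketch *vl)
    {
        int old = atomic_load(&vl->count);

        do {
            /* write-locked, or the counter would overflow */
            if (old < 0 || old == INT_MAX)
                return false;
        } while (!atomic_compare_exchange_strong(&vl->count, &old, old + 1));
        return true;
    }

    static void vma_read_unlock(struct vma_lock_sketch *vl)
    {
        /* the last reader wakes a writer waiting for count to reach 0 */
        if (atomic_fetch_sub(&vl->count, 1) == 1) {
            pthread_mutex_lock(&vl->wait_lock);
            pthread_cond_signal(&vl->wait_cond);
            pthread_mutex_unlock(&vl->wait_lock);
        }
    }

    /* Only one writer can get here because mmap_lock is held for write. */
    static void vma_write_lock(struct vma_lock_sketch *vl)
    {
        int expected = 0;

        pthread_mutex_lock(&vl->wait_lock);
        while (!atomic_compare_exchange_strong(&vl->count, &expected, -1)) {
            pthread_cond_wait(&vl->wait_cond, &vl->wait_lock);
            expected = 0;
        }
        pthread_mutex_unlock(&vl->wait_lock);
    }

    static void vma_write_unlock(struct vma_lock_sketch *vl)
    {
        atomic_store(&vl->count, 0);
    }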




Re: [PATCH mm-unstable v1 09/20] mm/gup: reliable R/O long-term pinning in COW mappings

2022-11-22 Thread Vlastimil Babka
On 11/16/22 11:26, David Hildenbrand wrote:
> We already support reliable R/O pinning of anonymous memory. However,
> assume we end up pinning (R/O long-term) a pagecache page or the shared
> zeropage inside a writable private ("COW") mapping. The next write access
> will trigger a write-fault and replace the pinned page by an exclusive
> anonymous page in the process page tables to break COW: the pinned page no
> longer corresponds to the page mapped into the process' page table.
> 
> Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a
> COW mapping, let's properly break COW first before R/O long-term
> pinning something that's not an exclusive anon page inside a COW
> mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page
> instead that can get pinned safely.
> 
> With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable
> R/O long-term pinning in COW mappings.
> 
> With this change, the new R/O long-term pinning tests for non-anonymous
> memory succeed:
>   # [RUN] R/O longterm GUP pin ... with shared zeropage
>   ok 151 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP pin ... with memfd
>   ok 152 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP pin ... with tmpfile
>   ok 153 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP pin ... with huge zeropage
>   ok 154 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
>   ok 155 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
>   ok 156 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP-fast pin ... with shared zeropage
>   ok 157 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP-fast pin ... with memfd
>   ok 158 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP-fast pin ... with tmpfile
>   ok 159 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP-fast pin ... with huge zeropage
>   ok 160 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
>   ok 161 Longterm R/O pin is reliable
>   # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
>   ok 162 Longterm R/O pin is reliable
> 
> Note 1: We don't care about short-term R/O-pinning, because they have
> snapshot semantics: they are not supposed to observe modifications that
> happen after pinning.
> 
> As one example, assume we start direct I/O to read from a page and store
> page content into a file: modifications to page content after starting
> direct I/O are not guaranteed to end up in the file. So even if we'd pin
> the shared zeropage, the end result would be as expected -- getting zeroes
> stored to the file.
> 
> Note 2: For shared mappings we'll now always fallback to the slow path to
> lookup the VMA when R/O long-term pining. While that's the necessary price
> we have to pay right now, it's actually not that bad in practice: most
> FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with
> FOLL_FORCE because they tried dealing with COW mappings correctly ...
> 
> Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE,
> such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd
> populate exclusive anon pages that we can pin. There was a concern that
> this could affect the memlock limit of existing setups.
> 
> For example, a VM running with VFIO could run into the memlock limit and
> fail to run. However, we essentially had the same behavior already in
> commit 17839856fd58 ("gup: document and work around "COW can break either
> way" issue") which got merged into some enterprise distros, and there were
> not any such complaints. So most probably, we're fine.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Vlastimil Babka 



Re: [PATCH mm-unstable v1 08/20] mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping

2022-11-22 Thread Vlastimil Babka
On 11/16/22 11:26, David Hildenbrand wrote:
> Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a
> COW (i.e., private writable) mapping and adjust the documentation
> accordingly.
> 
> FAULT_FLAG_UNSHARE will now also break COW when encountering the shared
> zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by
> properly replacing the mapped page/pfn by a private copy (an exclusive
> anonymous page).
> 
> Note that only do_wp_page() needs care: hugetlb_wp() already handles
> FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it
> correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE
> such that we can handle FAULT_FLAG_UNSHARE on the PTE level.
> 
> This change is a requirement for reliable long-term R/O pinning in
> COW mappings.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Vlastimil Babka 

> ---
>  include/linux/mm_types.h | 8 
>  mm/memory.c  | 4 
>  2 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5e7f4fac1e78..5e9aaad8c7b2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1037,9 +1037,9 @@ typedef struct {
>   * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
>   * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
>   * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
> - * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
> - *  exclusive) a possibly shared anonymous page that is
> - *  mapped R/O.
> + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to break COW in a
> + *  COW mapping, making sure that an exclusive anon page is
> + *  mapped after the fault.
>   * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>   *We should only access orig_pte if this flag set.
>   *
> @@ -1064,7 +1064,7 @@ typedef struct {
>   *
>   * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal.
>   * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when
> - * no existing R/O-mapped anonymous page is encountered.
> + * applied to mappings that are not COW mappings.
>   */
>  enum fault_flag {
>   FAULT_FLAG_WRITE =  1 << 0,
> diff --git a/mm/memory.c b/mm/memory.c
> index d47ad33c6487..56b21ab1e4d2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3432,10 +3432,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>   }
>   wp_page_reuse(vmf);
>   return 0;
> - } else if (unshare) {
> - /* No anonymous page -> nothing to do. */
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - return 0;
>   }
>  copy:
>   /*



Re: [PATCH mm-unstable v1 07/20] mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for private mappings

2022-11-22 Thread Vlastimil Babka
On 11/16/22 11:26, David Hildenbrand wrote:
> If we already have a PMD/PUD mapped write-protected in a private mapping
> and we want to break COW either due to FAULT_FLAG_WRITE or
> FAULT_FLAG_UNSHARE, there is no need to inform the file system just like on
> the PTE path.
> 
> Let's just split (->zap) + fallback in that case.
> 
> This is a preparation for more generic FAULT_FLAG_UNSHARE support in
> COW mappings.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Vlastimil Babka 

Nits:

> ---
>  mm/memory.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c35e6cd32b6a..d47ad33c6487 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4802,6 +4802,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>  {
>   const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> + vm_fault_t ret;
>  
>   if (vma_is_anonymous(vmf->vma)) {
>   if (likely(!unshare) &&
> @@ -4809,11 +4810,13 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>   return handle_userfault(vmf, VM_UFFD_WP);
>   return do_huge_pmd_wp_page(vmf);
>   }
> - if (vmf->vma->vm_ops->huge_fault) {
> - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
>  
> - if (!(ret & VM_FAULT_FALLBACK))
> - return ret;
> + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> + if (vmf->vma->vm_ops->huge_fault) {

I guess it could have been a single if with &&, and with the reduced
indentation the 'ret' declaration could have stayed inside.
AFAICS the later patches don't build more on top of this anyway.
But also fine keeping it as is.
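
i.e. roughly (just spelling the nit out, not a tested hunk):

    if ((vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
        vmf->vma->vm_ops->huge_fault) {
        vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);

        if (!(ret & VM_FAULT_FALLBACK))
            return ret;
    }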

(the hunk below same)

> + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
> + if (!(ret & VM_FAULT_FALLBACK))
> + return ret;
> + }
>   }
>  
>   /* COW or write-notify handled on pte level: split pmd. */
> @@ -4839,14 +4842,17 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>  {
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) &&  \
>   defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> + vm_fault_t ret;
> +
>   /* No support for anonymous transparent PUD pages yet */
>   if (vma_is_anonymous(vmf->vma))
>   goto split;
> - if (vmf->vma->vm_ops->huge_fault) {
> - vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
> -
> - if (!(ret & VM_FAULT_FALLBACK))
> - return ret;
> + if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> + if (vmf->vma->vm_ops->huge_fault) {
> + ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
> + if (!(ret & VM_FAULT_FALLBACK))
> + return ret;
> + }
>   }
>  split:
>   /* COW or write-notify not handled on PUD level: split pud.*/



Re: [PATCH mm-unstable v1 06/20] mm: rework handling in do_wp_page() based on private vs. shared mappings

2022-11-22 Thread Vlastimil Babka
On 11/16/22 11:26, David Hildenbrand wrote:
> We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a
> COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages.
> Let's prepare for that by handling shared mappings first such that we can
> handle private mappings last.
> 
> While at it, use folio-based functions instead of page-based functions
> where we touch the code either way.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Vlastimil Babka 



Re: [PATCH mm-unstable v1 05/20] mm: add early FAULT_FLAG_WRITE consistency checks

2022-11-18 Thread Vlastimil Babka
On 11/16/22 11:26, David Hildenbrand wrote:
> Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
> care in all other handlers and might get "surprises" if we forget to do
> so.
> 
> Write faults without VM_MAYWRITE don't make any sense, and our
> maybe_mkwrite() logic could have hidden such abuse for now.
> 
> Write faults without VM_WRITE on something that is not a COW mapping is
> similarly broken, and e.g., do_wp_page() could end up placing an
> anonymous page into a shared mapping, which would be bad.
> 
> This is a preparation for reliable R/O long-term pinning of pages in
> private mappings, whereby we want to make sure that we will never break
> COW in a read-only private mapping.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Vlastimil Babka 

> ---
>  mm/memory.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e014435a87db..c4fa378ec2a0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5170,6 +5170,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
>*/
>   if (!is_cow_mapping(vma->vm_flags))
>   *flags &= ~FAULT_FLAG_UNSHARE;
> + } else if (*flags & FAULT_FLAG_WRITE) {
> + /* Write faults on read-only mappings are impossible ... */
> + if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
> + return VM_FAULT_SIGSEGV;
> + /* ... and FOLL_FORCE only applies to COW mappings. */
> + if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
> +  !is_cow_mapping(vma->vm_flags)))
> + return VM_FAULT_SIGSEGV;
>   }
>   return 0;
>  }



Re: [PATCH mm-unstable v1 04/20] mm: add early FAULT_FLAG_UNSHARE consistency checks

2022-11-18 Thread Vlastimil Babka
On 11/16/22 11:26, David Hildenbrand wrote:
> For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
> implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
> dealing with a COW mapping, such that we treat it like a read fault as
> documented and don't have to worry about the flag throughout all fault
> handlers.
> 
> While at it, centralize the check for mutual exclusion of
> FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
> either flag is set in the WP handler.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/huge_memory.c |  3 ---
>  mm/hugetlb.c |  5 -
>  mm/memory.c  | 23 ---
>  3 files changed, 20 insertions(+), 11 deletions(-)

Reviewed-by: Vlastimil Babka 



Re: [PATCH mm-unstable v1 01/20] selftests/vm: anon_cow: prepare for non-anonymous COW tests

2022-11-18 Thread Vlastimil Babka
On 11/16/22 11:26, David Hildenbrand wrote:
> Originally, the plan was to have a separate tests for testing COW of
> non-anonymous (e.g., shared zeropage) pages.
> 
> Turns out, that we'd need a lot of similar functionality and that there
> isn't a really good reason to separate it. So let's prepare for non-anon
> tests by renaming to "cow".
> 
> Signed-off-by: David Hildenbrand 

Acked-by: Vlastimil Babka 



Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal

2022-09-29 Thread Vlastimil Babka
On 9/28/22 04:28, Suren Baghdasaryan wrote:
> On Sun, Sep 11, 2022 at 2:35 AM Vlastimil Babka  wrote:
>>
>> On 9/2/22 01:26, Suren Baghdasaryan wrote:
>> >
>> >>
>> >> Two complaints so far:
>> >>  - I don't like the vma_mark_locked() name. To me it says that the caller
>> >>already took or is taking the lock and this function is just marking 
>> >> that
>> >>we're holding the lock, but it's really taking a different type of 
>> >> lock. But
>> >>this function can block, it really is taking a lock, so it should say 
>> >> that.
>> >>
>> >>This is AFAIK a new concept, not sure I'm going to have anything good 
>> >> either,
>> >>but perhaps vma_lock_multiple()?
>> >
>> > I'm open to name suggestions but vma_lock_multiple() is a bit
>> > confusing to me. Will wait for more suggestions.
>>
>> Well, it does act like a vma_write_lock(), no? So why not that name. The
>> checking function for it is even called vma_assert_write_locked().
>>
>> We just don't provide a single vma_write_unlock(), but a
>> vma_mark_unlocked_all(), that could be instead named e.g.
>> vma_write_unlock_all().
>> But it's called on a mm, so maybe e.g. mm_vma_write_unlock_all()?
> 
> Thank you for your suggestions, Vlastimil! vma_write_lock() sounds
> good to me. For vma_mark_unlocked_all() replacement, I would prefer
> vma_write_unlock_all() which keeps the vma_write_XXX naming pattern to

OK.

> indicate that these are operating on the same locks. If the fact that
> it accepts mm_struct as a parameter is an issue then maybe
> vma_write_unlock_mm() ?

Sounds good!

>>
>>



Re: [RFC PATCH RESEND 00/28] per-VMA locks proposal

2022-09-11 Thread Vlastimil Babka
On 9/2/22 01:26, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
>  wrote:
>>
>> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
>> > Resending to fix the issue with the In-Reply-To tag in the original
>> > submission at [4].
>> >
>> > This is a proof of concept for per-vma locks idea that was discussed
>> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
>> > suggestion that “a reader/writer semaphore could be put into the VMA
>> > itself; that would have the effect of using the VMA as a sort of range
>> > lock. There would still be contention at the VMA level, but it would be an
>> > improvement.” This patchset implements this suggested approach.
>> >
>> > When handling page faults we lookup the VMA that contains the faulting
>> > page under RCU protection and try to acquire its lock. If that fails we
>> > fall back to using mmap_lock, similar to how SPF handled this situation.
>> >
>> > One notable way the implementation deviates from the proposal is the way
>> > VMAs are marked as locked. Because during some of mm updates multiple
>> > VMAs need to be locked until the end of the update (e.g. vma_merge,
>> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
>> > and other complications would make the code more complex. Therefore we
>> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
>> > all at once. This is done using two sequence numbers - one in the
>> > vm_area_struct and one in the mm_struct. VMA is considered locked when
>> > these sequence numbers are equal. To mark a VMA as locked we set the
>> > sequence number in vm_area_struct to be equal to the sequence number
>> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
>> > This allows for an efficient way to track locked VMAs and to drop the
>> > locks on all VMAs at the end of the update.
>>
>> I like it - the sequence numbers are a stroke of genuius. For what it's doing
>> the patchset seems almost small.
> 
> Thanks for reviewing it!
> 
>>
>> Two complaints so far:
>>  - I don't like the vma_mark_locked() name. To me it says that the caller
>>already took or is taking the lock and this function is just marking that
>>we're holding the lock, but it's really taking a different type of lock. 
>> But
>>this function can block, it really is taking a lock, so it should say 
>> that.
>>
>>This is AFAIK a new concept, not sure I'm going to have anything good 
>> either,
>>but perhaps vma_lock_multiple()?
> 
> I'm open to name suggestions but vma_lock_multiple() is a bit
> confusing to me. Will wait for more suggestions.

Well, it does act like a vma_write_lock(), no? So why not that name. The
checking function for it is even called vma_assert_write_locked().

We just don't provide a single vma_write_unlock(), but a
vma_mark_unlocked_all(), that could be instead named e.g.
vma_write_unlock_all().
But it's called on a mm, so maybe e.g. mm_vma_write_unlock_all()?
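
For reference, the sequence-number marking described in the quoted cover
letter boils down to roughly this (simplified sketch with made-up names, not
the actual patches):

    #include <stdbool.h>

    struct mm_sketch {
        int mm_lock_seq;  /* only changed with mmap_lock held for write */
    };

    struct vma_sketch {
        struct mm_sketch *mm;
        int vm_lock_seq;
    };

    /* "Write-lock" one VMA: copy the current mm sequence number into it. */
    static void vma_write_lock(struct vma_sketch *vma)
    {
        vma->vm_lock_seq = vma->mm->mm_lock_seq;
    }

    /* A VMA counts as write-locked while the two sequence numbers match. */
    static bool vma_is_write_locked(struct vma_sketch *vma)
    {
        return vma->vm_lock_seq == vma->mm->mm_lock_seq;
    }

    /* Unlock every marked VMA at once by bumping the mm sequence number. */
    static void mm_vma_write_unlock_all(struct mm_sketch *mm)
    {
        mm->mm_lock_seq++;
    }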




Re: [PATCH v2 2/8] mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-04-20 Thread Vlastimil Babka
On 3/29/22 18:43, David Hildenbrand wrote:
> Let's test that __HAVE_ARCH_PTE_SWP_EXCLUSIVE works as expected.
> 
> Signed-off-by: David Hildenbrand 

Acked-by: Vlastimil Babka 

> ---
>  mm/debug_vm_pgtable.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index db2abd9e415b..55f1a8dc716f 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -837,6 +837,19 @@ static void __init pmd_soft_dirty_tests(struct pgtable_debug_args *args) { }
>  static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args *args) { }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static void __init pte_swap_exclusive_tests(struct pgtable_debug_args *args)
> +{
> +#ifdef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
> +
> + pr_debug("Validating PTE swap exclusive\n");
> + pte = pte_swp_mkexclusive(pte);
> + WARN_ON(!pte_swp_exclusive(pte));

I guess only this WARN_ON must be guarded by the #ifdef, but doesn't matter
that much - won't gain significantly more test coverage.
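
Roughly, assuming the generic no-op stubs added earlier in the series, that
would look like:

    pr_debug("Validating PTE swap exclusive\n");
    pte = pte_swp_mkexclusive(pte);
    #ifdef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
        /* the generic stub always returns false, so only guard this check */
        WARN_ON(!pte_swp_exclusive(pte));
    #endif
    pte = pte_swp_clear_exclusive(pte);
    WARN_ON(pte_swp_exclusive(pte));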

> + pte = pte_swp_clear_exclusive(pte);
> + WARN_ON(pte_swp_exclusive(pte));
> +#endif /* __HAVE_ARCH_PTE_SWP_EXCLUSIVE */
> +}
> +
>  static void __init pte_swap_tests(struct pgtable_debug_args *args)
>  {
>   swp_entry_t swp;
> @@ -1288,6 +1301,8 @@ static int __init debug_vm_pgtable(void)
>   pte_swap_soft_dirty_tests(&args);
>   pmd_swap_soft_dirty_tests(&args);
>  
> + pte_swap_exclusive_tests(&args);
> +
>   pte_swap_tests(&args);
>   pmd_swap_tests(&args);
>  



Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit

2022-04-20 Thread Vlastimil Babka
On 3/29/22 18:43, David Hildenbrand wrote:
> Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
> it. We do this, to keep fork() logic on swap entries easy and efficient:
> for example, if we wouldn't clear it when unmapping, we'd have to lookup
> the page in the swapcache for each and every swap entry during fork() and
> clear PG_anon_exclusive if set.
> 
> Instead, we want to store that information directly in the swap pte,
> protected by the page table lock, similarly to how we handle
> SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual
> swap entries, we don't want to mess with the swap type (e.g., still one
> bit) because it overcomplicates swap code.
> 
> In try_to_unmap(), we already reject to unmap in case the page might be
> pinned, because we must not lose PG_anon_exclusive on pinned pages ever.
> Checking if there are other unexpected references reliably *before*
> completely unmapping a page is unfortunately not really possible: THP
> heavily overcomplicate the situation. Once fully unmapped it's easier --
> we, for example, make sure that there are no unexpected references
> *after* unmapping a page before starting writeback on that page.
> 
> So, we currently might end up unmapping a page and clearing
> PG_anon_exclusive if that page has additional references, for example,
> due to a FOLL_GET.
> 
> do_swap_page() has to re-determine if a page is exclusive, which will
> easily fail if there are other references on a page, most prominently
> GUP references via FOLL_GET. This can currently result in memory
> corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even
> when fork() is never involved: try_to_unmap() will succeed, and when
> refaulting the page, it cannot be marked exclusive and will get replaced
> by a copy in the page tables on the next write access, resulting in writes
> via the GUP reference to the page being lost.
> 
> In an ideal world, everybody that uses GUP and wants to modify page
> content, such as O_DIRECT, would properly use FOLL_PIN. However, that
> conversion will take a while. It's easier to fix what used to work in the
> past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive. In addition,
> by remembering PG_anon_exclusive we can further reduce unnecessary COW
> in some cases, so it's the natural thing to do.
> 
> So let's transfer the PG_anon_exclusive information to the swap pte and
> store it via an architecture-dependent pte bit; use that information when
> restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we
> simply have to clear the pte bit and are done.
> 
> Of course, there is one corner case to handle: swap backends that don't
> support concurrent page modifications while the page is under writeback.
> Special case these, and drop the exclusive marker. Add a comment why that
> is just fine (also, reuse_swap_page() would have done the same in the
> past).
> 
> In the future, we'll hopefully have all architectures support
> __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty
> stubs and the define completely. Then, we can also convert
> SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to
> support: either simply use a yet unused pte bit that can be used for swap
> entries, steal one from the arch type bits if they exceed 5, or steal one
> from the offset bits.
> 
> Note: R/O FOLL_GET references were never really reliable, especially
> when taking one on a shared page and then writing to the page (e.g., GUP
> after fork()). FOLL_GET, including R/W references, were never really
> reliable once fork was involved (e.g., GUP before fork(),
> GUP during fork()). KSM steps back in case it stumbles over unexpected
> references and is, therefore, fine.
> 
> Signed-off-by: David Hildenbrand 

With the fixup as reported by Miaohe Lin

Acked-by: Vlastimil Babka 

(sent a separate mm-commits mail to inquire about the fix going missing from
mmotm)

https://lore.kernel.org/mm-commits/c3195d8a-2931-0749-973a-1d04e4bae...@suse.cz/T/#m4e98ccae6f747e11f45e4d0726427ba2fef740eb



Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-30 Thread Vlastimil Babka
On 11/29/21 23:08, Zi Yan wrote:
> On 23 Nov 2021, at 12:32, Vlastimil Babka wrote:
> 
>> On 11/23/21 17:35, Zi Yan wrote:
>>> On 19 Nov 2021, at 10:15, Zi Yan wrote:
>>>>>> From what my understanding, cma required alignment of
>>>>>> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was 
>>>>>> introduced,
>>>>>> __free_one_page() does not prevent merging two different pageblocks, when
>>>>>> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() 
>>>>>> implementation
>>>>>> does prevent that.
>>>>>
>>>>> But it does prevent that only for isolated pageblock, not CMA, and yout
>>>>> patchset doesn't seem to expand that to CMA? Or am I missing something.
>>>>
>>>> Yeah, you are right. Originally, I thought preventing merging isolated 
>>>> pageblock
>>>> with other types of pageblocks is sufficient, since MIGRATE_CMA is always
>>>> converted from MIGRATE_ISOLATE. But that is not true. I will rework the 
>>>> code.
>>>> Thanks for pointing this out.
>>>>
>>>
>>> I find that two pageblocks with different migratetypes, like 
>>> MIGRATE_RECLAIMABLE
>>> and MIGRATE_MOVABLE can be merged into a single free page after I checked
>>> __free_one_page() in detail and printed pageblock information during buddy 
>>> page
>>> merging.
>>
>> Yes, that can happen.
>>
>> I am not sure what consequence it will cause. Do you have any idea?
>>
>> For MIGRATE_RECLAIMABLE or MIGRATE_MOVABLE or even MIGRATE_UNMOVABLE it's
>> absolutely fine. As long as these pageblocks are fully free (and they are if
>> it's a single free page spanning 2 pageblocks), they can be of any of these
>> type, as they can be reused as needed without causing fragmentation.
>>
>> But in case of MIGRATE_CMA and MIGRATE_ISOLATE, uncontrolled merging would
>> break the specifics of those types. That's why the code is careful for
>> MIGRATE_ISOLATE, and MIGRATE_CMA was until now done in MAX_ORDER granularity.
> 
> Thanks for the explanation. Basically migratetypes that can fall back to each
> other can be merged into a single free page, right?

Yes.

> How about MIGRATE_HIGHATOMIC? It should not be merged with other migratetypes
> from my understanding.

Hmm it shouldn't, minimally because it has an accounting that would become
broken. So it should prevent merging or make sure the reservations are with
MAX_ORDER granularity, but seems that neither is true? CCing Mel.

> --
> Best Regards,
> Yan, Zi
> 



Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-23 Thread Vlastimil Babka
On 11/23/21 17:35, Zi Yan wrote:
> On 19 Nov 2021, at 10:15, Zi Yan wrote:
 From what my understanding, cma required alignment of
 max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was 
 introduced,
 __free_one_page() does not prevent merging two different pageblocks, when
 MAX_ORDER - 1 > pageblock_order. But current __free_one_page() 
 implementation
 does prevent that.
>>>
>>> But it does prevent that only for isolated pageblock, not CMA, and yout
>>> patchset doesn't seem to expand that to CMA? Or am I missing something.
>>
>> Yeah, you are right. Originally, I thought preventing merging isolated 
>> pageblock
>> with other types of pageblocks is sufficient, since MIGRATE_CMA is always
>> converted from MIGRATE_ISOLATE. But that is not true. I will rework the code.
>> Thanks for pointing this out.
>>
> 
> I find that two pageblocks with different migratetypes, like 
> MIGRATE_RECLAIMABLE
> and MIGRATE_MOVABLE can be merged into a single free page after I checked
> __free_one_page() in detail and printed pageblock information during buddy 
> page
> merging.

Yes, that can happen.

I am not sure what consequence it will cause. Do you have any idea?

For MIGRATE_RECLAIMABLE or MIGRATE_MOVABLE or even MIGRATE_UNMOVABLE it's
absolutely fine. As long as these pageblocks are fully free (and they are if
it's a single free page spanning 2 pageblocks), they can be of any of these
type, as they can be reused as needed without causing fragmentation.

But in case of MIGRATE_CMA and MIGRATE_ISOLATE, uncontrolled merging would
break the specifics of those types. That's why the code is careful for
MIGRATE_ISOLATE, and MIGRATE_CMA was until now done in MAX_ORDER granularity.
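For readers following the thread, a conceptual sketch of the merging rule
described above -- this is not how __free_one_page() is actually written, just
the invariant in code form (is_migrate_isolate()/is_migrate_cma() are existing
helpers; the function name is made up):

/* Conceptual sketch only, not the in-tree implementation. */
static bool sketch_migratetypes_can_merge(int mt_a, int mt_b)
{
	if (mt_a == mt_b)
		return true;
	/* isolation has accounting/semantics tied to the pageblock */
	if (is_migrate_isolate(mt_a) || is_migrate_isolate(mt_b))
		return false;
	/* CMA areas must stay intact for contiguous allocations */
	if (is_migrate_cma(mt_a) || is_migrate_cma(mt_b))
		return false;
	/*
	 * UNMOVABLE/RECLAIMABLE/MOVABLE fall back to each other, so merging
	 * fully free pageblocks of these types is harmless; HIGHATOMIC is the
	 * open question raised elsewhere in this thread.
	 */
	return true;
}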


Re: [RFC PATCH 0/3] Use pageblock_order for cma and alloc_contig_range alignment.

2021-11-19 Thread Vlastimil Babka
On 11/15/21 20:37, Zi Yan wrote:
> From: Zi Yan 
> 
> Hi David,
> 
> You suggested to make alloc_contig_range() deal with pageblock_order instead 
> of
> MAX_ORDER - 1 and get rid of MAX_ORDER - 1 dependency in virtio_mem[1]. This
> patchset is my attempt to achieve that. Please take a look and let me know if
> I am doing it correctly or not.
> 
> From what my understanding, cma required alignment of
> max(MAX_ORDER - 1, pageblock_order), because when MIGRATE_CMA was introduced,
> __free_one_page() does not prevent merging two different pageblocks, when
> MAX_ORDER - 1 > pageblock_order. But current __free_one_page() implementation
> does prevent that.

But it does prevent that only for isolated pageblock, not CMA, and your
patchset doesn't seem to expand that to CMA? Or am I missing something.


> It should be OK to just align cma to pageblock_order.
> alloc_contig_range() relies on MIGRATE_CMA to get free pages, so it can use
> pageblock_order as alignment too.
> 
> In terms of virtio_mem, if I understand correctly, it relies on
> alloc_contig_range() to obtain contiguous free pages and offlines them to 
> reduce
> guest memory size. As the result of alloc_contig_range() alignment change,
> virtio_mem should be able to just align PFNs to pageblock_order.
> 
> Thanks.
> 
> 
> [1] 
> https://lore.kernel.org/linux-mm/28b57903-fae6-47ac-7e1b-a1dd41421...@redhat.com/
> 
> Zi Yan (3):
>   mm: cma: alloc_contig_range: use pageblock_order as the single
> alignment.
>   drivers: virtio_mem: use pageblock size as the minimum virtio_mem
> size.
>   arch: powerpc: adjust fadump alignment to be pageblock aligned.
> 
>  arch/powerpc/include/asm/fadump-internal.h |  4 +---
>  drivers/virtio/virtio_mem.c|  6 ++
>  include/linux/mmzone.h |  5 +
>  kernel/dma/contiguous.c|  2 +-
>  mm/cma.c   |  6 ++
>  mm/page_alloc.c| 12 +---
>  6 files changed, 12 insertions(+), 23 deletions(-)
> 



Re: [PATCH v5 1/5] mm: introduce debug_pagealloc_{map, unmap}_pages() helpers

2020-11-09 Thread Vlastimil Babka

On 11/8/20 7:57 AM, Mike Rapoport wrote:

--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1428,21 +1428,19 @@ static bool is_debug_pagealloc_cache(struct kmem_cache 
*cachep)
return false;
  }
  
-#ifdef CONFIG_DEBUG_PAGEALLOC

  static void slab_kernel_map(struct kmem_cache *cachep, void *objp, int map)
  {
if (!is_debug_pagealloc_cache(cachep))
return;


Hmm, I didn't notice earlier, sorry.
The is_debug_pagealloc_cache() above includes a debug_pagealloc_enabled_static() 
check, so it should be fine to use
__kernel_map_pages() directly below. Otherwise we generate two static key checks 
for the same key needlessly.


  
-	kernel_map_pages(virt_to_page(objp), cachep->size / PAGE_SIZE, map);

+   if (map)
+   debug_pagealloc_map_pages(virt_to_page(objp),
+ cachep->size / PAGE_SIZE);
+   else
+   debug_pagealloc_unmap_pages(virt_to_page(objp),
+   cachep->size / PAGE_SIZE);
  }
  
-#else

-static inline void slab_kernel_map(struct kmem_cache *cachep, void *objp,
-   int map) {}
-
-#endif
-
  static void poison_obj(struct kmem_cache *cachep, void *addr, unsigned char 
val)
  {
int size = cachep->object_size;
@@ -2062,7 +2060,7 @@ int __kmem_cache_create(struct kmem_cache *cachep, 
slab_flags_t flags)
  
  #if DEBUG

/*
-* If we're going to use the generic kernel_map_pages()
+* If we're going to use the generic debug_pagealloc_map_pages()
 * poisoning, then it's going to smash the contents of
 * the redzone and userword anyhow, so switch them off.
 */





Re: [PATCH v4 3/4] arch, mm: restore dependency of __kernel_map_pages() of DEBUG_PAGEALLOC

2020-11-04 Thread Vlastimil Babka

On 11/3/20 5:20 PM, Mike Rapoport wrote:

From: Mike Rapoport 


Subject should have "on DEBUG_PAGEALLOC" ?


The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must never
fail. With this assumption it wouldn't be safe to allow general usage of
this function.

Moreover, some architectures that implement __kernel_map_pages() have this
function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
pages when page allocation debugging is disabled at runtime.

As all the users of __kernel_map_pages() were converted to use
debug_pagealloc_map_pages() it is safe to make it available only when
DEBUG_PAGEALLOC is set.

Signed-off-by: Mike Rapoport 
Acked-by: David Hildenbrand 
Acked-by: Kirill A. Shutemov 
---
  arch/Kconfig |  3 +++
  arch/arm64/Kconfig   |  4 +---
  arch/arm64/mm/pageattr.c |  8 ++--
  arch/powerpc/Kconfig |  5 +
  arch/riscv/Kconfig   |  4 +---
  arch/riscv/include/asm/pgtable.h |  2 --
  arch/riscv/mm/pageattr.c |  2 ++
  arch/s390/Kconfig|  4 +---
  arch/sparc/Kconfig   |  4 +---
  arch/x86/Kconfig |  4 +---
  arch/x86/mm/pat/set_memory.c |  2 ++
  include/linux/mm.h   | 10 +++---
  12 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..56d4752b6db6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1028,6 +1028,9 @@ config HAVE_STATIC_CALL_INLINE
bool
depends on HAVE_STATIC_CALL
  
+config ARCH_SUPPORTS_DEBUG_PAGEALLOC

+   bool
+
  source "kernel/gcov/Kconfig"
  
  source "scripts/gcc-plugins/Kconfig"

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1d466addb078..a932810cfd90 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -71,6 +71,7 @@ config ARM64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_USE_SYM_ANNOTATIONS
+   select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_MEMORY_FAILURE
select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_ATOMIC_RMW
@@ -1025,9 +1026,6 @@ config HOLES_IN_ZONE
  
  source "kernel/Kconfig.hz"
  
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC

-   def_bool y
-
  config ARCH_SPARSEMEM_ENABLE
def_bool y
select SPARSEMEM_VMEMMAP_ENABLE
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 1b94f5b82654..439325532be1 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -155,7 +155,7 @@ int set_direct_map_invalid_noflush(struct page *page)
.clear_mask = __pgprot(PTE_VALID),
};
  
-	if (!rodata_full)

+   if (!debug_pagealloc_enabled() && !rodata_full)
return 0;
  
	return apply_to_page_range(&init_mm,

@@ -170,7 +170,7 @@ int set_direct_map_default_noflush(struct page *page)
.clear_mask = __pgprot(PTE_RDONLY),
};
  
-	if (!rodata_full)

+   if (!debug_pagealloc_enabled() && !rodata_full)
return 0;
  
	return apply_to_page_range(&init_mm,


I don't understand these two hunks. Previous patch calls this for hibernation 
when CONFIG_ARCH_HAS_SET_DIRECT_MAP, which is true for arm64. Why suddenly this 
starts to depend on debug_pagealloc_enabled()?


Re: [PATCH v4 2/4] PM: hibernate: make direct map manipulations more explicit

2020-11-04 Thread Vlastimil Babka

On 11/3/20 5:20 PM, Mike Rapoport wrote:

From: Mike Rapoport 

When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled a page may be
not present in the direct map and has to be explicitly mapped before it
could be copied.

Introduce hibernate_map_page() that will explicitly use
set_direct_map_{default,invalid}_noflush() for ARCH_HAS_SET_DIRECT_MAP case
and debug_pagealloc_map_pages() for DEBUG_PAGEALLOC case.

The remapping of the pages in safe_copy_page() presumes that it only
changes protection bits in an existing PTE and so it is safe to ignore
return value of set_direct_map_{default,invalid}_noflush().

Still, add a pr_warn() so that future changes in set_memory APIs will not
silently break hibernation.

Signed-off-by: Mike Rapoport 
Acked-by: Rafael J. Wysocki 
Reviewed-by: David Hildenbrand 
Acked-by: Kirill A. Shutemov 


Acked-by: Vlastimil Babka 

The bool param is a bit more acceptable here, being a private API. But if 
debug_pagealloc_map_pages() becomes split, then it might be easier to split this 
one too...
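A sketch of how splitting could look here as well (the internal helper name
__hibernate_map_page() is made up for the sketch; only the entry points matter):

/* sketch: split entry points, keeping the existing logic in one helper */
static void __hibernate_map_page(struct page *page, bool enable);

static inline void hibernate_map_page(struct page *page)
{
	__hibernate_map_page(page, true);
}

static inline void hibernate_unmap_page(struct page *page)
{
	__hibernate_map_page(page, false);
}

With that, safe_copy_page() would read hibernate_map_page(s_page); ...
hibernate_unmap_page(s_page); instead of passing 1/0.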



---
  include/linux/mm.h  | 12 
  kernel/power/snapshot.c | 32 ++--
  2 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fc0609056dc..14e397f3752c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2927,16 +2927,6 @@ static inline bool debug_pagealloc_enabled_static(void)
  #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
  extern void __kernel_map_pages(struct page *page, int numpages, int enable);
  
-/*

- * When called in DEBUG_PAGEALLOC context, the call should most likely be
- * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
- */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable)
-{
-   __kernel_map_pages(page, numpages, enable);
-}
-
  static inline void debug_pagealloc_map_pages(struct page *page,
 int numpages, int enable)
  {
@@ -2948,8 +2938,6 @@ static inline void debug_pagealloc_map_pages(struct page 
*page,
  extern bool kernel_page_present(struct page *page);
  #endif/* CONFIG_HIBERNATION */
  #else /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable) {}
  static inline void debug_pagealloc_map_pages(struct page *page,
 int numpages, int enable) {}
  #ifdef CONFIG_HIBERNATION
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 46b1804c1ddf..57d54b9d84bb 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -76,6 +76,34 @@ static inline void hibernate_restore_protect_page(void 
*page_address) {}
  static inline void hibernate_restore_unprotect_page(void *page_address) {}
  #endif /* CONFIG_STRICT_KERNEL_RWX  && CONFIG_ARCH_HAS_SET_MEMORY */
  
+static inline void hibernate_map_page(struct page *page, int enable)

+{
+   if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
+   unsigned long addr = (unsigned long)page_address(page);
+   int ret;
+
+   /*
+* This should not fail because remapping a page here means
+* that we only update protection bits in an existing PTE.
+* It is still worth to have a warning here if something
+* changes and this will no longer be the case.
+*/
+   if (enable)
+   ret = set_direct_map_default_noflush(page);
+   else
+   ret = set_direct_map_invalid_noflush(page);
+
+   if (ret) {
+   pr_warn_once("Failed to remap page\n");
+   return;
+   }
+
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+   } else {
+   debug_pagealloc_map_pages(page, 1, enable);
+   }
+}
+
  static int swsusp_page_is_free(struct page *);
  static void swsusp_set_page_forbidden(struct page *);
  static void swsusp_unset_page_forbidden(struct page *);
@@ -1355,9 +1383,9 @@ static void safe_copy_page(void *dst, struct page *s_page)
if (kernel_page_present(s_page)) {
do_copy_page(dst, page_address(s_page));
} else {
-   kernel_map_pages(s_page, 1, 1);
+   hibernate_map_page(s_page, 1);
do_copy_page(dst, page_address(s_page));
-   kernel_map_pages(s_page, 1, 0);
+   hibernate_map_page(s_page, 0);
}
  }
  





Re: [PATCH v4 1/4] mm: introduce debug_pagealloc_map_pages() helper

2020-11-04 Thread Vlastimil Babka

On 11/3/20 5:20 PM, Mike Rapoport wrote:

From: Mike Rapoport 

When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
direct mapping after free_pages(). The pages then need to be mapped back
before they can be used. These mapping operations use
__kernel_map_pages() guarded with debug_pagealloc_enabled().

The only place that calls __kernel_map_pages() without checking whether
DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.
Still, on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is
not enabled but set_direct_map_invalid_noflush() may render some pages not
present in the direct map and hibernation code won't be able to save such
pages.

To make page allocation debugging and hibernation interaction more robust,
the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be made
more explicit.

Start with combining the guard condition and the call to
__kernel_map_pages() into a single debug_pagealloc_map_pages() function to
emphasize that __kernel_map_pages() should not be called without
DEBUG_PAGEALLOC and use this new function to map/unmap pages when page
allocation debug is enabled.

Signed-off-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
Acked-by: Kirill A. Shutemov 


Acked-by: Vlastimil Babka 

But, the "enable" param is hideous. I would rather have map and unmap variants 
(and just did the same split for page poisoning) and this seems to be a good 
opportunity. If David didn't propose it already, I'm surprised ;)
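For reference, a sketch of what such a split could look like; the v5 of the
series (reviewed above in this archive) indeed ends up with a
debug_pagealloc_map_pages()/debug_pagealloc_unmap_pages() pair of this shape:

static inline void debug_pagealloc_map_pages(struct page *page, int numpages)
{
	if (debug_pagealloc_enabled_static())
		__kernel_map_pages(page, numpages, 1);
}

static inline void debug_pagealloc_unmap_pages(struct page *page, int numpages)
{
	if (debug_pagealloc_enabled_static())
		__kernel_map_pages(page, numpages, 0);
}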



---
  include/linux/mm.h  | 10 ++
  mm/memory_hotplug.c |  3 +--
  mm/page_alloc.c |  6 ++
  mm/slab.c   |  8 +++-
  4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..1fc0609056dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2936,12 +2936,22 @@ kernel_map_pages(struct page *page, int numpages, int 
enable)
  {
__kernel_map_pages(page, numpages, enable);
  }
+
+static inline void debug_pagealloc_map_pages(struct page *page,
+int numpages, int enable)
+{
+   if (debug_pagealloc_enabled_static())
+   __kernel_map_pages(page, numpages, enable);
+}
+
  #ifdef CONFIG_HIBERNATION
  extern bool kernel_page_present(struct page *page);
  #endif/* CONFIG_HIBERNATION */
  #else /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
  static inline void
  kernel_map_pages(struct page *page, int numpages, int enable) {}
+static inline void debug_pagealloc_map_pages(struct page *page,
+int numpages, int enable) {}
  #ifdef CONFIG_HIBERNATION
  static inline bool kernel_page_present(struct page *page) { return true; }
  #endif/* CONFIG_HIBERNATION */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b44d4c7ba73b..e2b6043a4428 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -614,8 +614,7 @@ void generic_online_page(struct page *page, unsigned int 
order)
 * so we should map it first. This is better than introducing a special
 * case in page freeing fast path.
 */
-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 1);
+   debug_pagealloc_map_pages(page, 1 << order, 1);
__free_pages_core(page, order);
totalram_pages_add(1UL << order);
  #ifdef CONFIG_HIGHMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..9a66a1ff9193 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1272,8 +1272,7 @@ static __always_inline bool free_pages_prepare(struct 
page *page,
 */
arch_free_page(page, order);
  
-	if (debug_pagealloc_enabled_static())

-   kernel_map_pages(page, 1 << order, 0);
+   debug_pagealloc_map_pages(page, 1 << order, 0);
  
  	kasan_free_nondeferred_pages(page, order);
  
@@ -2270,8 +2269,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,

set_page_refcounted(page);
  
  	arch_alloc_page(page, order);

-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 1);
+   debug_pagealloc_map_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
kernel_poison_pages(page, 1 << order, 1);
set_page_owner(page, order, gfp_flags);
diff --git a/mm/slab.c b/mm/slab.c
index b1113561b98b..340db0ce74c4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1431,10 +1431,8 @@ static bool is_debug_pagealloc_cache(struct kmem_cache 
*cachep)
  #ifdef CONFIG_DEBUG_PAGEALLOC
  static void slab_kernel_map(struct kmem_cache *cachep, void *objp, int map)
  {
-   if (!is_debug_pagealloc_cache(cachep))
-   return;
-
-   kernel_map_pages(virt_to_page(objp), cachep->size / PAGE_SI

Re: mm: Question about the use of 'accessed' flags and pte_young() helper

2020-10-20 Thread Vlastimil Babka

On 10/8/20 11:49 AM, Christophe Leroy wrote:

In a 10-year-old commit
(https://github.com/linuxppc/linux/commit/d069cb4373fe0d451357c4d3769623a7564dfa9f),
powerpc 8xx has made the handling of the PTE accessed bit conditional on
CONFIG_SWAP.
Since then, this has been extended to some other powerpc variants.

That commit means that when CONFIG_SWAP is not selected, the accessed bit is
not set by SW TLB miss handlers, leading to pte_young() returning garbage, or
should I say possibly returning false although a page has been accessed since
its access flag was reset.

Looking at various mm/ places, pte_young() is used independent of CONFIG_SWAP

Is it still valid to not manage accessed flags when CONFIG_SWAP is not
selected?


AFAIK it's wrong, reclaim needs it to detect accessed pages on inactive list, 
via page_referenced(), including file pages (page cache) where CONFIG_SWAP plays 
no role. Maybe it was different 10 years ago.



If yes, should pte_young() always return true in that case ?


Ideally it should work as intended. If that's not possible, true is maybe
better, as false will lead to inactive file list thrashing.



While we are at it, I'm wondering whether powerpc should redefine 
arch_faults_on_old_pte()
On some variants of powerpc, accessed flag is managed by HW. On others, it is 
managed by SW TLB miss
handlers via page fault handling.

Thanks
Christophe





Re: ppc64 early slub caches have zero random value

2020-04-22 Thread Vlastimil Babka
On 4/21/20 10:39 AM, Nicolai Stange wrote:
> Hi
> 
> [adding some drivers/char/random folks + LKML to CC]
> 
> Vlastimil Babka  writes:
> 
>> On 4/17/20 6:53 PM, Michal Suchánek wrote:
>>> Hello,
>>
>> Hi, thanks for reproducing on latest upstream!
>>
>>> instrumenting the kernel with the following patch
>>> 
>>> ---
>>>  mm/slub.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>> 
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index d6787bbe0248..d40995d5f8ff 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -3633,6 +3633,7 @@ static int kmem_cache_open(struct kmem_cache *s, 
>>> slab_flags_t flags)
>>> s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
>>>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
>>> s->random = get_random_long();
>>> +   pr_notice("Creating cache %s with s->random=%ld\n", s->name, s->random);
>>>  #endif
>>>  
>>> if (!calculate_sizes(s, -1))
>>> 
>>> I get:
>>> 
>>> [0.00] random: get_random_u64 called from kmem_cache_open+0x3c/0x5b0
>> with crng_init=0
>>> [0.00] Creating cache kmem_cache_node with s->random=0
>>> [0.00] Creating cache kmem_cache with s->random=0
>>> [0.00] Creating cache kmalloc-8 with s->random=0
>>> [0.00] Creating cache kmalloc-16 with s->random=0
>>> [0.00] Creating cache kmalloc-32 with s->random=0
>>> [0.00] Creating cache kmalloc-64 with s->random=0
>>> [0.00] Creating cache kmalloc-96 with s->random=0
>>> [0.00] Creating cache kmalloc-128 with s->random=0
>>> [0.00] Creating cache kmalloc-192 with s->random=-682532147323126958
>>> 
>>> The earliest caches created invariably end up with s->random of zero.

I have also realized that the rest of the early created caches is initialized
with values that are non-zero, but in fact come from the same deterministic
sequence.

>> It seems that reliably it's the first 8 calls to get_random_u64(), which sounds
>> more like some off-by-X bug than a genuine lack of entropy that would become
>> fixed in the meanwhile?
>>
>>> This is a problem for crash which does not recognize these as randomized
>>> and fails to read them. While this can be addressed in crash is it
>>> intended to create caches with zero random value in the kernel?
>>
>> Definitely not. The question is more likely what guarantees we have with
>> crng_init=0. Probably we can't expect cryptographically strong randomness, 
>> but
>> zeroes still do look like a bug to me?
>>
>>> This is broken at least in the 5.4~5.7 range but it is not clear if this
>>> ever worked. All examples of earlier kernels I have at hand use slab mm.
>>> 
>>> Thanks
>>> 
>>> Michal
>>>
> 
> FWIW, I've seen something similar in a slightly different context,
> c.f. [1].
> 
> Basically, the issue is that on anything but x86_64 (and perhaps arm64
> IIRC), arch_get_random_long() is unavailable and thus, get_random_u64()
> falls through to that batched extract_crng() extraction. That is, it
> extracts eight random longs from the chacha20 based RNG at once and
> batches them up for consumption by the current and subsequent
> get_random_u64() invocations. Which is in line with your observation
> that get_random_u64() returned zero exactly eight times in a row.
> 
> The fact that extract_crng() actually extracted eight successive zero
> values surprised me though. But from looking at chacha20_block(), called
> from _extract_crng() with the primary_crng instance's state buffer as
> input, it seems like a zeroed state buffer gets identity transformed and
> that all this fancy shifting and rolling and whatnot in chacha_permute()
> would have no effect at all. So I suppose that the primary_crng's state
> buffer is still zeroed out at that point during boot.

Looks so, thanks for the explanation. So there's simply no entropy and the kmalloc-X
caches have deterministic s->random. Zeroes are just more obvious and may e.g.
fail to detect random corruptions (so we should still fix those?), while for an
attacker anything predictable is bad, and I don't know
what to do about it. If we were to e.g. reseed the early created kmalloc caches
later where crng is fully inited, we would have to basically "stop the world"
and rewrite existing freelists.
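For context, with CONFIG_SLAB_FREELIST_HARDENED the s->random value is used to
obfuscate the freelist pointers stored inside free objects, roughly like this
(simplified from mm/slub.c; newer kernels additionally swab() the address for
better bit diffusion), which is why a zero or otherwise predictable s->random
weakens both the corruption detection and the hardening:

/* Simplified sketch of the freelist pointer obfuscation in mm/slub.c. */
static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
				 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
	return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
#else
	return ptr;
#endif
}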

> Thanks,
> 
> Nicolai
> 
> [1] https://lkml.kernel.org/r/87d08rbbg9@suse.de
> 



Re: ppc64 early slub caches have zero random value

2020-04-17 Thread Vlastimil Babka
On 4/17/20 6:53 PM, Michal Suchánek wrote:
> Hello,

Hi, thanks for reproducing on latest upstream!

> instrumenting the kernel with the following patch
> 
> ---
>  mm/slub.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index d6787bbe0248..d40995d5f8ff 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3633,6 +3633,7 @@ static int kmem_cache_open(struct kmem_cache *s, 
> slab_flags_t flags)
>   s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
>   s->random = get_random_long();
> + pr_notice("Creating cache %s with s->random=%ld\n", s->name, s->random);
>  #endif
>  
>   if (!calculate_sizes(s, -1))
> 
> I get:
> 
> [0.00] random: get_random_u64 called from kmem_cache_open+0x3c/0x5b0
with crng_init=0
> [0.00] Creating cache kmem_cache_node with s->random=0
> [0.00] Creating cache kmem_cache with s->random=0
> [0.00] Creating cache kmalloc-8 with s->random=0
> [0.00] Creating cache kmalloc-16 with s->random=0
> [0.00] Creating cache kmalloc-32 with s->random=0
> [0.00] Creating cache kmalloc-64 with s->random=0
> [0.00] Creating cache kmalloc-96 with s->random=0
> [0.00] Creating cache kmalloc-128 with s->random=0
> [0.00] Creating cache kmalloc-192 with s->random=-682532147323126958
> 
> The earliest caches created invariably end up with s->random of zero.

It seems that reliably it's the first 8 calls to get_random_u64(), which sounds
more like some off-by-X bug than a genuine lack of entropy that would become
fixed in the meanwhile?

> This is a problem for crash which does not recognize these as randomized
> and fails to read them. While this can be addressed in crash is it
> intended to create caches with zero random value in the kernel?

Definitely not. The question is more likely what guarantees we have with
crng_init=0. Probably we can't expect cryptographically strong randomness, but
zeroes still do look like a bug to me?

> This is broken at least in the 5.4~5.7 range but it is not clear if this
> ever worked. All examples of earlier kernels I have at hand use slab mm.
> 
> Thanks
> 
> Michal
> 



[PATCH 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Vlastimil Babka
Sachin reports [1] a crash in SLUB __slab_alloc():

BUG: Kernel NULL pointer dereference on read at 0x73b0
Faulting instruction address: 0xc03d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP:  c03d55f4 LR: c03d5b94 CTR: 
REGS: c008b37836d0 TRAP: 0300   Not tainted  
(5.6.0-rc2-next-20200218-autotest)
MSR:  80009033   CR: 24004844  XER: 
CFAR: c000dec4 DAR: 73b0 DSISR: 4000 IRQMASK: 1
GPR00: c03d5b94 c008b3783960 c155d400 c008b301f500
GPR04: 0dc0 0002 c03443d8 c008bb398620
GPR08: 0008ba2f 0001  
GPR12: 24004844 c0001ec52a00  
GPR16: c008a1b20048 c1595898 c1750c18 0002
GPR20: c1750c28 c1624470 000fffe0 5deadbeef122
GPR24: 0001 0dc0 0002 c03443d8
GPR28: c008b301f500 c008bb398620  c00c02287180
NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

This is a PowerPC platform with following NUMA topology:

available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

possible numa nodes: 0-31

This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on
appropriate NUMA node" [2] which effectively calls kmalloc_node for each
possible node. SLUB however only allocates kmem_cache_node on online
N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node
for other nodes since commit a561ce00b09e ("slub: fall back to
node_to_mem_node() node if allocating on memoryless node"). This is however not
true in this configuration where the _node_numa_mem_ array is not initialized
for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up
accessing non-allocated kmem_cache_node.

A related issue was reported by Bharata (originally by Ramachandran) [3] where
a similar PowerPC configuration, but with a mainline kernel without patch [2],
ends up allocating large amounts of pages for the kmalloc-1k and kmalloc-512 caches. This seems
to have the same underlying issue with node_to_mem_node() not behaving as
expected, and might probably also lead to an infinite loop with
CONFIG_SLUB_CPU_PARTIAL [4].

This patch should fix both issues by not relying on node_to_mem_node() anymore
and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is
attempted for a node that's not online, or has no usable memory. The "usable
memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY
node state, as that is exactly the condition that SLUB uses to allocate
kmem_cache_node structures. The check in get_partial() is removed completely,
as the checks in ___slab_alloc() are now sufficient to prevent get_partial()
being reached with an invalid node.

[1] 
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
[2] 
https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c...@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/
[4] 
https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125a...@suse.cz/

Reported-and-tested-by: Sachin Sant 
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN 
Tested-by: Bharata B Rao 
Debugged-by: Srikar Dronamraju 
Signed-off-by: Vlastimil Babka 
Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating 
on memoryless node")
Cc: sta...@vger.kernel.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc:

Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Vlastimil Babka
On 3/20/20 8:46 AM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-19 15:10:19]:
> 
>> On 3/19/20 3:05 PM, Srikar Dronamraju wrote:
>> > * Vlastimil Babka  [2020-03-19 14:47:58]:
>> > 
>> 
>> No, but AFAICS, such node values are already handled in ___slab_alloc, and
>> cannot reach get_partial(). If you see something I missed, please do tell.
>> 
> 
> Ah I probably got confused with your previous version where
> alloc_slab_page() was modified. I see no problems with this version.

Thanks!

> Sorry for the noise.

No problem.

> A question just for my better understanding,
> How worse would it be to set node to numa_mem_id() instead of NUMA_NODE_ID
> when the current node is !N_NORMAL_MEMORY?

(I'm assuming you mean s/NUMA_NODE_ID/NUMA_NO_NODE/)

Well, numa_mem_id() should work too, but it would make the allocation
constrained to the node of current cpu, with all the consequences (deactivating
percpu slab if it was from a different node etc).

There's no reason why this cpu's node should be the closest node to the one that
was originally requested (but has no memory), so it's IMO pointless or even
suboptimal to constrain to it. This can be revisited in case we get guaranteed
existence of node data with zonelists for all possible nodes, but for now
NUMA_NO_NODE seems the most reasonable fix to me.

>> >>   if (object || node != NUMA_NO_NODE)
>> > 
>> 
> 



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Vlastimil Babka
On 3/20/20 4:42 AM, Bharata B Rao wrote:
> On Thu, Mar 19, 2020 at 02:47:58PM +0100, Vlastimil Babka wrote:
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..7113b1f9cd77 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
>> flags, int node,
>>  
>>  if (node == NUMA_NO_NODE)
>>  searchnode = numa_mem_id();
>> -else if (!node_present_pages(node))
>> -searchnode = node_to_mem_node(node);
>>  
>>  object = get_partial_node(s, get_node(s, searchnode), c, flags);
>>  if (object || node != NUMA_NO_NODE)
>> @@ -2563,17 +2561,27 @@ static void *___slab_alloc(struct kmem_cache *s, 
>> gfp_t gfpflags, int node,
>>  struct page *page;
>>  
>>  page = c->page;
>> -if (!page)
>> +if (!page) {
>> +/*
>> + * if the node is not online or has no normal memory, just
>> + * ignore the node constraint
>> + */
>> +if (unlikely(node != NUMA_NO_NODE &&
>> + !node_state(node, N_NORMAL_MEMORY)))
>> +node = NUMA_NO_NODE;
>>  goto new_slab;
>> +}
>>  redo:
>>  
>>  if (unlikely(!node_match(page, node))) {
>> -int searchnode = node;
>> -
>> -if (node != NUMA_NO_NODE && !node_present_pages(node))
>> -searchnode = node_to_mem_node(node);
>> -
>> -if (unlikely(!node_match(page, searchnode))) {
>> +/*
>> + * same as above but node_match() being false already
>> + * implies node != NUMA_NO_NODE
>> + */
>> +if (!node_state(node, N_NORMAL_MEMORY)) {
>> +node = NUMA_NO_NODE;
>> +goto redo;
>> +} else {
>>  stat(s, ALLOC_NODE_MISMATCH);
>>  deactivate_slab(s, page, c->freelist, c);
>>  goto new_slab;
> 
> This fixes the problem I reported at
> https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/

Thanks, I hope it means I can make it Reported-and-tested-by: you


Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-19 Thread Vlastimil Babka
On 3/19/20 3:05 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-19 14:47:58]:
> 
>> 8<
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..7113b1f9cd77 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
>> flags, int node,
>> 
>>  if (node == NUMA_NO_NODE)
>>  searchnode = numa_mem_id();
>> -else if (!node_present_pages(node))
>> -searchnode = node_to_mem_node(node);
>> 
>>  object = get_partial_node(s, get_node(s, searchnode), c, flags);
> 
> Are we okay with passing a node to get_partial_node with !NUMA_NO_NODE and
> !N_MEMORY including possible nodes?

No, but AFAICS, such node values are already handled in ___slab_alloc, and
cannot reach get_partial(). If you see something I missed, please do tell.

>>  if (object || node != NUMA_NO_NODE)
> 



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-19 Thread Vlastimil Babka
On 3/19/20 2:26 PM, Sachin Sant wrote:
> 
> 
>> On 19-Mar-2020, at 6:53 PM, Vlastimil Babka  wrote:
>> 
>> On 3/19/20 9:52 AM, Sachin Sant wrote:
>>> 
>>>> OK how about this version? It's somewhat ugly, but important is that the 
>>>> fast
>>>> path case (c->page exists) is unaffected and another common case (c->page 
>>>> is
>>>> NULL, but node is NUMA_NO_NODE) is just one extra check - impossible to 
>>>> avoid at
>>>> some point anyway.
>>>> 
>>> 
>>> I attempted the suggested tests.
>>> 
>>> Test 1: March 18 linux-next + Patch 1 [1] + Patch  2 [2]
>>> 
>>> Machine boots fine. numactl o/p after boot:
>> 
>> Great, thanks! Can I add your Tested-by: then?
> 
> Sure.
> Tested-by: Sachin Sant 
> 
> Thank you for the fix.

Thanks! Sorry to bother, but in the end I decided to make a further change, so I
would appreciate verification that it still works as intended.
The point is to use node_state(N_NORMAL_MEMORY) instead of node_present_pages(),
as that is really what SLUB uses to decide whether to allocate the
kmem_cache_node. So we should match this properly given the opportunity.
I have also again removed the node_online() check in alloc_slab_page() as it
really shouldn't be reachable with an offline node - everything is taken care of
in ___slab_alloc, or callers use NUMA_NO_NODE.

8<
diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..7113b1f9cd77 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
 
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
-   else if (!node_present_pages(node))
-   searchnode = node_to_mem_node(node);
 
object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
@@ -2563,17 +2561,27 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t 
gfpflags, int node,
struct page *page;
 
page = c->page;
-   if (!page)
+   if (!page) {
+   /*
+* if the node is not online or has no normal memory, just
+* ignore the node constraint
+*/
+   if (unlikely(node != NUMA_NO_NODE &&
+!node_state(node, N_NORMAL_MEMORY)))
+   node = NUMA_NO_NODE;
goto new_slab;
+   }
 redo:
 
if (unlikely(!node_match(page, node))) {
-   int searchnode = node;
-
-   if (node != NUMA_NO_NODE && !node_present_pages(node))
-   searchnode = node_to_mem_node(node);
-
-   if (unlikely(!node_match(page, searchnode))) {
+   /*
+* same as above but node_match() being false already
+* implies node != NUMA_NO_NODE
+*/
+   if (!node_state(node, N_NORMAL_MEMORY)) {
+   node = NUMA_NO_NODE;
+   goto redo;
+   } else {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
goto new_slab;
-- 
2.25.1



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-19 Thread Vlastimil Babka
On 3/19/20 9:52 AM, Sachin Sant wrote:
> 
>> OK how about this version? It's somewhat ugly, but important is that the fast
>> path case (c->page exists) is unaffected and another common case (c->page is
>> NULL, but node is NUMA_NO_NODE) is just one extra check - impossible to 
>> avoid at
>> some point anyway.
>> 
> 
> I attempted the suggested tests.
> 
> Test 1: March 18 linux-next + Patch 1 [1] + Patch  2 [2]
> 
> Machine boots fine. numactl o/p after boot:

Great, thanks! Can I add your Tested-by: then?

If Bharata confirms his scenario with updated patch, I will send it properly.

> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31
> node 1 size: 35631 MB
> node 1 free: 32724 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
> #
> 
> Test 2: Code base as used in Test 1 + 3 patch series from Srikar [3]
> 
> Machine boots fine. numactl o/p after boot:
> 
> # numactl -H
> available: 1 nodes (1)
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31
> node 1 size: 35631 MB
> node 1 free: 32711 MB
> node distances:
> node   1
>   1:  10
> #
> 
> Thanks!
> -Sachin
> 
> 
> [1] 
> https://lore.kernel.org/linux-mm/e060ad43-ff4e-0e59-2e64-ce8a4916e...@suse.cz/T/#mb8d0a358dc5c5fb9661fa45047072a92c510faf2
> [2] 
> https://lore.kernel.org/linux-mm/e060ad43-ff4e-0e59-2e64-ce8a4916e...@suse.cz/T/#mce342e54a95ce2ee3043901e52d70ee2fd94c516
> [3] 
> https://lore.kernel.org/linux-mm/2020030237.5731-1-sri...@linux.vnet.ibm.com/
> 



Re: [PATCH v2 1/4] mm: Check for node_online in node_present_pages

2020-03-19 Thread Vlastimil Babka
On 3/19/20 1:32 AM, Michael Ellerman wrote:
> Seems like a nice solution to me

Thanks :)

>> 8<
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..1d4f2d7a0080 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct 
>> kmem_cache *s,
>>  struct page *page;
>>  unsigned int order = oo_order(oo);
>>  
>> -if (node == NUMA_NO_NODE)
>> +if (node == NUMA_NO_NODE || !node_online(node))
> 
> Why don't we need the node_present_pages() check here?

Page allocator is fine with a node without present pages, as long as there's a
zonelist, which online nodes must have (ideally all possible nodes should have,
and then we can remove this).

SLUB on the other hand doesn't allocate its per-node structures (kmem_cache_node)
for nodes without present pages (understandably), that's why the other place includes the
node_present_pages() check.
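Condensed into a sketch (taken from the hunks discussed in this series), the
two different checks look like this -- the page allocator path only needs an
online node, while the SLUB partial-list path needs the per-node structure:

/* alloc_slab_page(): the page allocator only needs a zonelist, i.e. an online node */
if (node == NUMA_NO_NODE || !node_online(node))
	page = alloc_pages(flags, order);
else
	page = __alloc_pages_node(node, flags, order);

/* get_partial() path: kmem_cache_node only exists for nodes with present memory */
if (node != NUMA_NO_NODE && !node_present_pages(node))
	searchnode = node_to_mem_node(node);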

Thanks


Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-18 Thread Vlastimil Babka


On 3/18/20 5:06 PM, Bharata B Rao wrote:
>> @@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, 
>> gfp_t gfpflags, int node,
>>  redo:
>>  
>>  if (unlikely(!node_match(page, node))) {
>> -int searchnode = node;
>> -
>> -if (node != NUMA_NO_NODE && !node_present_pages(node))
>> -searchnode = node_to_mem_node(node);
>> -
>> -if (unlikely(!node_match(page, searchnode))) {
>> +/*
>> + * node_match() false implies node != NUMA_NO_NODE
>> + * but if the node is not online or has no pages, just
>> + * ignore the constraint
>> + */
>> +if ((!node_online(node) || !node_present_pages(node))) {
>> +node = NUMA_NO_NODE;
>> +goto redo;
> 
> Many calls for allocating slab object from memory-less node 0 in my case
> don't even hit the above check because they get short circuited by
> goto new_slab label which is present a few lines above.  Hence I don't see
> any reduction in the amount of slab memory with this fix.
> 
> Regards,
> Bharata.
 
OK how about this version? It's somewhat ugly, but what's important is that the fast
path case (c->page exists) is unaffected and another common case (c->page is
NULL, but node is NUMA_NO_NODE) is just one extra check - impossible to avoid at
some point anyway.

8<
>From d1675363c2ddc3758e5c0d0f435ca46818219194 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka 
Date: Wed, 18 Mar 2020 14:47:33 +0100
Subject: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

Sachin reports [1] a crash in SLUB __slab_alloc():

BUG: Kernel NULL pointer dereference on read at 0x73b0
Faulting instruction address: 0xc03d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP:  c03d55f4 LR: c03d5b94 CTR: 
REGS: c008b37836d0 TRAP: 0300   Not tainted  
(5.6.0-rc2-next-20200218-autotest)
MSR:  80009033   CR: 24004844  XER: 
CFAR: c000dec4 DAR: 73b0 DSISR: 4000 IRQMASK: 1
GPR00: c03d5b94 c008b3783960 c155d400 c008b301f500
GPR04: 0dc0 0002 c03443d8 c008bb398620
GPR08: 0008ba2f 0001  
GPR12: 24004844 c0001ec52a00  
GPR16: c008a1b20048 c1595898 c1750c18 0002
GPR20: c1750c28 c1624470 000fffe0 5deadbeef122
GPR24: 0001 0dc0 0002 c03443d8
GPR28: c008b301f500 c008bb398620  c00c02287180
NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

This is a PowerPC platform with following NUMA topology:

available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

possible numa nodes: 0-31

This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on
appropriate NUMA node" [2] which effectively calls kmalloc_node for each
possible node. SLUB however only allocates kmem_cache_node on online nodes
with present memory, and relies on node_to_mem_node to return such valid node
for other nodes since commit a561ce00b09e ("slub: fall back to
node_to_mem_node() node if allocating on memoryless node"). This is however not
true in this configuration where the _node_numa_mem_ array is not initialized
for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up
accessing non-allocated kmem_cache_node.

A related issue was reported by Bharata [3] where a similar PowerPC
configura

Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-18 Thread Vlastimil Babka
On 3/18/20 5:06 PM, Bharata B Rao wrote:
> On Wed, Mar 18, 2020 at 03:42:19PM +0100, Vlastimil Babka wrote:
>> This is a PowerPC platform with following NUMA topology:
>> 
>> available: 2 nodes (0-1)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
>> 24 25 26 27 28 29 30 31
>> node 1 size: 35247 MB
>> node 1 free: 30907 MB
>> node distances:
>> node   0   1
>>   0:  10  40
>>   1:  40  10
>> 
>> possible numa nodes: 0-31
>> 
>> A related issue was reported by Bharata [3] where a similar PowerPC
>> configuration, but without patch [2] ends up allocating large amounts of 
>> pages
>> by kmalloc-1k kmalloc-512. This seems to have the same underlying issue with
>> node_to_mem_node() not behaving as expected, and might probably also lead
>> to an infinite loop with CONFIG_SLUB_CPU_PARTIAL.
> 
> This patch doesn't fix the issue of kmalloc caches consuming more
> memory for the above mentioned topology. Also CONFIG_SLUB_CPU_PARTIAL is set
> here and I have not observed infinite loop till now.

OK that means something is wrong with my analysis.

> Or, are you expecting your fix to work on top of Srikar's other patchset
> https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u
>  ?

No, I hoped it would work on mainline.

> With the above patchset, no fix is required to address increased memory
> consumption of kmalloc caches because this patchset prevents such
> topology from occuring thereby making it impossible for the problem
> to surface (or at least impossible for the specific topology that I
> mentioned)

Right, I hope to fix it nevertheless.

>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..4d798cacdae1 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct 
>> kmem_cache *s,
>>  struct page *page;
>>  unsigned int order = oo_order(oo);
>>  
>> -if (node == NUMA_NO_NODE)
>> +if (node == NUMA_NO_NODE || !node_online(node))
>>  page = alloc_pages(flags, order);
>>  else
>>  page = __alloc_pages_node(node, flags, order);
>> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
>> flags, int node,
>>  
>>  if (node == NUMA_NO_NODE)
>>  searchnode = numa_mem_id();
>> -else if (!node_present_pages(node))
>> -searchnode = node_to_mem_node(node);
> 
> We still come here with memory-less node=0 (and not NUMA_NO_NODE), fail to
> find partial slab, go back and allocate a new one thereby continuously
> increasing the number of newly allocated slabs.
> 
>>  
>>  object = get_partial_node(s, get_node(s, searchnode), c, flags);
>>  if (object || node != NUMA_NO_NODE)
>> @@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, 
>> gfp_t gfpflags, int node,
>>  redo:
>>  
>>  if (unlikely(!node_match(page, node))) {
>> -int searchnode = node;
>> -
>> -if (node != NUMA_NO_NODE && !node_present_pages(node))
>> -searchnode = node_to_mem_node(node);
>> -
>> -if (unlikely(!node_match(page, searchnode))) {
>> +/*
>> + * node_match() false implies node != NUMA_NO_NODE
>> + * but if the node is not online or has no pages, just
>> + * ignore the constraint
>> + */
>> +if ((!node_online(node) || !node_present_pages(node))) {
>> +node = NUMA_NO_NODE;
>> +goto redo;
> 
> Many calls for allocating slab object from memory-less node 0 in my case
> don't even hit the above check because they get short circuited by
> goto new_slab label which is present a few lines above.  Hence I don't see
> any reduction in the amount of slab memory with this fix.

Thanks a lot for the info, I will try again :)

> Regards,
> Bharata.
> 



[RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-18 Thread Vlastimil Babka
Sachin reports [1] a crash in SLUB __slab_alloc():

BUG: Kernel NULL pointer dereference on read at 0x73b0
Faulting instruction address: 0xc03d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP:  c03d55f4 LR: c03d5b94 CTR: 
REGS: c008b37836d0 TRAP: 0300   Not tainted  
(5.6.0-rc2-next-20200218-autotest)
MSR:  80009033   CR: 24004844  XER: 
CFAR: c000dec4 DAR: 73b0 DSISR: 4000 IRQMASK: 1
GPR00: c03d5b94 c008b3783960 c155d400 c008b301f500
GPR04: 0dc0 0002 c03443d8 c008bb398620
GPR08: 0008ba2f 0001  
GPR12: 24004844 c0001ec52a00  
GPR16: c008a1b20048 c1595898 c1750c18 0002
GPR20: c1750c28 c1624470 000fffe0 5deadbeef122
GPR24: 0001 0dc0 0002 c03443d8
GPR28: c008b301f500 c008bb398620  c00c02287180
NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

This is a PowerPC platform with following NUMA topology:

available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

possible numa nodes: 0-31

This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on
appropriate NUMA node" [2] which effectively calls kmalloc_node for each
possible node. SLUB however only allocates kmem_cache_node on online nodes
with present memory, and relies on node_to_mem_node to return such valid node
for other nodes since commit a561ce00b09e ("slub: fall back to
node_to_mem_node() node if allocating on memoryless node"). This is however not
true in this configuration where the _node_numa_mem_ array is not initialized
for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up
accessing non-allocated kmem_cache_node.

A related issue was reported by Bharata [3], where a similar PowerPC
configuration, but without patch [2], ends up allocating large amounts of pages
for the kmalloc-1k and kmalloc-512 caches. This seems to have the same
underlying issue with node_to_mem_node() not behaving as expected, and might
also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL.

This patch should fix both issues by not relying on node_to_mem_node() anymore
and instead simply falling back to NUMA_NO_NODE when kmalloc_node(node) is
attempted for a node that's not online or has no pages. Also, in case
alloc_slab_page() is reached with a non-online node, fall back as well, until
we have a guarantee that all possible nodes have valid NODE_DATA with a proper
zonelist for fallback.
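
In code terms, the core of the change boils down to the two fallbacks sketched
below (condensed from the actual hunks, surrounding code trimmed):

  /* ___slab_alloc(): ignore the node constraint for unusable nodes */
  if (unlikely(!node_match(page, node))) {
          /*
           * node_match() false implies node != NUMA_NO_NODE, but if the
           * node is not online or has no pages, just ignore the constraint
           */
          if (!node_online(node) || !node_present_pages(node)) {
                  node = NUMA_NO_NODE;
                  goto redo;
          }
          ...
  }

  /* alloc_slab_page(): never pass a non-online node to the page allocator */
  if (node == NUMA_NO_NODE || !node_online(node))
          page = alloc_pages(flags, order);
  else
          page = __alloc_pages_node(node, flags, order);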

[1] 
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
[2] 
https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c...@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/
[4] 
https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125a...@suse.cz/

Reported-by: Sachin Sant 
Reported-by: Bharata B Rao 
Debugged-by: Srikar Dronamraju 
Signed-off-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Nathan Lynch 
---
Hi, this is my alternative solution to the SLUB issues, instead of the series [1].
Could Sachin and Bharata please test whether it fixes the issues?
1) on plain mainline (Bharata) or next (Sachin)
2) the same, but with [PATCH 0/3] Offline memoryless cpuless node 0 [2] applied,
   as I assume that series was not related to the SLUB issues and will be kept?

Thanks!

[1] 
https://lore.kernel.org/linux-mm/20200

Re: [PATCH v2 1/4] mm: Check for node_online in node_present_pages

2020-03-18 Thread Vlastimil Babka
On 3/18/20 11:02 AM, Michal Hocko wrote:
> On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
>> Calling a kmalloc_node on a possible node which is not yet onlined can
>> lead to panic. Currently node_present_pages() doesn't verify the node is
>> online before accessing the pgdat for the node. However pgdat struct may
>> not be available resulting in a crash.
>> 
>> NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
>> LR [c03d5b94] __slab_alloc+0x34/0x60
>> Call Trace:
>> [c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
>> [c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
>> [c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
>> [c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
>> [c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
>> [c008b3783b90] [c0235aa8] online_css+0x48/0xd0
>> [c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
>> [c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
>> [c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
>> [c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
>> [c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
>> [c008b3783e20] [c000b278] system_call+0x5c/0x68
>> 
>> Fix this by verifying the node is online before accessing the pgdat
>> structure. Fix the same for node_spanned_pages() too.
>> 
>> Cc: Andrew Morton 
>> Cc: linux...@kvack.org
>> Cc: Mel Gorman 
>> Cc: Michael Ellerman 
>> Cc: Sachin Sant 
>> Cc: Michal Hocko 
>> Cc: Christopher Lameter 
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: Joonsoo Kim 
>> Cc: Kirill Tkhai 
>> Cc: Vlastimil Babka 
>> Cc: Srikar Dronamraju 
>> Cc: Bharata B Rao 
>> Cc: Nathan Lynch 
>> 
>> Reported-by: Sachin Sant 
>> Tested-by: Sachin Sant 
>> Signed-off-by: Srikar Dronamraju 
>> ---
>>  include/linux/mmzone.h | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index f3f264826423..88078a3b95e5 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>>  atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
>>  } pg_data_t;
>>  
>> -#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
>> -#define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages)
>> +#define node_present_pages(nid) \
>> +(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
>> +#define node_spanned_pages(nid) \
>> +(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
> 
> I believe this is a wrong approach. We really do not want to special
> case all the places which require NODE_DATA. Can we please go and
> allocate pgdat for all possible nodes?
> 
> The current state of memory less hacks subtle bugs poping up here and
> there just prove that we should have done that from the very begining
> IMHO.

Yes. So here's an alternative proposal for fixing the current situation in SLUB,
before the long-term solution of having all possible nodes provide valid pgdat
with zonelists:

- fix SLUB with the hunk at the end of this mail - the point is to use
  NUMA_NO_NODE as fallback instead of node_to_mem_node()
- this removes all uses of node_to_mem_node() (luckily it's just SLUB), so
  kill it completely instead of trying to fix it up
- patch 1/4 is not needed with the fix
- perhaps many of your other patches are also not needed
- once we get the long-term solution, some of the !node_online() checks can
  be removed

8<
diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..1d4f2d7a0080 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct 
kmem_cache *s,
struct page *page;
unsigned int order = oo_order(oo);
 
-   if (node == NUMA_NO_NODE)
+   if (node == NUMA_NO_NODE || !node_online(node))
page = alloc_pages(flags, order);
else
page = __alloc_pages_node(node, flags, order);
@@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
 
if (node == NUMA_NO_NODE)
searchnode = numa_mem_id();
-   else if (!node_present_pages(node))
-   searchnode = node_to_mem_node(node);
 
object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
@@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t 

Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-18 Thread Vlastimil Babka
On 3/18/20 4:20 AM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-17 17:45:15]:
>> 
>> Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
>> patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
>> that there has to be something else that calls kmalloc_node(node) where node 
>> is
>> one that doesn't have present pages.
>> 
>> He mentions alloc_fair_sched_group() which has:
>> 
>> for_each_possible_cpu(i) {
>> cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>>   GFP_KERNEL, cpu_to_node(i));
>> ...
>> se = kzalloc_node(sizeof(struct sched_entity),
>>   GFP_KERNEL, cpu_to_node(i));
>> 
> 
> 
> Sachin's experiment.
> Upstream-next/ memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.

So you're saying the crash was actually for allocation on e.g. node 2, not node 
0?
But I believe it was on node 0, because init_kmem_cache_nodes() will only
allocate kmem_cache_node on nodes with N_NORMAL_MEMORY (which doesn't include
0), and slab_mem_going_online_callback() was probably not called for node 0 (it
was not dynamically onlined).
Also if node 0 was fine, node_to_mem_node(2-31) (not initialized explicitly)
would have returned 0 and thus not crash as well.
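
(For reference, the relevant part of init_kmem_cache_nodes(), trimmed; the
point is that it iterates N_NORMAL_MEMORY nodes only:)

  for_each_node_state(node, N_NORMAL_MEMORY) {
          struct kmem_cache_node *n;
          ...
          n = kmem_cache_alloc_node(kmem_cache_node, GFP_KERNEL, node);
          ...
          init_kmem_cache_node(n);
          s->node[node] = n;
  }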

> Bharata's experiment.
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node/ for_each_possible_cpu
> i.e kmalloc is called for N_ONLINE nodes.
> So wouldn't crash
> 
> Even if his possible nodes were 0-256. I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence its not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.

Are you sure that for each CPU in for_each_possible_cpu(), cpu_to_node() will be
1? Are all of them properly initialized, or is there a similar issue as with
node_to_mem_node(), where some were not initialized and thus cpu_to_node() will
return 0?

Because AFAICS, if kzalloc_node() was always called with node 1, then
node_present_pages(1) is true, and the "hack" that Bharata reports to work in
his original mail would make no functional difference.

> 
>> I assume one of these structs is 1k and other 512 bytes (rounded) and that 
>> for
>> some possible cpu's cpu_to_node(i) will be 0, which has no present pages. 
>> And as
>> Bharata pasted, node_to_mem_node(0) = 0
>> So this looks like the same scenario, but it doesn't crash? Is the node 0
>> actually online here, and/or does it have N_NORMAL_MEMORY state?
> 
> I still dont have any clue on the leak though.

Let's assume that kzalloc_node() was called with node 0 for some of the possible
CPUs. I still wonder why it won't crash, but let's assume kmem_cache_node does
exist for node 0 here.
So the execution AFAICS goes like this:

slab_alloc_node(0)
  c = raw_cpu_ptr(s->cpu_slab);
  object = c->freelist;
  page = c->page;
  if (unlikely(!object || !node_match(page, node))) {
  // whatever we have in the per-cpu cache must be from node 1
  // because node 0 has no memory, so there's no node_match and thus
   __slab_alloc(node == 0)
___slab_alloc(node == 0)
  page = c->page;
 redo:
  if (unlikely(!node_match(page, node))) { // still no match
int searchnode = node;

if (node != NUMA_NO_NODE && !node_present_pages(node))
   //  true && true for node 0
  searchnode = node_to_mem_node(node);
  // searchnode is 0, not 1

  if (unlikely(!node_match(page, searchnode))) {
  // page still from node 1, searchnode is 0, no match

stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
// we removed the slab from cpu's cache
goto new_slab;
  }

 new_slab:
  if (slub_percpu_partial(c)) {
page = c->page = slub_percpu_partial(c);
slub_set_percpu_partial(c, page);
stat(s, CPU_PARTIAL_ALLOC);
goto redo;
// huh, so with CONFIG_SLUB_CPU_PARTIAL
// this can become an infinite loop actually?
  }
// Bharata's slub stats don't include cpu_partial_alloc so I assume
// CONFIG_SLUB_CPU_PARTIAL is not enabled and we don't loop
  freelist = new_slab_objects(s, gfpflags, node, &c);
new_slab_objects(s, gfpflags, node, &c)

 if (node == NUMA_NO_NODE) // false, it's 0
 else if (!node_present_pages(node)) // true for 0
        searchnode = node_to_mem_node(node); // still 0

 object = 

Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Vlastimil Babka
On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-17 16:56:04]:
> 
>> 
>> I wonder why do you get a memory leak while Sachin in the same situation [1]
>> gets a crash? I don't understand anything anymore.
> 
> Sachin was testing on linux-next which has Kirill's patch which modifies
> slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> upstream, which doesn't have this. 

Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
that there has to be something else that calls kmalloc_node(node) where node is
one that doesn't have present pages.

He mentions alloc_fair_sched_group() which has:

for_each_possible_cpu(i) {
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
  GFP_KERNEL, cpu_to_node(i));
...
se = kzalloc_node(sizeof(struct sched_entity),
  GFP_KERNEL, cpu_to_node(i));

I assume one of these structs is 1k and other 512 bytes (rounded) and that for
some possible cpu's cpu_to_node(i) will be 0, which has no present pages. And as
Bharata pasted, node_to_mem_node(0) = 0
So this looks like the same scenario, but it doesn't crash? Is the node 0
actually online here, and/or does it have N_NORMAL_MEMORY state?

>> 
>> [1]
>> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
>> 
> 



Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Vlastimil Babka
On 3/17/20 12:53 PM, Bharata B Rao wrote:
> On Tue, Mar 17, 2020 at 02:56:28PM +0530, Bharata B Rao wrote:
>> Case 1: 2 node NUMA, node0 empty
>> 
>> # numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7
>> node 1 size: 16294 MB
>> node 1 free: 15453 MB
>> node distances:
>> node   0   1 
>>   0:  10  40 
>>   1:  40  10 
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..888e4d245444 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
>> flags, int node,
>>  void *object;
>>  int searchnode = node;
>>  
>> -if (node == NUMA_NO_NODE)
>> +if (node == NUMA_NO_NODE || !node_present_pages(node))
>>  searchnode = numa_mem_id();
>> -else if (!node_present_pages(node))
>> -searchnode = node_to_mem_node(node);
> 
> For the above topology, I see this:
> 
> node_to_mem_node(1) = 1
> node_to_mem_node(0) = 0
> node_to_mem_node(NUMA_NO_NODE) = 0
> 
> Looks like the last two cases (returning memory-less node 0) is the
> problem here?

I wonder why you get a memory leak while Sachin, in the same situation [1],
gets a crash? I don't understand anything anymore.

[1]
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/

> Regards,
> Bharata.
> 
> 



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Vlastimil Babka
On 3/17/20 3:51 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-17 14:53:26]:
> 
>> >> > 
>> >> > Mitigate this by allocating the new slab from the node_numa_mem.
>> >> 
>> >> Are you sure this is really needed and the other 3 patches are not enough 
>> >> for
>> >> the current SLUB code to work as needed? It seems you are changing the 
>> >> semantics
>> >> here...
>> >> 
>> > 
>> > The other 3 patches are not enough because we don't carry the searchnode
>> > when the actual alloc_pages_node gets called. 
>> > 
>> > With only the 3 patches, we see the above Panic, its signature is slightly
>> > different from what Sachin first reported and which I have carried in 1st
>> > patch.
>> 
>> Ah, I see. So that's the missing pgdat after your series [1] right?
> 
> Yes the pgdat would be missing after my cpuless, memoryless node patchset.
> However..
>> 
>> That sounds like an argument for Michal's suggestions that pgdats exist and 
>> have
>> correctly populated zonelists for all possible nodes.
> 
> Only the first patch in this series would be affected by pgdat existing or
> not.  Even if the pgdat existed, the NODE_DATA[nid]->node_present_pages
> would be 0. Right? So it would look at node_to_mem_node(). And since node 0 is
> cpuless it would return 0.

I thought the point was to return 1 for node 0.

> If we pass this node 0 (which is memoryless/cpuless) to
> alloc_pages_node. Please note I am only setting node_numa_mem only
> for offline nodes. However we could change this to set for all offline and
> memoryless nodes.

That would indeed make sense.

But I guess that alloc_pages() would still crash, as the result of
node_to_mem_node() is not passed down to alloc_pages() without this patch. In
__alloc_pages_node() we currently have "The node must be valid and online" so
offline nodes don't have zonelists. Either they get them, or we indeed need
something like this patch. But in order to not make get_any_partial() dead code,
the final replacement of invalid node with a valid one should be done in
alloc_slab_page() I guess?

>> node_to_mem_node() could be just a shortcut for the first zone's node in the
>> zonelist, so that fallback follows the topology.
>> 
>> [1]
>> https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#m76e5b4c4084380b1d4b193d5aa0359b987f2290e
>> 
> 
> 
> 



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Vlastimil Babka
On 3/17/20 2:45 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-17 14:34:25]:
> 
>> On 3/17/20 2:17 PM, Srikar Dronamraju wrote:
>> > Currently while allocating a slab for a offline node, we use its
>> > associated node_numa_mem to search for a partial slab. If we don't find
>> > a partial slab, we try allocating a slab from the offline node using
>> > __alloc_pages_node. However this is bound to fail.
>> > 
>> > NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
>> > LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
>> > Call Trace:
>> > [c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 
>> > (unreliable)
>> > [c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
>> > [c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
>> > [c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
>> > [c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
>> > [c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
>> > [c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
>> > [c008b3683b90] [c0234e08] online_css+0x48/0xd0
>> > [c008b3683bc0] [c023dedc] 
>> > cgroup_apply_control_enable+0x2ec/0x4d0
>> > [c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
>> > [c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
>> > [c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
>> > [c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
>> > [c008b3683e20] [c000b278] system_call+0x5c/0x68
>> > 
>> > Mitigate this by allocating the new slab from the node_numa_mem.
>> 
>> Are you sure this is really needed and the other 3 patches are not enough for
>> the current SLUB code to work as needed? It seems you are changing the 
>> semantics
>> here...
>> 
> 
> The other 3 patches are not enough because we don't carry the searchnode
> when the actual alloc_pages_node gets called. 
> 
> With only the 3 patches, we see the above Panic, its signature is slightly
> different from what Sachin first reported and which I have carried in 1st
> patch.

Ah, I see. So that's the missing pgdat after your series [1] right?

That sounds like an argument for Michal's suggestions that pgdats exist and have
correctly populated zonelists for all possible nodes.
node_to_mem_node() could be just a shortcut for the first zone's node in the
zonelist, so that fallback follows the topology.

[1]
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#m76e5b4c4084380b1d4b193d5aa0359b987f2290e
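
To illustrate, a rough and completely untested sketch of what I mean by the
shortcut (this assumes pgdats and zonelists exist for all possible nodes, which
is exactly the open question here):

  static inline int node_to_mem_node(int node)
  {
          struct zonelist *zonelist = node_zonelist(node, GFP_KERNEL);
          struct zoneref *z;

          /* the first zone in the fallback zonelist gives the nearest node */
          z = first_zones_zonelist(zonelist, gfp_zone(GFP_KERNEL), NULL);

          return z->zone ? zone_to_nid(z->zone) : numa_mem_id();
  }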

>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, 
>> > gfp_t flags, int node,
>> >struct kmem_cache_cpu *c)
>> >  {
>> >void *object;
>> > -  int searchnode = node;
>> >  
>> > -  if (node == NUMA_NO_NODE)
>> > -  searchnode = numa_mem_id();
>> > -  else if (!node_present_pages(node))
>> > -  searchnode = node_to_mem_node(node);
>> > -
>> > -  object = get_partial_node(s, get_node(s, searchnode), c, flags);
>> > +  object = get_partial_node(s, get_node(s, node), c, flags);
>> >   if (object || node != NUMA_NO_NODE)
>> >           return object;
>> >
>> >   return get_any_partial(s, flags, c);
>> 
>> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
>> hunk below), thus the get_any_partial() call becomes dead code?
>> 
>> > @@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct 
>> > kmem_cache *s, gfp_t flags,
>> >  
>> >WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
>> >  
>> > +  if (node == NUMA_NO_NODE)
>> > +  node = numa_mem_id();
>> > +  else if (!node_present_pages(node))
>> > +  node = node_to_mem_node(node);
>> > +
>> >freelist = get_partial(s, flags, node, c);
>> >  
>> >if (freelist)
>> > @@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, 
>> > gfp_t gfpflags, int node,
>> >  redo:
>> >  
>> >if (unlikely(!node_match(page, node))) {
>> > -  int searchnode = node;
>> > -
>> >if (node != NUMA_NO_NODE && !node_present_pages(node))
>> > -  searchnode = node_to_mem_node(node);
>> > +  node = node_to_mem_node(node);
>> >  
>> > -  if (unlikely(!node_match(page, searchnode))) {
>> > +  if (unlikely(!node_match(page, node))) {
>> >stat(s, ALLOC_NODE_MISMATCH);
>> >deactivate_slab(s, page, c->freelist, c);
>> >goto new_slab;
>> > 
>> 
> 



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-17 Thread Vlastimil Babka
On 3/16/20 10:06 AM, Michal Hocko wrote:
> On Thu 12-03-20 17:41:58, Vlastimil Babka wrote:
> [...]
>> with nid present in:
>> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some 
>> online
> 
> I would rather have a dummy pgdat for those. Have a look at 
> $ git grep "NODE_DATA.*->" | wc -l
> 63
> 
> Who knows how many else we have there. I haven't looked more closely.
> Besides that what is a real reason to not have pgdat ther and force all
> users of a $random node from those that the platform considers possible
> for special casing? Is that a memory overhead? Is that really a thing?

I guess we can ignore the memory overhead. There might only be some concern
that for nodes that are initially offline, we will allocate the pgdat on a
different node, and after they are onlined, it will stay on a different node
with more access latency from local cpus. If we only allocate for online nodes,
it can always be local? But I guess it doesn't matter that much.

> Somebody has suggested to tweak some of the low level routines to do the
> special casing but I really have to say I do not like that. We shouldn't
> use the first online node or anything like that. We should simply always
> follow the topology presented by FW and of that we need to have a pgdat.
> 



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Vlastimil Babka
On 3/17/20 2:17 PM, Srikar Dronamraju wrote:
> Currently while allocating a slab for a offline node, we use its
> associated node_numa_mem to search for a partial slab. If we don't find
> a partial slab, we try allocating a slab from the offline node using
> __alloc_pages_node. However this is bound to fail.
> 
> NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
> LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
> Call Trace:
> [c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 
> (unreliable)
> [c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
> [c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
> [c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
> [c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
> [c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
> [c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
> [c008b3683b90] [c0234e08] online_css+0x48/0xd0
> [c008b3683bc0] [c023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
> [c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
> [c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
> [c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
> [c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
> [c008b3683e20] [c000b278] system_call+0x5c/0x68
> 
> Mitigate this by allocating the new slab from the node_numa_mem.

Are you sure this is really needed and the other 3 patches are not enough for
the current SLUB code to work as needed? It seems you are changing the semantics
here...

> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> flags, int node,
>   struct kmem_cache_cpu *c)
>  {
>   void *object;
> - int searchnode = node;
>  
> - if (node == NUMA_NO_NODE)
> - searchnode = numa_mem_id();
> - else if (!node_present_pages(node))
> - searchnode = node_to_mem_node(node);
> -
> - object = get_partial_node(s, get_node(s, searchnode), c, flags);
> + object = get_partial_node(s, get_node(s, node), c, flags);
>   if (object || node != NUMA_NO_NODE)
>           return object;
>
>   return get_any_partial(s, flags, c);

I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
hunk below), thus the get_any_partial() call becomes dead code?

> @@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache 
> *s, gfp_t flags,
>  
>   WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
>  
> + if (node == NUMA_NO_NODE)
> + node = numa_mem_id();
> + else if (!node_present_pages(node))
> + node = node_to_mem_node(node);
> +
>   freelist = get_partial(s, flags, node, c);
>  
>   if (freelist)
> @@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, 
> gfp_t gfpflags, int node,
>  redo:
>  
>   if (unlikely(!node_match(page, node))) {
> - int searchnode = node;
> -
>   if (node != NUMA_NO_NODE && !node_present_pages(node))
> - searchnode = node_to_mem_node(node);
> + node = node_to_mem_node(node);
>  
> - if (unlikely(!node_match(page, searchnode))) {
> + if (unlikely(!node_match(page, node))) {
>   stat(s, ALLOC_NODE_MISMATCH);
>   deactivate_slab(s, page, c->freelist, c);
>   goto new_slab;
> 



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-13 Thread Vlastimil Babka
On 3/13/20 12:04 PM, Srikar Dronamraju wrote:
>> I lost all the memory about it. :)
>> Anyway, how about this?
>> 
>> 1. make node_present_pages() safer
>> static inline node_present_pages(nid)
>> {
>> if (!node_online(nid)) return 0;
>> return (NODE_DATA(nid)->node_present_pages);
>> }
>> 
> 
> Yes this would help.

Looks good, yeah.

>> 2. make node_to_mem_node() safer for all cases
>> In ppc arch's mem_topology_setup(void)
>> for_each_present_cpu(cpu) {
>>  numa_setup_cpu(cpu);
>>  mem_node = node_to_mem_node(numa_mem_id());
>>  if (!node_present_pages(mem_node)) {
>>   _node_numa_mem_[numa_mem_id()] = first_online_node;
>>  }
>> }
>> 
> 
> But here as discussed above, we miss the case of possible but not present 
> nodes.
> For such nodes, the above change may not update, resulting in they still
> having 0. And node 0 can be only possible but not present.

So is there another way to do the setup so that node_to_mem_node() returns an
online node with present memory when called for any possible node?
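
E.g. something like this at the end of the arch's NUMA setup - a completely
untested sketch, the function name is made up and I haven't checked whether
N_MEMORY is already populated at that point:

  static void __init fixup_node_numa_mem(void)
  {
          int nid;

          for_each_node(nid) {    /* all possible nodes */
                  if (node_online(nid) && node_present_pages(nid))
                          continue;
                  /* point memoryless/absent nodes at a node with memory */
                  _node_numa_mem_[nid] = first_memory_node;
          }
  }

(_node_numa_mem_ only exists with CONFIG_HAVE_MEMORYLESS_NODES, which IIRC
powerpc selects with NUMA.)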



Re: [5.6.0-rc2-next-20200218/powerpc] Boot failure on POWER9

2020-03-13 Thread Vlastimil Babka
On 3/13/20 12:12 PM, Srikar Dronamraju wrote:
> * Michael Ellerman  [2020-03-13 21:48:06]:
> 
>> Sachin Sant  writes:
>> >> The patch below might work. Sachin can you test this? I tried faking up
>> >> a system with a memoryless node zero but couldn't get it to even start
>> >> booting.
>> >> 
>> > The patch did not help. The kernel crashed during
>> > the boot with the same call trace.
>> >
>> > BUG_ON() introduced with the patch was not triggered.
>> 
>> OK, that's weird.
>> 
>> I eventually managed to get a memoryless node going in sim, and it
>> appears to work there.
>> 
>> eg in dmesg:
>> 
>>   [0.00][T0] numa:   NODE_DATA [mem 
>> 0x2000fffa2f80-0x2000fffa7fff]
>>   [0.00][T0] numa: NODE_DATA(0) on node 1
>>   [0.00][T0] numa:   NODE_DATA [mem 
>> 0x2000fff9df00-0x2000fffa2f7f]
>>   ...
>>   [0.00][T0] Early memory node ranges
>>   [0.00][T0]   node   1: [mem 
>> 0x-0x]
>>   [0.00][T0]   node   1: [mem 
>> 0x2000-0x2000]
>>   [0.00][T0] Could not find start_pfn for node 0
>>   [0.00][T0] Initmem setup node 0 [mem 
>> 0x-0x]
>>   [0.00][T0] On node 0 totalpages: 0
>>   [0.00][T0] Initmem setup node 1 [mem 
>> 0x-0x2000]
>>   [0.00][T0] On node 1 totalpages: 131072
>>   
>>   # dmesg | grep set_numa
>>   [0.00][T0] set_numa_mem: mem node for 0 = 1
>>   [0.005654][T0] set_numa_mem: mem node for 1 = 1
>> 
>> So is the problem more than just node zero having no memory?
>> 
> 
> The problem would happen with possible nodes which are not yet present. i.e
> no cpus, no memory attached to those nodes.
> 
> Please look at
> http://lore.kernel.org/lkml/20200312131438.gb3...@linux.vnet.ibm.com/t/#u
> for more details.
> 
> The summary being: pgdat/Node_Data for such nodes is not allocated. Hence

Michael's log shows that his pgdat is still allocated. But perhaps Sachin also
had your 3 patches from the other thread applied, in addition to Michael's
patch. So in his case the pgdat for node 0 would indeed no longer be allocated,
and thus the SLUB code was crashing in node_present_pages() instead.

> the node_present_pages(nid) called  where nid is a possible but not yet
> present node fails. Currently node_present_pages(nid) and node_to_mem_node
> don't seem to be equipped to handle possible but not present nodes.
> 
>> cheers
> 



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-12 Thread Vlastimil Babka
On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-12 14:51:38]:
> 
>> > * Vlastimil Babka  [2020-03-12 10:30:50]:
>> > 
>> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju 
>> >> >>  wrote:
>> >> >> * Michal Hocko  [2020-03-11 12:57:35]:
>> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc 
>> >> >>>> need
>> >> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >> >>>> proper node.
>> >> >>> 
>> >> >>> Just curious, is this somehow related to
>> >> >>> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
>> >> >>> 
>> >> >> 
>> >> >> The issue I am trying to fix is a known issue in Powerpc since many 
>> >> >> years.
>> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: 
>> >> >> allocate
>> >> >> shrinker_map on appropriate NUMA node"). 
>> >> >> 
>> > 
>> > While I am not an expert in the slub area, I looked at the patch
>> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
>> > 
>> > On the system where the crash happens, the possible number of nodes is much
>> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is 
>> > only
>> > available for onlined nodes.
>> > 
>> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
>> > for all possible nodes and in ___slab_alloc we end up looking at the
>> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
>> > i.e for a node whose pdgat struct is not allocated, we are trying to
>> > dereference.
>> 
>> From what we saw, the pgdat does exist, the problem is that slab's per-node 
>> data
>> doesn't exist for a node that doesn't have present pages, as it would be a 
>> waste
>> of memory.
> 
> Just to be clear
> Before my 3 patches to fix dummy node:
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 0-1

OK

>> 
>> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
>> Sachin's first report [1] we have
>> 
>> [0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> [0.00] numa: NODE_DATA(0) on node 1
>> [0.00] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
>> 
> 
> So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
> rest 30 nodes.

I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should
check online first, as you suggested.

>> But in this thread, with your patches Sachin reports:
> 
> and with my patches
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 1
> 
>> 
>> [0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> 
> 
> so we only see one pgdat.
> 
>> So I assume it's just node 1. In that case, node_present_pages is really 
>> dangerous.
>> 
>> [1]
>> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
>> 
>> > Also for a memoryless/cpuless node or possible but not present nodes,
>> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
>> 
>> I think that's the place where this would be best to fix.
>> 
> 
> Maybe. I thought about it but the current set_numa_mem semantics are apt
> for memoryless cpu node and not for possible nodes.  We could have upto 256
> possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> node_to_mem_node seems to return what is set in set_numa_mem().
> set_numa_mem() seems to say set my numa_mem node for the current memoryless
> node to the param passed.
> 
> But how do we set numa_mem for all the other 253 possible nodes, which
> probably will have 0 as default?
> 
> Should we introduce another API such that we could update for all possible
> nodes?

If we want to rely on node_to_mem_node() to give us something safe for each
possible node, then probably it would have to be like that, yeah.
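
E.g. a trivial setter like this (the name is made up), so the arch code could
fill in the mapping for every possible node and not just the current one:

  static inline void set_node_numa_mem(int node, int mem_node)
  {
          _node_numa_mem_[node] = mem_node;
  }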

>> > I tried with this hunk below and it works.
>> 

Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-12 Thread Vlastimil Babka
On 3/12/20 2:14 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-12 10:30:50]:
> 
>> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju 
>> >>  wrote:
>> >> * Michal Hocko  [2020-03-11 12:57:35]:
>> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >>>> proper node.
>> >>> 
>> >>> Just curious, is this somehow related to
>> >>> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
>> >>> 
>> >> 
>> >> The issue I am trying to fix is a known issue in Powerpc since many years.
>> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> >> shrinker_map on appropriate NUMA node"). 
>> >> 
>> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
>> >> kernel. Will work with Sachin/Abdul (reporters of the issue).
> 
> I had used v1 and not v2. So my mistake.
> 
>> > I applied this 3 patch series on top of March 11 next tree (commit 
>> > d44a64766795 )
>> > The kernel still fails to boot with same call trace.
>> 
> 
> While I am not an expert in the slub area, I looked at the patch
> a75056fc1e7c and had some thoughts on why this could be causing this issue.
> 
> On the system where the crash happens, the possible number of nodes is much
> greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> available for onlined nodes.
> 
> With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> for all possible nodes and in ___slab_alloc we end up looking at the
> node_present_pages which is NODE_DATA(nid)->node_present_pages.
> i.e for a node whose pdgat struct is not allocated, we are trying to
> dereference.

From what we saw, the pgdat does exist, the problem is that slab's per-node data
doesn't exist for a node that doesn't have present pages, as it would be a waste
of memory.

Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
Sachin's first report [1] we have

[0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
[0.00] numa: NODE_DATA(0) on node 1
[0.00] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]

But in this thread, with your patches Sachin reports:

[0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]

So I assume it's just node 1. In that case, node_present_pages is really 
dangerous.

[1]
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/

> Also for a memoryless/cpuless node or possible but not present nodes,
> node_to_mem_node(node) will still end up as node (atleast on powerpc).

I think that's the place where this would best be fixed.

> I tried with this hunk below and it works.
> 
> But I am not sure if we need to check at other places were
> node_present_pages is being called.

This seems to defeat the purpose of node_to_mem_node(), doesn't it? Shouldn't it
return only nodes that are online with present memory?
CCing Joonsoo, who seems to have introduced this in ad2c8144418c ("topology: add
support for node_to_mem_node() to determine the fallback node").

I think we do need well-defined and documented rules around node_to_mem_node(),
cpu_to_node(), the existence of NODE_DATA, the various node_states bitmaps etc.,
so that everyone handles them in the same, safe way.

> diff --git a/mm/slub.c b/mm/slub.c
> index 626cbcbd977f..bddb93bed55e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t 
> gfpflags, int node,
>   if (unlikely(!node_match(page, node))) {
>   int searchnode = node;
>  
> - if (node != NUMA_NO_NODE && !node_present_pages(node))
> - searchnode = node_to_mem_node(node);
> -
> + if (node != NUMA_NO_NODE) {
> + if (!node_online(node) || !node_present_pages(node)) {
> + searchnode = node_to_mem_node(node);
> + if (!node_online(searchnode))
> + searchnode = first_online_node;
> + }
> + }
>   if (unlikely(!node_match(page, searchnode))) {
>   stat(s, ALLOC_NODE_MISMATCH);
>   deactivate_slab(s, page, c->freelist, c);
> 
>> > 
>> 
> 



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-12 Thread Vlastimil Babka
On 3/12/20 9:23 AM, Sachin Sant wrote:
> 
> 
>> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju  
>> wrote:
>> 
>> * Michal Hocko  [2020-03-11 12:57:35]:
>> 
>>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
 A Powerpc system with multiple possible nodes and with CONFIG_NUMA
 enabled always used to have a node 0, even if node 0 does not any cpus
 or memory attached to it. As per PAPR, node affinity of a cpu is only
 available once its present / online. For all cpus that are possible but
 not present, cpu_to_node() would point to node 0.
 
 To ensure a cpuless, memoryless dummy node is not online, powerpc need
 to make sure all possible but not present cpu_to_node are set to a
 proper node.
>>> 
>>> Just curious, is this somehow related to
>>> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
>>> 
>> 
>> The issue I am trying to fix is a known issue in Powerpc since many years.
>> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> shrinker_map on appropriate NUMA node"). 
>> 
>> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
>> kernel. Will work with Sachin/Abdul (reporters of the issue).
>> 
> 
> I applied this 3 patch series on top of March 11 next tree (commit 
> d44a64766795 )
> The kernel still fails to boot with same call trace.

Yeah when I skimmed the patches, I don't think they address the issue where
node_to_mem_node(0) = 0 [1]. You could reapply the debug print patch to verify,
but it seems very likely. So I'm not surprised you get the same trace.

[1] 
https://lore.kernel.org/linux-next/9a86f865-50b5-7483-9257-dbb08fecd...@suse.cz/

> [6.159357] BUG: Kernel NULL pointer dereference on read at 0x73b0
> [6.159363] Faulting instruction address: 0xc03d7174
> [6.159368] Oops: Kernel access of bad area, sig: 11 [#1]
> [6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [6.159378] Modules linked in:
> [6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 
> 5.6.0-rc5-next-20200311-autotest+ #1
> [6.159388] NIP:  c03d7174 LR: c03d7714 CTR: 
> c0400e70
> [6.159393] REGS: c008b36836d0 TRAP: 0300   Not tainted  
> (5.6.0-rc5-next-20200311-autotest+)
> [6.159398] MSR:  80009033   CR: 24004848  
> XER: 
> [6.159406] CFAR: c000dec4 DAR: 73b0 DSISR: 4000 
> IRQMASK: 1
> [6.159406] GPR00: c03d7714 c008b3683960 c155e300 
> c008b301f500
> [6.159406] GPR04: 0dc0  c03456f8 
> c008bb198620
> [6.159406] GPR08: 0008ba0f 0001  
> 
> [6.159406] GPR12: 24004848 c0001ec55e00  
> 
> [6.159406] GPR16: c008b0a82048 c1595898 c1750ca8 
> 0002
> [6.159406] GPR20: c1750cb8 c1624478 000fffe0 
> 5deadbeef122
> [6.159406] GPR24: 0001 0dc0  
> c03456f8
> [6.159406] GPR28: c008b301f500 c008bb198620  
> c00c02285a40
> [6.159453] NIP [c03d7174] ___slab_alloc+0x1f4/0x760
> [6.159458] LR [c03d7714] __slab_alloc+0x34/0x60
> [6.159462] Call Trace:
> [6.159465] [c008b3683a40] [c008b3683a70] 0xc008b3683a70
> [6.159471] [c008b3683a70] [c03d8b20] 
> __kmalloc_node+0x110/0x490
> [6.159477] [c008b3683af0] [c03456f8] kvmalloc_node+0x58/0x110
> [6.159483] [c008b3683b30] [c0400f78] 
> mem_cgroup_css_online+0x108/0x270
> [6.159489] [c008b3683b90] [c0236ed8] online_css+0x48/0xd0
> [6.159494] [c008b3683bc0] [c023ffac] 
> cgroup_apply_control_enable+0x2ec/0x4d0
> [6.159501] [c008b3683ca0] [c02437c8] cgroup_mkdir+0x228/0x5f0
> [6.159506] [c008b3683d10] [c0521780] 
> kernfs_iop_mkdir+0x90/0xf0
> [6.159512] [c008b3683d50] [c043f670] vfs_mkdir+0x110/0x230
> [6.159517] [c008b3683da0] [c0443150] do_mkdirat+0xb0/0x1a0
> [6.159523] [c008b3683e20] [c000b278] system_call+0x5c/0x68
> [6.159527] Instruction dump:
> [6.159531] 7c421378 e95f 714a0001 4082fff0 4b64 6000 6000 
> faa10088
> [6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a  2faa 
> 409e0394 3d02002a
> [6.159545] ---[ end trace 36d65cb66091a5b6 ]---
> 
> Boot log attached.
> 
> Thanks
> -Sachin
> 



Re: [PATCH 2/2] mm/vma: Introduce VM_ACCESS_FLAGS

2020-03-05 Thread Vlastimil Babka
On 3/5/20 7:50 AM, Anshuman Khandual wrote:
> There are many places where all basic VMA access flags (read, write, exec)
> are initialized or checked against as a group. One such example is during
> page fault. Existing vma_is_accessible() wrapper already creates the notion
> of VMA accessibility as a group access permissions. Hence lets just create
> VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which will not only reduce code
> duplication but also extend the VMA accessibility concept in general.
> 
> Cc: Russell King 
> CC: Catalin Marinas 
> CC: Mark Salter 
> Cc: Nick Hu 
> CC: Ley Foon Tan 
> Cc: Michael Ellerman 
> Cc: Heiko Carstens 
> Cc: Yoshinori Sato 
> Cc: Guan Xuetao 
> Cc: Dave Hansen 
> Cc: Thomas Gleixner 
> Cc: Rob Springer 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: de...@driverdev.osuosl.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 

Reviewed-by: Vlastimil Babka 

Thanks.


Re: [RFC 2/3] mm/vma: Introduce VM_ACCESS_FLAGS

2020-03-03 Thread Vlastimil Babka
On 3/2/20 7:47 AM, Anshuman Khandual wrote:
> There are many places where all basic VMA access flags (read, write, exec)
> are initialized or checked against as a group. One such example is during
> page fault. Existing vma_is_accessible() wrapper already creates the notion
> of VMA accessibility as a group access permissions. Hence lets just create
> VM_ACCESS_FLAGS (VM_READ|VM_WRITE|VM_EXEC) which will not only reduce code
> duplication but also extend the VMA accessibility concept in general.
> 
> Cc: Russell King 
> CC: Catalin Marinas 
> CC: Mark Salter 
> Cc: Nick Hu 
> CC: Ley Foon Tan 
> Cc: Michael Ellerman 
> Cc: Heiko Carstens 
> Cc: Yoshinori Sato 
> Cc: Guan Xuetao 
> Cc: Dave Hansen 
> Cc: Thomas Gleixner 
> Cc: Rob Springer 
> Cc: Greg Kroah-Hartman 
> Cc: Andrew Morton 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c6x-...@linux-c6x.org
> Cc: nios2-...@lists.rocketboards.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: de...@driverdev.osuosl.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 

Dunno. Such a mask seems OK for testing flags, but it's a bit awkward when
initializing flags, where it covers just one of the many combinations that seem
to be used; see the examples below. But no strong opinions, the patch looks
correct.
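
To illustrate (hypothetical examples, not taken from the patch):

  /* as a test mask it reads fine: */
  if (!(vma->vm_flags & VM_ACCESS_FLAGS))
          return false;

  /* but initializations come in several flavours, so the mask covers
   * only some of them: */
  vm_flags = VM_ACCESS_FLAGS | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
  vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; /* no VM_EXEC */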


Re: [RFC 1/3] mm/vma: Define a default value for VM_DATA_DEFAULT_FLAGS

2020-03-03 Thread Vlastimil Babka
On 3/2/20 7:47 AM, Anshuman Khandual wrote:
> There are many platforms with exact same value for VM_DATA_DEFAULT_FLAGS
> This creates a default value for VM_DATA_DEFAULT_FLAGS in line with the
> existing VM_STACK_DEFAULT_FLAGS. While here, also define some more macros
> with standard VMA access flag combinations that are used frequently across
> many platforms. Apart from simplification, this reduces code duplication
> as well.
> 
> Cc: Richard Henderson 
> Cc: Vineet Gupta 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Mark Salter 
> Cc: Guo Ren 
> Cc: Yoshinori Sato 
> Cc: Brian Cain 
> Cc: Tony Luck 
> Cc: Geert Uytterhoeven 
> Cc: Michal Simek 
> Cc: Ralf Baechle 
> Cc: Paul Burton 
> Cc: Nick Hu 
> Cc: Ley Foon Tan 
> Cc: Jonas Bonn 
> Cc: "James E.J. Bottomley" 
> Cc: Michael Ellerman 
> Cc: Paul Walmsley 
> Cc: Heiko Carstens 
> Cc: Rich Felker 
> Cc: "David S. Miller" 
> Cc: Guan Xuetao 
> Cc: Thomas Gleixner 
> Cc: Jeff Dike 
> Cc: Chris Zankel 
> Cc: Andrew Morton 
> Cc: linux-al...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c6x-...@linux-c6x.org
> Cc: uclinux-h8-de...@lists.sourceforge.jp
> Cc: linux-hexa...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-m...@vger.kernel.org
> Cc: nios2-...@lists.rocketboards.org
> Cc: openr...@lists.librecores.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ri...@lists.infradead.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@lists.infradead.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux...@kvack.org
> Signed-off-by: Anshuman Khandual 

Nit:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b0e53ef13ff1..7a764ae6ab68 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -342,6 +342,21 @@ extern unsigned int kobjsize(const void *objp);
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP(VM_RAND_READ | VM_SEQ_READ)
>  
> +#define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
> +
> +/* Common data flag combinations */
> +#define VM_DATA_FLAGS_TSK_EXEC   (VM_READ | VM_WRITE | TASK_EXEC | \
> +  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
> +#define VM_DATA_FLAGS_NON_EXEC   (VM_READ | VM_WRITE | VM_MAYREAD | \
> +  VM_MAYWRITE | VM_MAYEXEC)
> +#define VM_DATA_FLAGS_EXEC   (VM_READ | VM_WRITE | VM_EXEC | \
> +  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
> +
> +#ifndef VM_DATA_DEFAULT_FLAGS/* arch can override this */
> +#define VM_DATA_DEFAULT_FLAGS(VM_READ | VM_WRITE | VM_EXEC | \
> +  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)

Should you use VM_DATA_FLAGS_EXEC here? Yeah, it's one more macro to expand, but
it's defined right above this.
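
I.e. the fallback definition would simply become (reusing the macro added
earlier in this patch):

  #ifndef VM_DATA_DEFAULT_FLAGS           /* arch can override this */
  #define VM_DATA_DEFAULT_FLAGS   VM_DATA_FLAGS_EXEC
  #endif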

> +#endif
> +
>  #ifndef VM_STACK_DEFAULT_FLAGS   /* arch can override this */
>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
>  #endif
> 



Re: [5.6.0-rc2-next-20200218/powerpc] Boot failure on POWER9

2020-02-27 Thread Vlastimil Babka
On 2/27/20 5:00 PM, Sachin Sant wrote:
> 
> 
>> On 27-Feb-2020, at 5:42 PM, Michal Hocko  wrote:
>> 
>> A very good hint indeed. I would do this
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index eb2fe6edd73c..d9f1b6737e4d 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -137,6 +137,8 @@ static inline void set_numa_mem(int node)
>> {
>>  this_cpu_write(_numa_mem_, node);
>>  _node_numa_mem_[numa_node_id()] = node;
>> +pr_info("%s %d -> %d\n", __FUNCTION__, numa_node_id(), node);
>> +dump_stack();
>> }
>> #endif
>> 
>> Btw. it would be also helpful to get
>> `faddr2line ___slab_alloc+0x334' from your kernel Sachin.
> 
> [linux-next]# ./scripts/faddr2line ./vmlinux ___slab_alloc+0x334 
> ___slab_alloc+0x334/0x760:
> new_slab_objects at mm/slub.c:2478
> (inlined by) ___slab_alloc at mm/slub.c:2628
> [linux-next]# 

Hmm that doesn't look relevant, but that address was marked as unreliable, no?
Don't we actually need this one?

[8.768727] NIP [c03d55f4] ___slab_alloc+0x1f4/0x760

> I have also attached boot log with a kernel that include about change.
> I see the following o/p during boot:
> 
> [0.005269] set_numa_mem 1 -> 1

So there's no "set_numa_mem 0 -> X", specifically not
"set_numa_mem 0 -> 1" which I would have expected. That seems to confirm my
suspicion that the arch code doesn't set up the memoryless node 0 properly.

> [0.005270] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 
> 5.6.0-rc3-next-20200227-autotest+ #6
> [0.005271] Call Trace:
> [0.005272] [c008b37dfe80] [c0b5d948] dump_stack+0xbc/0x104 
> (unreliable)
> [0.005274] [c008b37dfec0] [c0059320] 
> start_secondary+0x600/0x6e0
> [0.005277] [c008b37dff90] [c000ac54] 
> start_secondary_prolog+0x10/0x14
> 
> Thanks
> -Sachin
> 



Re: [5.6.0-rc2-next-20200218/powerpc] Boot failure on POWER9

2020-02-26 Thread Vlastimil Babka
On 2/26/20 10:45 PM, Vlastimil Babka wrote:
> 
> 
> if (node == NUMA_NO_NODE)
> page = alloc_pages(flags, order);
> else
> page = __alloc_pages_node(node, flags, order);
> 
> So yeah looks like SLUB's kmalloc_node() is supposed to behave like the
> page allocator's __alloc_pages_node() and respect __GFP_THISNODE but not
> enforce it by itself. There's probably just some missing data structure
> initialization somewhere right now for memoryless nodes.

Upon more digging, I think the problem could manifest if
node_to_mem_node(0) (_node_numa_mem_[0]) returned 0 instead of 1,
because it wasn't initialized properly for a memoryless node. Can you
e.g. print it somewhere?
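
E.g. a quick hack like this somewhere after the NUMA setup (not for merging,
just to see the values):

  int nid;

  for_each_online_node(nid)
          pr_info("node_to_mem_node(%d) = %d\n", nid, node_to_mem_node(nid));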


Re: [5.6.0-rc2-next-20200218/powerpc] Boot failure on POWER9

2020-02-26 Thread Vlastimil Babka
On 2/26/20 7:41 PM, Michal Hocko wrote:
> On Wed 26-02-20 18:25:28, Cristopher Lameter wrote:
>> On Mon, 24 Feb 2020, Michal Hocko wrote:
>>
>>> Hmm, nasty. Is there any reason why kmalloc_node behaves differently
>>> from the page allocator?
>>
>> The page allocator will do the same thing if you pass GFP_THISNODE and
>> insist on allocating memory from a node that does not exist.
> 
> I do not think that the page allocator would blow up even with
> GFP_THISNODE. The allocation would just fail on memory less node.
> 
> Besides that kmalloc_node shouldn't really have an implicit GFP_THISNODE
> semantic right? At least I do not see anything like that documented
> anywhere.

Seems like SLAB at least behaves like the page allocator. See
cache_alloc_node() where it basically does:

page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
...
if (!page)
fallback_alloc(cachep, flags)

gfp_exact_node() adds __GFP_THISNODE among other things, so the initial
attempt does try to stick only to the given node. But fallback_alloc()
doesn't. In fact, even if kmalloc_node() was called with __GFP_THISNODE
then it wouldn't work as intended, as fallback_alloc() doesn't get the
nodeid, but instead will use numa_mem_id(). That part could probably be
improved.

SLUB's ___slab_alloc() has for example this:
if (node != NUMA_NO_NODE && !node_present_pages(node))
searchnode = node_to_mem_node(node);

That's from Joonsoo's 2014 commit a561ce00b09e ("slub: fall back to
node_to_mem_node() node if allocating on memoryless node"), suggesting
that the scenario in this bug report should work. Perhaps it just got
broken unintentionally later.

And AFAICS the whole path leading to alloc_slab_page() also doesn't add
__GFP_THISNODE, but will keep it if caller passed it, and ultimately it
does:


if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
page = __alloc_pages_node(node, flags, order);

So yeah looks like SLUB's kmalloc_node() is supposed to behave like the
page allocator's __alloc_pages_node() and respect __GFP_THISNODE but not
enforce it by itself. There's probably just some missing data structure
initialization somewhere right now for memoryless nodes.


Re: [PATCH V2 3/4] mm/vma: Replace all remaining open encodings with is_vm_hugetlb_page()

2020-02-26 Thread Vlastimil Babka
On 2/24/20 6:03 AM, Anshuman Khandual wrote:
> This replaces all remaining open encodings with is_vm_hugetlb_page().
> 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Cc: Alexander Viro 
> Cc: Will Deacon 
> Cc: "Aneesh Kumar K.V" 
> Cc: Andrew Morton 
> Cc: Nick Piggin 
> Cc: Peter Zijlstra 
> Cc: Arnd Bergmann 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux...@kvack.org
> Signed-off-by: Anshuman Khandual 

Meh, why is there _page in the function's name... but too many users to bother
changing it now, I guess.

Acked-by: Vlastimil Babka 

Re: [PATCH V2 2/4] mm/vma: Make vma_is_accessible() available for general use

2020-02-26 Thread Vlastimil Babka
On 2/24/20 6:03 AM, Anshuman Khandual wrote:
> Lets move vma_is_accessible() helper to include/linux/mm.h which makes it
> available for general use. While here, this replaces all remaining open
> encodings for VMA access check with vma_is_accessible().
> 
> Cc: Guo Ren 
> Cc: Geert Uytterhoeven 
> Cc: Ralf Baechle 
> Cc: Paul Burton 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Andrew Morton 
> Cc: Steven Rostedt 
> Cc: Mel Gorman 
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-m...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@vger.kernel.org
> Cc: linux...@kvack.org
> Acked-by: Geert Uytterhoeven 
> Acked-by: Guo Ren 
> Signed-off-by: Anshuman Khandual 

Acked-by: Vlastimil Babka 

Re: [PATCH 2/3] mm/vma: Make vma_is_foreign() available for general use

2020-02-26 Thread Vlastimil Babka
On 2/26/20 5:50 AM, Anshuman Khandual wrote:
> Idea of a foreign VMA with respect to the present context is very generic.
> But currently there are two identical definitions for this in powerpc and
> x86 platforms. Lets consolidate those redundant definitions while making
> vma_is_foreign() available for general use later. This should not cause
> any functional change.
> 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Andrew Morton 
> Cc: x...@kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
> Signed-off-by: Anshuman Khandual 

Some comment for the function wouldn't hurt, but perhaps it is self-explanatory
enough.

Acked-by: Vlastimil Babka 


Re: [PATCH] btrfs: fix allocation of bitmap pages.

2019-08-20 Thread Vlastimil Babka
On 8/20/19 4:30 AM, Christoph Hellwig wrote:
> On Mon, Aug 19, 2019 at 07:46:00PM +0200, David Sterba wrote:
>> Another thing that is lost is the slub debugging support for all
>> architectures, because get_zeroed_pages lacking the red zones and sanity
>> checks.
>> 
>> I find working with raw pages in this code a bit inconsistent with the
>> rest of btrfs code, but that's rather minor compared to the above.
>> 
>> Summing it up, I think that the proper fix should go to copy_page
>> implementation on architectures that require it or make it clear what
>> are the copy_page constraints.
> 
> The whole point of copy_page is to copy exactly one page and it makes
> sense to assume that is aligned.  A sane memcpy would use the same
> underlying primitives as well after checking they fit.  So I think the
> prime issue here is btrfs' use of copy_page instead of memcpy.  The
> secondary issue is slub fucking up alignments for no good reason.  We
> just got bitten by that crap again in XFS as well :(

Meh, I should finally get back to https://lwn.net/Articles/787740/ right




Re: [PATCH v5 3/4] mm: Simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOC

2019-03-06 Thread Vlastimil Babka
On 3/6/19 8:00 PM, Alexandre Ghiti wrote:
> This condition allows to define alloc_contig_range, so simplify
> it into a more accurate naming.
> 
> Suggested-by: Vlastimil Babka 
> Signed-off-by: Alexandre Ghiti 

Acked-by: Vlastimil Babka 

(you could have sent this with my ack from v4 as there wasn't
significant change, just the one I suggested :)


Re: [PATCH v4 4/4] hugetlb: allow to free gigantic pages regardless of the configuration

2019-03-01 Thread Vlastimil Babka
On 3/1/19 2:21 PM, Alexandre Ghiti wrote:
> I collected mistakes here: domain name expired and no mailing list added :)
> Really sorry about that, I missed the whole discussion (if any).
> Could someone forward it to me (if any) ? Thanks !

Bounced you David and Mike's discussion (4 messages total). AFAICS that
was all.


Re: [PATCH v8 1/4] mm/cma: Add PF flag to force non cma alloc

2019-02-28 Thread Vlastimil Babka
On 2/27/19 3:47 PM, Aneesh Kumar K.V wrote:
> This patch adds PF_MEMALLOC_NOCMA which make sure any allocation in that 
> context
> is marked non-movable and hence cannot be satisfied by CMA region.
> 
> This is useful with get_user_pages_longterm where we want to take a page pin 
> by
> migrating pages from CMA region. Marking the section PF_MEMALLOC_NOCMA ensures
> that we avoid unnecessary page migration later.
> 
> Suggested-by: Andrea Arcangeli 
> Reviewed-by: Andrea Arcangeli 
> Signed-off-by: Aneesh Kumar K.V 

+CC scheduler guys

Do we really take the last available PF flag just so that "we avoid
unnecessary page migration later"?
If yes, that's a third PF_MEMALLOC flag; should we get a separate variable
for the gfp context at this point?
Also I don't like the name PF_MEMALLOC_NOCMA, as it's unnecessarily tied
to CMA. If anything it should be e.g. PF_MEMALLOC_NOMOVABLE.
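
I.e. something like this on top of your patch (untested, just to illustrate
the naming), with the flag definition renamed accordingly:

-		if (current->flags & PF_MEMALLOC_NOCMA)
+		if (current->flags & PF_MEMALLOC_NOMOVABLE)
 			flags &= ~__GFP_MOVABLE;

The #ifdef CONFIG_CMA around that check could then arguably go away too, since
clearing __GFP_MOVABLE doesn't really depend on CMA.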

Thanks.

> ---
>  include/linux/sched.h|  1 +
>  include/linux/sched/mm.h | 48 +---
>  2 files changed, 41 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f9b43c989577..dfa90088ba08 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1403,6 +1403,7 @@ extern struct pid *cad_pid;
>  #define PF_UMH   0x0200  /* I'm an 
> Usermodehelper process */
>  #define PF_NO_SETAFFINITY0x0400  /* Userland is not allowed to 
> meddle with cpus_allowed */
>  #define PF_MCE_EARLY 0x0800  /* Early kill for mce process 
> policy */
> +#define PF_MEMALLOC_NOCMA0x1000 /* All allocation request will have 
> _GFP_MOVABLE cleared */
>  #define PF_MUTEX_TESTER  0x2000  /* Thread belongs to 
> the rt mutex tester */
>  #define PF_FREEZER_SKIP  0x4000  /* Freezer should not 
> count it as freezable */
>  #define PF_SUSPEND_TASK  0x8000  /* This thread called 
> freeze_processes() and should not be frozen */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 3bfa6a0cbba4..0cd9f10423fb 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -148,17 +148,25 @@ static inline bool in_vfork(struct task_struct *tsk)
>   * Applies per-task gfp context to the given allocation flags.
>   * PF_MEMALLOC_NOIO implies GFP_NOIO
>   * PF_MEMALLOC_NOFS implies GFP_NOFS
> + * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>   */
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
> - /*
> -  * NOIO implies both NOIO and NOFS and it is a weaker context
> -  * so always make sure it makes precedence
> -  */
> - if (unlikely(current->flags & PF_MEMALLOC_NOIO))
> - flags &= ~(__GFP_IO | __GFP_FS);
> - else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
> - flags &= ~__GFP_FS;
> + if (unlikely(current->flags &
> +  (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | 
> PF_MEMALLOC_NOCMA))) {
> + /*
> +  * NOIO implies both NOIO and NOFS and it is a weaker context
> +  * so always make sure it makes precedence
> +  */
> + if (current->flags & PF_MEMALLOC_NOIO)
> + flags &= ~(__GFP_IO | __GFP_FS);
> + else if (current->flags & PF_MEMALLOC_NOFS)
> + flags &= ~__GFP_FS;
> +#ifdef CONFIG_CMA
> + if (current->flags & PF_MEMALLOC_NOCMA)
> + flags &= ~__GFP_MOVABLE;
> +#endif
> + }
>   return flags;
>  }
>  
> @@ -248,6 +256,30 @@ static inline void memalloc_noreclaim_restore(unsigned 
> int flags)
>   current->flags = (current->flags & ~PF_MEMALLOC) | flags;
>  }
>  
> +#ifdef CONFIG_CMA
> +static inline unsigned int memalloc_nocma_save(void)
> +{
> + unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
> +
> + current->flags |= PF_MEMALLOC_NOCMA;
> + return flags;
> +}
> +
> +static inline void memalloc_nocma_restore(unsigned int flags)
> +{
> + current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
> +}
> +#else
> +static inline unsigned int memalloc_nocma_save(void)
> +{
> + return 0;
> +}
> +
> +static inline void memalloc_nocma_restore(unsigned int flags)
> +{
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMCG
>  /**
>   * memalloc_use_memcg - Starts the remote memcg charging scope.
> 



Re: [PATCH v3] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-15 Thread Vlastimil Babka
On 2/14/19 8:31 PM, Alexandre Ghiti wrote:
> On systems without CMA or (MEMORY_ISOLATION && COMPACTION) activated but
> that support gigantic pages, boottime reserved gigantic pages can not be
> freed at all. This patch simply enables the possibility to hand back
> those pages to memory allocator.
> 
> This patch also renames:
> 
> - the triplet CMA or (MEMORY_ISOLATION && COMPACTION) into CONTIG_ALLOC,
> and gets rid of all use of it in architecture specific code (and then
> removes ARCH_HAS_GIGANTIC_PAGE config).
> - gigantic_page_supported to make it more accurate: this value being false
> does not mean that the system cannot use gigantic pages, it just means that
> runtime allocation of gigantic pages is not supported, one can still
> allocate boottime gigantic pages if the architecture supports it.
> 
> Signed-off-by: Alexandre Ghiti 

Acked-by: Vlastimil Babka 

Thanks!

...

> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -252,12 +252,17 @@ config MIGRATION
> pages as migration can relocate pages to satisfy a huge page
> allocation instead of reclaiming.
>  
> +

Stray newline? No need to resend, Andrew can fix up.
Ah, he wasn't in To:, adding.

>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>   bool
>  
>  config ARCH_ENABLE_THP_MIGRATION
>   bool
>  
> +config CONTIG_ALLOC
> + def_bool y
> + depends on (MEMORY_ISOLATION && COMPACTION) || CMA
> +
>  config PHYS_ADDR_T_64BIT
>   def_bool 64BIT
>  


Re: [PATCH v2] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-14 Thread Vlastimil Babka
On 2/13/19 8:30 PM, Dave Hansen wrote:
>> -#if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || 
>> defined(CONFIG_CMA)
>> +#ifdef CONFIG_COMPACTION_CORE
>>  static __init int gigantic_pages_init(void)
>>  {
>>  /* With compaction or CMA we can allocate gigantic pages at runtime */
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index ac474a61be37..8fecd3ea5563 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -207,8 +207,9 @@ config HUGETLB_PAGE
>>  config MEMFD_CREATE
>>  def_bool TMPFS || HUGETLBFS
>>  
>> -config ARCH_HAS_GIGANTIC_PAGE
>> +config COMPACTION_CORE
>>  bool
>> +default y if (MEMORY_ISOLATION && MIGRATION) || CMA
> 
> This takes a hard dependency (#if) and turns it into a Kconfig *default*
> that can be overridden.  That seems like trouble.
> 
> Shouldn't it be:
> 
> config COMPACTION_CORE
>   def_bool y
>   depends on (MEMORY_ISOLATION && MIGRATION) || CMA

Agreed. Also I noticed that it now depends on MIGRATION instead of
COMPACTION. That intention is correct IMHO, but will fail to
compile/link when both COMPACTION and CMA are disabled, and would need
more changes in mm/internal.h and mm/compaction.c (possibly just
replacing CMA in all "if defined CONFIG_COMPACTION || defined
CONFIG_CMA" instances with COMPACTION_CORE, but there might be more
problems, wanna try? :)
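
I.e. roughly this kind of change (untested, just the idea):

-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+#if defined CONFIG_COMPACTION || defined CONFIG_COMPACTION_CORE

so that the isolation helpers alloc_contig_range() needs keep being built
whenever the new option is enabled, even with COMPACTION=n and CMA=n.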

Also, I realized that COMPACTION_CORE is a wrong name, sorry about that.
What the config really provides is alloc_contig_range(), so it should be
named either CONFIG_CMA_CORE (as it provides contiguous memory
allocation, but not the related reservation and accounting), or
something like CONFIG_CONTIG_ALLOC. I would also move it from fs/Kconfig
to mm/Kconfig.

Thanks!


Re: [PATCH] hugetlb: allow to free gigantic pages regardless of the configuration

2019-02-13 Thread Vlastimil Babka
On 1/17/19 7:39 PM, Alexandre Ghiti wrote:
> From: Alexandre Ghiti 
> 
> On systems without CMA or (MEMORY_ISOLATION && COMPACTION) activated but
> that support gigantic pages, boottime reserved gigantic pages can not be
> freed at all. This patchs simply enables the possibility to hand back
> those pages to memory allocator.
> 
> This commit then renames gigantic_page_supported and
> ARCH_HAS_GIGANTIC_PAGE to make them more accurate. Indeed, those values
> being false does not mean that the system cannot use gigantic pages: it
> just means that runtime allocation of gigantic pages is not supported,
> one can still allocate boottime gigantic pages if the architecture supports
> it.
> 
> Signed-off-by: Alexandre Ghiti 

I'm fine with the change, but wonder if this can be structured better in a way
which would remove the duplicated "if (MEMORY_ISOLATION && COMPACTION) || CMA"
from all arches, as well as the duplicated
gigantic_page_runtime_allocation_supported()

something like:

- "select ARCH_HAS_GIGANTIC_PAGE" has no conditions, it just says the arch can
support them either at boottime or runtime (but runtime is usable only if other
conditions are met)
- gigantic_page_runtime_allocation_supported() is a function that returns true
if ARCH_HAS_GIGANTIC_PAGE && ((MEMORY_ISOLATION && COMPACTION) || CMA) and
there's a single instance, not per-arch (rough sketch below).
- code for freeing gigantic pages can probably still be conditional on
ARCH_HAS_GIGANTIC_PAGE
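
For the second point I mean roughly this (untested sketch of a single generic
helper instead of the per-arch copies):

static inline bool gigantic_page_runtime_allocation_supported(void)
{
	if (!IS_ENABLED(CONFIG_ARCH_HAS_GIGANTIC_PAGE))
		return false;

	return IS_ENABLED(CONFIG_CMA) ||
	       (IS_ENABLED(CONFIG_MEMORY_ISOLATION) &&
		IS_ENABLED(CONFIG_COMPACTION));
}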

BTW I wanted also to do something about the "(MEMORY_ISOLATION && COMPACTION) ||
CMA" ugliness itself, i.e. put the common parts behind some new kconfig
(COMPACTION_CORE ?) and expose it better to users, but I can take a stab on that
once the above part is settled.

Vlastimil


Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-17 Thread Vlastimil Babka
On 10/16/18 9:43 PM, Joel Fernandes wrote:
> On Tue, Oct 16, 2018 at 01:29:52PM +0200, Vlastimil Babka wrote:
>> On 10/16/18 12:33 AM, Joel Fernandes wrote:
>>> On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
>>>> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
>>>>> Android needs to mremap large regions of memory during memory management
>>>>> related operations.
>>>>
>>>> Just curious: why?
>>>
>>> In Android we have a requirement of moving a large (up to a GB now, but may
>>> grow bigger in future) memory range from one location to another.
>>
>> I think Christoph's "why?" was about the requirement, not why it hurts
>> applications. I admit I'm now also curious :)
> 
> This issue was discovered when we wanted to be able to move the physical
> pages of a memory range to another location quickly so that, after the
> application threads are resumed, UFFDIO_REGISTER_MODE_MISSING userfaultfd
> faults can be received on the original memory range. The actual operations
> performed on the memory range are beyond the scope of this discussion. The
> user threads continue to refer to the old address which will now fault. The
> reason we want retain the old memory range and receives faults there is to
> avoid the need to fix the addresses all over the address space of the threads
> after we finish with performing operations on them in the fault handlers, so
> we mremap it and receive faults at the old addresses.
> 
> Does that answer your question?

Yes, interesting, thanks!

Vlastimil

> thanks,
> 
> - Joel
> 



Re: [PATCH 2/4] mm: speed up mremap by 500x on large regions (v2)

2018-10-16 Thread Vlastimil Babka
On 10/16/18 12:33 AM, Joel Fernandes wrote:
> On Mon, Oct 15, 2018 at 02:42:09AM -0700, Christoph Hellwig wrote:
>> On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
>>> Android needs to mremap large regions of memory during memory management
>>> related operations.
>>
>> Just curious: why?
> 
> In Android we have a requirement of moving a large (up to a GB now, but may
> grow bigger in future) memory range from one location to another.

I think Christoph's "why?" was about the requirement, not why it hurts
applications. I admit I'm now also curious :)

> This move
> operation has to happen when the application threads are paused for this
> operation. Therefore, an inefficient move like it is now (for example 250ms
> on arm64) will cause response time issues for applications, which is not
> acceptable. Huge pages cannot be used in such memory ranges to avoid this
> inefficiency as (when the application threads are running) our fault handlers
> are designed to process 4KB pages at a time, to keep response times low. So
> using huge pages in this context can, again, cause response time issues.
> 
> Also, the mremap syscall waiting for quarter of a second for a large mremap
> is quite weird and we ought to improve it where possible.


Re: [PATCH 2/2] dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()

2018-07-16 Thread Vlastimil Babka
On 07/09/2018 02:19 PM, Marek Szyprowski wrote:
> The CMA memory allocator doesn't support standard gfp flags for memory
> allocation, so there is no point having it as a parameter for
> dma_alloc_from_contiguous() function. Replace it by a boolean no_warn
> argument, which covers all the underlaying cma_alloc() function supports.
> 
> This will help to avoid giving false feeling that this function supports
> standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
> what has already been an issue: see commit dd65a941f6ba ("arm64:
> dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
> 
> Signed-off-by: Marek Szyprowski 

Acked-by: Vlastimil Babka 


Re: [PATCH 1/2] mm/cma: remove unsupported gfp_mask parameter from cma_alloc()

2018-07-16 Thread Vlastimil Babka
On 07/09/2018 02:19 PM, Marek Szyprowski wrote:
> cma_alloc() function doesn't really support gfp flags other than
> __GFP_NOWARN, so convert gfp_mask parameter to boolean no_warn parameter.
> 
> This will help to avoid giving false feeling that this function supports
> standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
> what has already been an issue: see commit dd65a941f6ba ("arm64:
> dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").
> 
> Signed-off-by: Marek Szyprowski 

Acked-by: Vlastimil Babka 


Re: [PATCH V2 1/2] mm/page_alloc: Replace set_dma_reserve to set_memory_reserve

2016-08-05 Thread Vlastimil Babka

On 08/05/2016 09:24 AM, Srikar Dronamraju wrote:

* Vlastimil Babka <vba...@suse.cz> [2016-08-05 08:45:03]:


@@ -5493,10 +5493,10 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
}

/* Account for reserved pages */
-   if (j == 0 && freesize > dma_reserve) {
-   freesize -= dma_reserve;
+   if (j == 0 && freesize > nr_memory_reserve) {


Will this really work (together with patch 2) as intended?
This j == 0 means that we are doing this only for the first zone, which is
ZONE_DMA (or ZONE_DMA32) on node 0 on many systems. I.e. I don't think it's
really true that "dma_reserve has nothing to do with DMA or ZONE_DMA".

This zone will have limited amount of memory, so the "freesize >
nr_memory_reserve" will easily be false once you set this to many gigabytes,
so in fact nothing will get subtracted.

On the other hand if the kernel has both CONFIG_ZONE_DMA and
CONFIG_ZONE_DMA32 disabled, then j == 0 will be true for ZONE_NORMAL. This
zone might be present on multiple nodes (unless they are configured as
movable) and then the value intended to be global will be subtracted from
several nodes.

I don't know what's the exact ppc64 situation here, perhaps there are indeed
no DMA/DMA32 zones, and the fadump kernel only uses one node, so it works in
the end, but it doesn't seem very robust to me?



At the page initialization time, powerpc seems to have just one zone
spread across the 16 nodes.

From the dmesg.

[0.00] Memory hole size: 0MB
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x-0x1f5c8fff]
[0.00]   DMA32empty
[0.00]   Normal   empty
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x-0x01fb4fff]
[0.00]   node   1: [mem 0x01fb5000-0x03fa8fff]
[0.00]   node   2: [mem 0x03fa9000-0x05f9cfff]
[0.00]   node   3: [mem 0x05f9d000-0x07f8efff]
[0.00]   node   4: [mem 0x07f8f000-0x09f81fff]
[0.00]   node   5: [mem 0x09f82000-0x0bf77fff]
[0.00]   node   6: [mem 0x0bf78000-0x0df6dfff]
[0.00]   node   7: [mem 0x0df6e000-0x0ff63fff]
[0.00]   node   8: [mem 0x0ff64000-0x11f58fff]
[0.00]   node   9: [mem 0x11f59000-0x13644fff]
[0.00]   node  10: [mem 0x13645000-0x1563afff]
[0.00]   node  11: [mem 0x1563b000-0x17630fff]
[0.00]   node  12: [mem 0x17631000-0x19625fff]
[0.00]   node  13: [mem 0x19626000-0x1b5dcfff]
[0.00]   node  14: [mem 0x1b5dd000-0x1d5d2fff]
[0.00]   node  15: [mem 0x1d5d3000-0x1f5c8fff]


Hmm so it will work for ppc64 and its fadump, but I'm not happy that we 
made the function name sound like it's generic (unlike when the name 
contained "dma"), while it only works as intended in specific corner 
cases. The next user might be surprised...




Re: [PATCH V2 1/2] mm/page_alloc: Replace set_dma_reserve to set_memory_reserve

2016-08-05 Thread Vlastimil Babka

On 08/04/2016 07:12 PM, Srikar Dronamraju wrote:

Expand the scope of the existing dma_reserve to accommodate other memory
reserves too. Accordingly rename variable dma_reserve to
nr_memory_reserve.

set_memory_reserve also takes a new parameter that helps to identify if
the current value needs to be incremented.

Suggested-by: Mel Gorman 
Signed-off-by: Srikar Dronamraju 
---
 arch/x86/kernel/e820.c |  2 +-
 include/linux/mm.h |  2 +-
 mm/page_alloc.c| 20 
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 621b501..d935983 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1188,6 +1188,6 @@ void __init memblock_find_dma_reserve(void)
nr_free_pages += end_pfn - start_pfn;
}

-   set_dma_reserve(nr_pages - nr_free_pages);
+   set_memory_reserve(nr_pages - nr_free_pages, false);
 #endif
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f468e0..c884ffb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1886,7 +1886,7 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
struct mminit_pfnnid_cache *state);
 #endif

-extern void set_dma_reserve(unsigned long new_dma_reserve);
+extern void set_memory_reserve(unsigned long nr_reserve, bool inc);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
unsigned long, enum memmap_context);
 extern void setup_per_zone_wmarks(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1069ef..a154c2f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -253,7 +253,7 @@ int watermark_scale_factor = 10;

 static unsigned long __meminitdata nr_kernel_pages;
 static unsigned long __meminitdata nr_all_pages;
-static unsigned long __meminitdata dma_reserve;
+static unsigned long __meminitdata nr_memory_reserve;

 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
@@ -5493,10 +5493,10 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
}

/* Account for reserved pages */
-   if (j == 0 && freesize > dma_reserve) {
-   freesize -= dma_reserve;
+   if (j == 0 && freesize > nr_memory_reserve) {


Will this really work (together with patch 2) as intended?
This j == 0 means that we are doing this only for the first zone, which 
is ZONE_DMA (or ZONE_DMA32) on node 0 on many systems. I.e. I don't 
think it's really true that "dma_reserve has nothing to do with DMA or 
ZONE_DMA".


This zone will have limited amount of memory, so the "freesize > 
nr_memory_reserve" will easily be false once you set this to many 
gigabytes, so in fact nothing will get subtracted.
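
(To illustrate with made-up numbers: a 16MB ZONE_DMA is at most 16MB / 4KB =
4096 pages of freesize, so any nr_memory_reserve in the gigabytes makes the
condition false and nothing gets subtracted there.)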


On the other hand if the kernel has both CONFIG_ZONE_DMA and 
CONFIG_ZONE_DMA32 disabled, then j == 0 will be true for ZONE_NORMAL. 
This zone might be present on multiple nodes (unless they are configured 
as movable) and then the value intended to be global will be subtracted 
from several nodes.


I don't know what's the exact ppc64 situation here, perhaps there are 
indeed no DMA/DMA32 zones, and the fadump kernel only uses one node, so 
it works in the end, but it doesn't seem very robust to me?



+   freesize -= nr_memory_reserve;
printk(KERN_DEBUG "  %s zone: %lu pages reserved\n",
-   zone_names[0], dma_reserve);
+   zone_names[0], nr_memory_reserve);
}

if (!is_highmem_idx(j))
@@ -6186,8 +6186,9 @@ void __init mem_init_print_info(const char *str)
 }

 /**
- * set_dma_reserve - set the specified number of pages reserved in the first 
zone
- * @new_dma_reserve: The number of pages to mark reserved
+ * set_memory_reserve - set number of pages reserved in the first zone
+ * @nr_reserve: The number of pages to mark reserved
+ * @inc: true increment to existing value; false set new value.
  *
  * The per-cpu batchsize and zone watermarks are determined by managed_pages.
  * In the DMA zone, a significant percentage may be consumed by kernel image
@@ -6196,9 +6197,12 @@ void __init mem_init_print_info(const char *str)
  * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and
  * smaller per-cpu batchsize.
  */
-void __init set_dma_reserve(unsigned long new_dma_reserve)
+void __init set_memory_reserve(unsigned long nr_reserve, bool inc)
 {
-   dma_reserve = new_dma_reserve;
+   if (inc)
+   nr_memory_reserve += nr_reserve;
+   else
+   nr_memory_reserve = nr_reserve;
 }

 void __init free_area_init(unsigned long *zones_size)





Re: [PATCH 2/2] fadump: Disable deferred page struct initialisation

2016-08-03 Thread Vlastimil Babka

On 08/03/2016 07:20 AM, Balbir Singh wrote:

On Tue, 2016-08-02 at 18:49 +0530, Srikar Dronamraju wrote:

Fadump kernel reserves significant number of memory blocks. On a multi-node
machine, with CONFIG_DEFFERRED_STRUCT_PAGE support, fadump kernel fails to
boot. Fix this by disabling deferred page struct initialisation.



How much memory does a fadump kernel need? Can we bump up the limits depending
on the config. I presume when you say fadump kernel you mean kernel with
FADUMP in the config?

BTW, I would much rather prefer a config based solution that does not select
DEFERRED_INIT if FADUMP is enabled.


IIRC the kdump/fadump kernel is typically the same vmlinux as the main 
kernel, just with special initrd and boot params. So if you want 
deferred init for the main kernel, this would be impractical.



Balbir




Re: [PATCH 1/2] mm: meminit: initialise more memory for inode/dentry hash tables in early boot

2016-03-08 Thread Vlastimil Babka
On 03/08/2016 04:55 AM, Li Zhang wrote:
> From: Li Zhang <zhlci...@linux.vnet.ibm.com>
> 
> This patch is based on Mel Gorman's old patch in the mailing list,
> https://lkml.org/lkml/2015/5/5/280 which is discussed but it is
> fixed with a completion to wait for all memory initialised in
> page_alloc_init_late(). It is to fix the OOM problem on X86
> with 24TB memory which allocates memory in late initialisation.
> But for Power platform with 32TB memory, it causes a call trace
> in vfs_caches_init->inode_init() and inode hash table needs more
> memory.
> So this patch allocates 1GB for 0.25TB/node for large system
> as it is mentioned in https://lkml.org/lkml/2015/5/1/627
> 
> This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes.
> Currently, it only allocates 2GB*16=32GB for early initialisation. But
> Dentry cache hash table needes 16GB and Inode cache hash table needs
> 16GB. So the system have no enough memory for it.
> The log from dmesg as the following:
> 
> Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
> vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
> swapper/0: page allocation failure: order:0,mode:0x2080020
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
> Call Trace:
> [c12bfa00] [c07c4a50].dump_stack+0xb4/0xb664 (unreliable)
> [c12bfa80] [c01f93d4].warn_alloc_failed+0x114/0x160
> [c12bfb30] [c023c204].__vmalloc_area_node+0x1a4/0x2b0
> [c12bfbf0] [c023c3f4].__vmalloc_node_range+0xe4/0x110
> [c12bfc90] [c023c460].__vmalloc_node+0x40/0x50
> [c12bfd10] [c0b67d60].alloc_large_system_hash+0x134/0x2a4
> [c12bfdd0] [c0b70924].inode_init+0xa4/0xf0
> [c12bfe60] [c0b706a0].vfs_caches_init+0x80/0x144
> [c12bfef0] [c0b35208].start_kernel+0x40c/0x4e0
> [c12bff90] [c0008cfc]start_here_common+0x20/0x4a4
> Mem-Info:
> 
> Acked-by: Mel Gorman <mgor...@techsingularity.net>
> Signed-off-by: Li Zhang <zhlci...@linux.vnet.ibm.com>

Acked-by: Vlastimil Babka <vba...@suse.cz>


Re: [PATCH RFC 1/2] mm: meminit: initialise more memory for inode/dentry hash tables in early boot

2016-03-04 Thread Vlastimil Babka
On 03/03/2016 08:01 AM, Li Zhang wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -293,13 +293,20 @@ static inline bool update_defer_init(pg_data_t *pgdat,
>   unsigned long pfn, unsigned long zone_end,
>   unsigned long *nr_initialised)
>  {
> + unsigned long max_initialise;
> +
>   /* Always populate low zones for address-contrained allocations */
>   if (zone_end < pgdat_end_pfn(pgdat))
>   return true;
> + /*
> + * Initialise at least 2G of a node but also take into account that
> + * two large system hashes that can take up 1GB for 0.25TB/node.
> + */

The indentation is wrong here.

> + max_initialise = max(2UL << (30 - PAGE_SHIFT),
> + (pgdat->node_spanned_pages >> 8));
>  
> - /* Initialise at least 2G of the highest zone */
>   (*nr_initialised)++;
> - if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) &&
> + if ((*nr_initialised > max_initialise) &&
>   (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>   pgdat->first_deferred_pfn = pfn;
>   return false;
> 


[PATCH 2/3] mm: unify checks in alloc_pages_node() and __alloc_pages_node()

2015-08-20 Thread Vlastimil Babka
Perform the same debug checks in alloc_pages_node() as are done in
__alloc_pages_node(), by making the former function a wrapper of the latter
one.

In addition to better diagnostics in DEBUG_VM builds for situations which
have been already fatal (e.g. out-of-bounds node id), there are two visible
changes for potential existing buggy callers of alloc_pages_node():

- calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
  overflow) was treated as passing NUMA_NO_NODE and fallback to local node was
  applied. This will now be fatal.
- calling alloc_pages_node() with an offline node will now be checked for
  DEBUG_VM builds. Since it's not fatal if the node has been previously online,
  and this patch may expose some existing buggy callers, change the VM_BUG_ON
  in __alloc_pages_node() to VM_WARN_ON.

Signed-off-by: Vlastimil Babka vba...@suse.cz
Acked-by: David Rientjes rient...@google.com
Acked-by: Johannes Weiner han...@cmpxchg.org
Acked-by: Christoph Lameter c...@linux.com
---
 include/linux/gfp.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index d2c142b..4a12cae2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -310,23 +310,23 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
-   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+   VM_WARN_ON(!node_online(nid));
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 /*
  * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
- * prefer the current CPU's node.
+ * prefer the current CPU's node. Otherwise node must be valid and online.
  */
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
-   /* Unknown node is current node */
-   if (nid < 0)
+   if (nid == NUMA_NO_NODE)
nid = numa_node_id();
 
-   return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+   return __alloc_pages_node(nid, gfp_mask, order);
 }
 
 #ifdef CONFIG_NUMA
-- 
2.5.0


[PATCH 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-08-20 Thread Vlastimil Babka
The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 (page
allocator: do not check NUMA node ID when the caller knows the node is valid)
as an optimized variant of alloc_pages_node(), that doesn't fallback to current
node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily
suggest that the allocation is restricted to the given node and fails
otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is
passed among the gfp flags.

The misleading name has lead to mistakes in the past, see 5265047ac301 (mm,
thp: really limit transparent hugepage allocation to local node) and
b360edb43f8e (mm, mempolicy: migrate_to_node should only migrate to node).

Another issue with the name is that there's a family of alloc_pages_exact*()
functions where 'exact' means exact size (instead of page order), which leads
to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's
an optimized variant of alloc_pages_node() not intended for general usage.
Both functions get described in comments.

It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't duplicate
the API needlessly. The number of users would be small anyway.

Existing callers of alloc_pages_exact_node() are simply converted to call
__alloc_pages_node(), with the exception of sba_alloc_coherent() which
open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON
checks, and since the current check for nid in alloc_pages_node() uses a 'nid <
0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which
would be previously exposed. Both differences will be rectified by the next
patch.

To sum up, this patch makes no functional changes, except temporarily hiding
potentially buggy callers. Restricting the checks in alloc_pages_node() is
left for the next patch which can in turn expose more existing buggy callers.

Signed-off-by: Vlastimil Babka vba...@suse.cz
Acked-by: Johannes Weiner han...@cmpxchg.org
Cc: Mel Gorman mgor...@suse.de
Cc: David Rientjes rient...@google.com
Cc: Greg Thelen gthe...@google.com
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Acked-by: Christoph Lameter c...@linux.com
Cc: Pekka Enberg penb...@kernel.org
Cc: Joonsoo Kim iamjoonsoo@lge.com
Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Arnd Bergmann a...@arndb.de
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Acked-by: Michael Ellerman m...@ellerman.id.au
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Cliff Whickman c...@sgi.com
Acked-by: Robin Holt robinmh...@gmail.com
---
 v3: fixed the slob part (Christoph Lameter), removed forgotten hunk from
 huge_memory.c (DavidR)

 arch/ia64/hp/common/sba_iommu.c   |  6 +-
 arch/ia64/kernel/uncached.c   |  2 +-
 arch/ia64/sn/pci/pci_dma.c|  2 +-
 arch/powerpc/platforms/cell/ras.c |  2 +-
 arch/x86/kvm/vmx.c|  2 +-
 drivers/misc/sgi-xp/xpc_uv.c  |  2 +-
 include/linux/gfp.h   | 23 +++
 kernel/profile.c  |  8 
 mm/filemap.c  |  2 +-
 mm/huge_memory.c  |  2 +-
 mm/hugetlb.c  |  4 ++--
 mm/memory-failure.c   |  2 +-
 mm/mempolicy.c|  4 ++--
 mm/migrate.c  |  4 ++--
 mm/page_alloc.c   |  2 --
 mm/slab.c |  2 +-
 mm/slob.c |  4 ++--
 mm/slub.c |  2 +-
 18 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 344387a..a6d6190 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1140,13 +1140,9 @@ sba_alloc_coherent(struct device *dev, size_t size, 
dma_addr_t *dma_handle,
 
 #ifdef CONFIG_NUMA
{
-   int node = ioc->node;
struct page *page;
 
-   if (node == NUMA_NO_NODE)
-   node = numa_node_id();
-
-   page = alloc_pages_exact_node(node, flags, get_order(size));
+   page = alloc_pages_node(ioc->node, flags, get_order(size));
if (unlikely(!page))
return NULL;
 
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 20e8a9b..f3976da 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -97,7 +97,7 @@ static int uncached_add_chunk(struct

[PATCH 3/3] mm: use numa_mem_id() in alloc_pages_node()

2015-08-20 Thread Vlastimil Babka
alloc_pages_node() might fail when called with NUMA_NO_NODE and
__GFP_THISNODE on a CPU belonging to a memoryless node. To make the
local-node fallback more robust and prevent such situations, use
numa_mem_id(), which was introduced for similar scenarios in the slab
context.

Suggested-by: Christoph Lameter c...@linux.com
Signed-off-by: Vlastimil Babka vba...@suse.cz
Acked-by: David Rientjes rient...@google.com
Acked-by: Mel Gorman mgor...@techsingularity.net
Acked-by: Christoph Lameter c...@linux.com
---
 v3: better commit message

 include/linux/gfp.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4a12cae2..f92cbd2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -318,13 +318,14 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int 
order)
 
 /*
  * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
- * prefer the current CPU's node. Otherwise node must be valid and online.
+ * prefer the current CPU's closest node. Otherwise node must be valid and
+ * online.
  */
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
if (nid == NUMA_NO_NODE)
-   nid = numa_node_id();
+   nid = numa_mem_id();
 
return __alloc_pages_node(nid, gfp_mask, order);
 }
-- 
2.5.0


Re: [PATCH V6 4/6] mm: mlock: Add mlock flags to enable VM_LOCKONFAULT usage

2015-08-07 Thread Vlastimil Babka

On 07/29/2015 05:42 PM, Eric B Munson wrote:

The previous patch introduced a flag that specified pages in a VMA
should be placed on the unevictable LRU, but they should not be made
present when the area is created.  This patch adds the ability to set
this state via the new mlock system calls.

We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
MCL_ONFAULT should be used as a modifier to the two other mlockall
flags.  When used with MCL_CURRENT, all current mappings will be marked
with VM_LOCKED | VM_LOCKONFAULT.  When used with MCL_FUTURE, the
mm-def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT.  When used
with both MCL_CURRENT and MCL_FUTURE, all current mappings and
mm-def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT.

Prior to this patch, mlockall() will unconditionally clear the
mm-def_flags any time it is called without MCL_FUTURE.  This behavior
is maintained after adding MCL_ONFAULT.  If a call to
mlockall(MCL_FUTURE) is followed by mlockall(MCL_CURRENT), the
mm-def_flags will be cleared and new VMAs will be unlocked.  This
remains true with or without MCL_ONFAULT in either mlockall()
invocation.

munlock() will unconditionally clear both vma flags.  munlockall()
unconditionally clears for VMA flags on all VMAs and in the
mm-def_flags field.

Signed-off-by: Eric B Munson emun...@akamai.com
Cc: Michal Hocko mho...@suse.cz
Cc: Vlastimil Babka vba...@suse.cz


The logic seems ok, although the fact that apply_mlockall_flags() is 
shared by both mlockall and munlockall makes it even more subtle than 
before :)


Anyway, just some nitpicks below.

Acked-by: Vlastimil Babka vba...@suse.cz

[...]


+/*
+ * Take the MCL_* flags passed into mlockall (or 0 if called from munlockall)
+ * and translate into the appropriate modifications to mm->def_flags and/or the
+ * flags for all current VMAs.
+ *
+ * There are a couple of sublties with this.  If mlockall() is called multiple


^ typo


+ * times with different flags, the values do not necessarily stack.  If mlockall
+ * is called once including the MCL_FUTURE flag and then a second time without
+ * it, VM_LOCKED and VM_LOCKONFAULT will be cleared from mm->def_flags.
+ */
  static int apply_mlockall_flags(int flags)
  {
struct vm_area_struct * vma, * prev = NULL;
+   vm_flags_t to_add = 0;

-   if (flags & MCL_FUTURE)
+   current->mm->def_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+   if (flags & MCL_FUTURE) {
current->mm->def_flags |= VM_LOCKED;
-   else
-   current->mm->def_flags &= ~VM_LOCKED;

-   if (flags == MCL_FUTURE)
-   goto out;
+   if (flags & MCL_ONFAULT)
+   current->mm->def_flags |= VM_LOCKONFAULT;
+
+   /*
+* When there were only two flags, we used to early out if only
+* MCL_FUTURE was set.  Now that we have MCL_ONFAULT, we can
+* only early out if MCL_FUTURE is set, but MCL_CURRENT is not.


Describing the relation to history of individual code lines in such 
detail is noise imho. The stacking subtleties is already described above.



+* This is done, even though it promotes odd behavior, to
+* maintain behavior from older kernels
+*/
+   if (!(flags & MCL_CURRENT))
+   goto out;


Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()

2015-08-06 Thread Vlastimil Babka

On 07/30/2015 07:41 PM, Johannes Weiner wrote:

On Thu, Jul 30, 2015 at 06:34:31PM +0200, Vlastimil Babka wrote:

numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
so it's a more robust fallback than the currently used numa_node_id().


Won't it fall through to the next closest memory node in the zonelist
anyway?


Right, I would expect the zonelist of a memoryless node to be the same as
that of the closest node. Documentation/vm/numa seems to agree.


Is this for callers doing NUMA_NO_NODE with __GFP_THISZONE?

I guess that's the only scenario where that matters, yeah. And there 
might well be no such caller now, but maybe some will sneak in without 
the author testing on a system with a memoryless node.


Note that with !CONFIG_HAVE_MEMORYLESS_NODES, numa_mem_id() just does 
numa_node_id().


So yeah I think a more robust fallback is correct :) But let's put it 
explicitly in changelog then:


8

alloc_pages_node() might fail when called with NUMA_NO_NODE and 
__GFP_THISNODE on a CPU belonging to a memoryless node. To make the 
local-node fallback more robust and prevent such situations, use 
numa_mem_id(), which was introduced for similar scenarios in the slab 
context.


Re: [PATCH V6 2/6] mm: mlock: Add new mlock system call

2015-08-06 Thread Vlastimil Babka

On 07/29/2015 05:42 PM, Eric B Munson wrote:

With the refactored mlock code, introduce a new system call for mlock.
The new call will allow the user to specify what lock states are being
added.  mlock2 is trivial at the moment, but a follow on patch will add
a new mlock state making it useful.

Signed-off-by: Eric B Munson emun...@akamai.com
Cc: Michal Hocko mho...@suse.cz
Cc: Vlastimil Babka vba...@suse.cz


Acked-by: Vlastimil Babka vba...@suse.cz


Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-31 Thread Vlastimil Babka
On 31.7.2015 23:25, David Rientjes wrote:
 On Thu, 30 Jul 2015, Vlastimil Babka wrote:
 
 diff --git a/mm/huge_memory.c b/mm/huge_memory.c
 index aa58a32..56355f2 100644
 --- a/mm/huge_memory.c
 +++ b/mm/huge_memory.c
 @@ -2469,7 +2469,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, 
 struct mm_struct *mm,
   */
  up_read(mm-mmap_sem);
  
 -*hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER);
 +*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
  if (unlikely(!*hpage)) {
  count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
  *hpage = ERR_PTR(-ENOMEM);
 @@ -2568,9 +2568,7 @@ static void collapse_huge_page(struct mm_struct *mm,
  
 VM_BUG_ON(address & ~HPAGE_PMD_MASK);
  
 -/* Only allocate from the target node */
 -gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) |
 -__GFP_THISNODE;
 +gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), 0);
  
  /* release the mmap_sem read lock. */
  new_page = khugepaged_alloc_page(hpage, gfp, mm, vma, address, node);
 
 Hmm, where is the __GFP_THISNODE enforcement in khugepaged_alloc_page() 
 that is removed in collapse_huge_page()?  I also don't see what happened 
 to the __GFP_OTHER_NODE.

Crap, I messed up with git, this hunk was supposed to be gone. Thanks for
noticing. Please apply without the collapse_huge_page hunk, or tell me to resend
once more.

Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-30 Thread Vlastimil Babka
On 07/30/2015 07:58 PM, Christoph Lameter wrote:
 On Thu, 30 Jul 2015, Vlastimil Babka wrote:
 
 --- a/mm/slob.c
 +++ b/mm/slob.c
  void *page;

 -#ifdef CONFIG_NUMA
 -if (node != NUMA_NO_NODE)
 -page = alloc_pages_exact_node(node, gfp, order);
 -else
 -#endif
 -page = alloc_pages(gfp, order);
 +page = alloc_pages_node(node, gfp, order);
 
 NAK. This is changing slob behavior. With no node specified it must use
 alloc_pages because that obeys NUMA memory policies etc etc. It should not
 force allocation from the current node like what is happening here after
 the patch. See the code in slub.c that is similar.
 
Doh, somehow I convinced myself that there's #else and alloc_pages() is only
used for !CONFIG_NUMA so it doesn't matter. Here's a fixed version.

--8--
From: Vlastimil Babka vba...@suse.cz
Date: Fri, 24 Jul 2015 15:49:47 +0200
Subject: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to
 __alloc_pages_node

The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 (page
allocator: do not check NUMA node ID when the caller knows the node is valid)
as an optimized variant of alloc_pages_node(), that doesn't fallback to current
node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily
suggest that the allocation is restricted to the given node and fails
otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is
passed among the gfp flags.

The misleading name has lead to mistakes in the past, see 5265047ac301 (mm,
thp: really limit transparent hugepage allocation to local node) and
b360edb43f8e (mm, mempolicy: migrate_to_node should only migrate to node).

Another issue with the name is that there's a family of alloc_pages_exact*()
functions where 'exact' means exact size (instead of page order), which leads
to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's
an optimized variant of alloc_pages_node() not intended for general usage.
Both functions get described in comments.

It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't duplicate
the API needlessly. The number of users would be small anyway.

Existing callers of alloc_pages_exact_node() are simply converted to call
__alloc_pages_node(), with the exception of sba_alloc_coherent() which
open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON
checks, and since the current check for nid in alloc_pages_node() uses a 'nid <
0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which
would be previously exposed. Both differences will be rectified by the next
patch.

To sum up, this patch makes no functional changes, except temporarily hiding
potentially buggy callers. Restricting the checks in alloc_pages_node() is
left for the next patch which can in turn expose more existing buggy callers.

Signed-off-by: Vlastimil Babka vba...@suse.cz
Acked-by: Johannes Weiner han...@cmpxchg.org
Cc: Mel Gorman mgor...@suse.de
Cc: David Rientjes rient...@google.com
Cc: Greg Thelen gthe...@google.com
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: Christoph Lameter c...@linux.com
Cc: Pekka Enberg penb...@kernel.org
Cc: Joonsoo Kim iamjoonsoo@lge.com
Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Arnd Bergmann a...@arndb.de
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Acked-by: Michael Ellerman m...@ellerman.id.au
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Cliff Whickman c...@sgi.com
Acked-by: Robin Holt robinmh...@gmail.com
---
 arch/ia64/hp/common/sba_iommu.c   |  6 +-
 arch/ia64/kernel/uncached.c   |  2 +-
 arch/ia64/sn/pci/pci_dma.c|  2 +-
 arch/powerpc/platforms/cell/ras.c |  2 +-
 arch/x86/kvm/vmx.c|  2 +-
 drivers/misc/sgi-xp/xpc_uv.c  |  2 +-
 include/linux/gfp.h   | 23 +++
 kernel/profile.c  |  8 
 mm/filemap.c  |  2 +-
 mm/huge_memory.c  |  6 ++
 mm/hugetlb.c  |  4 ++--
 mm/memory-failure.c   |  2 +-
 mm/mempolicy.c|  4 ++--
 mm/migrate.c  |  4 ++--
 mm/page_alloc.c   |  2 --
 mm/slab.c |  2 +-
 mm/slob.c |  4 ++--
 mm/slub.c |  2 +-
 18 files changed, 39 insertions(+), 40 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common

[PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()

2015-07-30 Thread Vlastimil Babka
numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
so it's a more robust fallback than the currently used numa_node_id().

Suggested-by: Christoph Lameter c...@linux.com
Signed-off-by: Vlastimil Babka vba...@suse.cz
Acked-by: David Rientjes rient...@google.com
Acked-by: Mel Gorman mgor...@techsingularity.net
---
 include/linux/gfp.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4a12cae2..f92cbd2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -318,13 +318,14 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int 
order)
 
 /*
  * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
- * prefer the current CPU's node. Otherwise node must be valid and online.
+ * prefer the current CPU's closest node. Otherwise node must be valid and
+ * online.
  */
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
if (nid == NUMA_NO_NODE)
-   nid = numa_node_id();
+   nid = numa_mem_id();
 
return __alloc_pages_node(nid, gfp_mask, order);
 }
-- 
2.4.6


[PATCH v3 2/3] mm: unify checks in alloc_pages_node() and __alloc_pages_node()

2015-07-30 Thread Vlastimil Babka
Perform the same debug checks in alloc_pages_node() as are done in
__alloc_pages_node(), by making the former function a wrapper of the latter
one.

In addition to better diagnostics in DEBUG_VM builds for situations which
have been already fatal (e.g. out-of-bounds node id), there are two visible
changes for potential existing buggy callers of alloc_pages_node():

- calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
  overflow) was treated as passing NUMA_NO_NODE and fallback to local node was
  applied. This will now be fatal.
- calling alloc_pages_node() with an offline node will now be checked for
  DEBUG_VM builds. Since it's not fatal if the node has been previously online,
  and this patch may expose some existing buggy callers, change the VM_BUG_ON
  in __alloc_pages_node() to VM_WARN_ON.

Signed-off-by: Vlastimil Babka vba...@suse.cz
Acked-by: David Rientjes rient...@google.com
---
v3: I think this is enough for what patch 4/4 in v2 tried to provide on top,
so there's no patch 4/4 anymore. The DEBUG_VM-only fixup didn't seem
justified to me.

 include/linux/gfp.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index d2c142b..4a12cae2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -310,23 +310,23 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
-   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+   VM_WARN_ON(!node_online(nid));
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 /*
  * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
- * prefer the current CPU's node.
+ * prefer the current CPU's node. Otherwise node must be valid and online.
  */
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
-   /* Unknown node is current node */
-   if (nid < 0)
+   if (nid == NUMA_NO_NODE)
nid = numa_node_id();
 
-   return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+   return __alloc_pages_node(nid, gfp_mask, order);
 }
 
 #ifdef CONFIG_NUMA
-- 
2.4.6


[PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-30 Thread Vlastimil Babka
The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 (page
allocator: do not check NUMA node ID when the caller knows the node is valid)
as an optimized variant of alloc_pages_node(), that doesn't fallback to current
node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily
suggest that the allocation is restricted to the given node and fails
otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is
passed among the gfp flags.

The misleading name has lead to mistakes in the past, see 5265047ac301 (mm,
thp: really limit transparent hugepage allocation to local node) and
b360edb43f8e (mm, mempolicy: migrate_to_node should only migrate to node).

Another issue with the name is that there's a family of alloc_pages_exact*()
functions where 'exact' means exact size (instead of page order), which leads
to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's
an optimized variant of alloc_pages_node() not intended for general usage.
Both functions get described in comments.

It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't duplicate
the API needlessly. The number of users would be small anyway.

Existing callers of alloc_pages_exact_node() are simply converted to call
__alloc_pages_node(), with two exceptions. sba_alloc_coherent() and
slob_new_page() both open-code the check for NUMA_NO_NODE, so they are
converted to use alloc_pages_node() instead. This means they no longer perform
some VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE),
it may hide wrong values which would be previously exposed. Both differences
will be rectified by the next patch.

To sum up, this patch makes no functional changes, except temporarily hiding
potentially buggy callers. Restricting the checks in alloc_pages_node() is
left for the next patch which can in turn expose more existing buggy callers.

Signed-off-by: Vlastimil Babka vba...@suse.cz
Cc: Mel Gorman mgor...@suse.de
Cc: David Rientjes rient...@google.com
Cc: Greg Thelen gthe...@google.com
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: Christoph Lameter c...@linux.com
Cc: Pekka Enberg penb...@kernel.org
Cc: Joonsoo Kim iamjoonsoo@lge.com
Cc: Naoya Horiguchi n-horigu...@ah.jp.nec.com
Cc: Tony Luck tony.l...@intel.com
Cc: Fenghua Yu fenghua...@intel.com
Cc: Arnd Bergmann a...@arndb.de
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Acked-by: Michael Ellerman m...@ellerman.id.au
Cc: Gleb Natapov g...@kernel.org
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Cc: Cliff Whickman c...@sgi.com
Acked-by: Robin Holt robinmh...@gmail.com
---
 Based on feedback from v1 and v2, The name is __alloc_pages_node() instead of
 alloc_pages_prefer_node() from v1. Two callsites were also converted to
 alloc_pages_node() instead. v2 was a RFC for linux-mm to settle on the API
 first. It tried keeping alloc_pages_exact_node() as a wrapper that adds
 __GFP_THISNODE but the consensus was to drop it.

 I'm CC'ing also maintainers of the callsites so they can verify that the
 callsites that don't pass __GFP_THISNODE are really not intended to restrict
 allocation to the given node. I went through them myself and each looked like
 it's better off if it can successfully allocate on a fallback node rather
 than fail. DavidR checked them also I think, but it's better if maintainers
 can verify that. I'm not completely sure about all the usages in sl*b due to
 multiple layers through which gfp flags are being passed.

 Patches 2 and 3 are mm-only so I don't CC everyone.

 arch/ia64/hp/common/sba_iommu.c   |  6 +-
 arch/ia64/kernel/uncached.c   |  2 +-
 arch/ia64/sn/pci/pci_dma.c|  2 +-
 arch/powerpc/platforms/cell/ras.c |  2 +-
 arch/x86/kvm/vmx.c|  2 +-
 drivers/misc/sgi-xp/xpc_uv.c  |  2 +-
 include/linux/gfp.h   | 23 +++
 kernel/profile.c  |  8 
 mm/filemap.c  |  2 +-
 mm/huge_memory.c  |  6 ++
 mm/hugetlb.c  |  4 ++--
 mm/memory-failure.c   |  2 +-
 mm/mempolicy.c|  4 ++--
 mm/migrate.c  |  4 ++--
 mm/page_alloc.c   |  2 --
 mm/slab.c |  2 +-
 mm/slob.c | 14 --
 mm/slub.c |  2 +-
 18 files changed, 41 insertions(+), 48 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 344387a..a6d6190 100644
--- a/arch/ia64/hp/common

Re: [PATCH V5 0/7] Allow user to request memory to be locked on page fault

2015-07-29 Thread Vlastimil Babka
On 07/29/2015 12:45 PM, Michal Hocko wrote:
 In a much less
 likely corner case, it is not possible in the current setup to request
 all current VMAs be VM_LOCKONFAULT and all future be VM_LOCKED.
 
 Vlastimil has already pointed that out. MCL_FUTURE doesn't clear
 MCL_CURRENT. I was quite surprised in the beginning but it makes a
 perfect sense. mlockall call shouldn't lead into munlocking, that would
 be just weird. Clearing MCL_FUTURE on MCL_CURRENT makes sense on the
 other hand because the request is explicit about _current_ memory and it
 doesn't lead to any munlocking.

Yeah after more thinking it does make some sense despite the perceived
inconsistency, but it's definitely worth documenting properly. It also already
covers the usecase for munlockall2(MCL_FUTURE) which IIRC you had in the earlier
revisions...

Re: [PATCH V5 0/7] Allow user to request memory to be locked on page fault

2015-07-28 Thread Vlastimil Babka

On 07/28/2015 01:17 PM, Michal Hocko wrote:

[I am sorry but I didn't get to this sooner.]

On Mon 27-07-15 10:54:09, Eric B Munson wrote:

Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
cannot be specified independentally, it might make more sense to mirror
that relationship to userspace.  Which would lead to soemthing like the
following:


A modifier makes more sense.


To lock and populate a region:
mlock2(start, len, 0);

To lock on fault a region:
mlock2(start, len, MLOCK_ONFAULT);

If LOCKONFAULT is seen as a modifier to mlock, then having the flags
argument as 0 mean do mlock classic makes more sense to me.

To mlock current on fault only:
mlockall(MCL_CURRENT | MCL_ONFAULT);

To mlock future on fault only:
mlockall(MCL_FUTURE | MCL_ONFAULT);

To lock everything on fault:
mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);


Makes sense to me. The only remaining and still tricky part would be
the munlock{all}(flags) behavior. What should munlock(MLOCK_ONFAULT)
do? Keep locked and poppulate the range or simply ignore the flag an
just unlock?


munlock(all) already lost both MLOCK_LOCKED and MLOCK_ONFAULT flags in 
this revision, so I suppose in the next revision it will also not accept 
MLOCK_ONFAULT, and will just munlock whatever was mlocked in either mode.



I can see some sense to allow munlockall(MCL_FUTURE[|MLOCK_ONFAULT]),
munlockall(MCL_CURRENT) resp. munlockall(MCL_CURRENT|MCL_FUTURE) but
other combinations sound weird to me.


The effect of munlockall(MCL_FUTURE|MLOCK_ONFAULT), which you probably 
intended for converting the onfault to full prepopulation for future 
mappings, can be achieved by calling mlockall(MCL_FUTURE) (without 
MLOCK_ONFAULT).
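
I.e. a hypothetical sequence with the proposed flags:

	mlockall(MCL_FUTURE | MCL_ONFAULT);	/* new mappings locked on fault */
	/* ... */
	mlockall(MCL_FUTURE);			/* new mappings fully populated */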



Anyway munlock with flags opens new doors of trickiness.





Re: [PATCH V5 0/7] Allow user to request memory to be locked on page fault

2015-07-27 Thread Vlastimil Babka

On 07/27/2015 04:54 PM, Eric B Munson wrote:

On Mon, 27 Jul 2015, Vlastimil Babka wrote:



We do actually have an MCL_LOCKED, we just call it MCL_CURRENT.  Would
you prefer that I match the name in mlock2() (add MLOCK_CURRENT
instead)?


Hm it's similar but not exactly the same, because MCL_FUTURE is not
the same as MLOCK_ONFAULT :) So MLOCK_CURRENT would be even more
confusing. Especially if mlockall(MCL_CURRENT | MCL_FUTURE) is OK,
but mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) is invalid.


MLOCK_ONFAULT isn't meant to be the same as MCL_FUTURE, rather it is
meant to be the same as MCL_ONFAULT.  MCL_FUTURE only controls if the
locking policy will be applied to any new mappings made by this process,
not the locking policy itself.  The better comparison is MCL_CURRENT to
MLOCK_LOCK and MCL_ONFAULT to MLOCK_ONFAULT.  MCL_CURRENT and
MLOCK_LOCK do the same thing, only one requires a specific range of
addresses while the other works process wide.  This is why I suggested
changing MLOCK_LOCK to MLOCK_CURRENT.  It is an error to call
mlock2(MLOCK_LOCK | MLOCK_ONFAULT) just like it is an error to call
mlockall(MCL_CURRENT | MCL_ONFAULT).  The combinations do not make sense.


How is it an error to call mlockall(MCL_CURRENT | MCL_ONFAULT)? How else 
would you apply mlock2(MLOCK_ONFAULT) to all current mappings? Later below 
you use the same example, and I don't think it's any different after removing 
the MLOCK_LOCKED flag.
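
(For concreteness, a minimal sketch of that combination -- MCL_ONFAULT's value
comes from this series, so it is guarded below; purely illustrative:)

#include <sys/mman.h>
#include <stdio.h>

#ifndef MCL_ONFAULT
#define MCL_ONFAULT 4	/* placeholder for the value proposed in the series */
#endif

int main(void)
{
	/* Lock all current mappings, but only as their pages are faulted
	 * in -- the process-wide analogue of mlock2(MLOCK_ONFAULT). */
	if (mlockall(MCL_CURRENT | MCL_ONFAULT))
		perror("mlockall(MCL_CURRENT | MCL_ONFAULT)");

	/* Adding MCL_FUTURE extends the same policy to future mappings. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT))
		perror("mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)");

	munlockall();
	return 0;
}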



This was all decided when VM_LOCKONFAULT was a separate state from
VM_LOCKED.  Now that VM_LOCKONFAULT is a modifier to VM_LOCKED and
cannot be specified independentally, it might make more sense to mirror
that relationship to userspace.  Which would lead to soemthing like the
following:

To lock and populate a region:
mlock2(start, len, 0);

To lock on fault a region:
mlock2(start, len, MLOCK_ONFAULT);

If LOCKONFAULT is seen as a modifier to mlock, then having the flags
argument as 0 mean do mlock classic makes more sense to me.


Yup that's what I was trying to suggest.


To mlock current on fault only:
mlockall(MCL_CURRENT | MCL_ONFAULT);

To mlock future on fault only:
mlockall(MCL_FUTURE | MCL_ONFAULT);

To lock everything on fault:
mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);

I think I have talked myself into rewriting the set again :/


Sorry :) You could also wait a bit for more input than just from me...




Finally, on the question of MAP_LOCKONFAULT, do you just dislike
MAP_LOCKED and do not want to see it extended, or is this a NAK on the
set if that patch is included?  I ask because I have to spin a V6 to get
the MLOCK flag declarations right, but I would prefer not to do a V7+.
If this is a NAK with it included, I can drop that patch and rework the tests to
cover the functionality without the mmap flag.  Otherwise I want to keep it, as I have an
internal user that would like to see it added.


I don't want to NAK that patch if you think it's useful.





Re: [PATCH V5 0/7] Allow user to request memory to be locked on page fault

2015-07-27 Thread Vlastimil Babka

On 07/27/2015 03:35 PM, Eric B Munson wrote:

On Mon, 27 Jul 2015, Vlastimil Babka wrote:


On 07/24/2015 11:28 PM, Eric B Munson wrote:

...


Changes from V4:
Drop all architectures for new sys call entries except x86[_64] and MIPS
Drop munlock2 and munlockall2
Make VM_LOCKONFAULT a modifier to VM_LOCKED only to simplify book keeping
Adjust tests to match


Hi, thanks for considering my suggestions. Well, I do hope they
were correct, as APIs are hard and I'm no API expert. But since
APIs are also impossible to change after merging, I'm sorry but
I'll keep pestering for one last thing. Thanks again for persisting,
I do believe it's for the good!

The thing is that I still don't like that one has to call
mlock2(MLOCK_LOCKED) to get the equivalent of the old mlock(). Why
is that flag needed? We have two modes of locking now, and v5 no
longer treats them separately in vma flags. But having two flags
gives us four possible combinations, so two of them would serve
nothing but to confuse the programmer IMHO. What will mlock2()
without flags do? What will mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) do?
(Note I haven't studied the code yet, as having agreed on the API
should come first. But I did suggest documenting these things more
thoroughly too...)
OK I checked now and both cases above seem to return EINVAL.

So about the only point I see in MLOCK_LOCKED flag is parity with
MAP_LOCKED for mmap(). But as Kirill said (and me before as well)
MAP_LOCKED is broken anyway so we shouldn't twist the rest just of
the API to keep the poor thing happier in its misery.

Also note that AFAICS you don't have MCL_LOCKED for mlockall() so
there's no full parity anyway. But please don't fix that by adding
MCL_LOCKED :)

Thanks!



I have an MLOCK_LOCKED flag because I prefer an interface to be
explicit.


I think it's already explicit enough that the user calls mlock2(), no? 
He obviously wants the range mlocked. An optional flag says that there 
should be no pre-fault.



The caller of mlock2() will be required to fill in the flags
argument regardless.


I guess users not caring about MLOCK_ONFAULT will continue using plain 
mlock() without flags anyway.


I can drop the MLOCK_LOCKED flag with 0 being the
value for LOCKED, but I thought it easier to make clear what was going
on at any call to mlock2().  If user space defines a MLOCK_LOCKED that
happens to be 0, I suppose that would be okay.


Yeah, that would remove the weird 4-states-of-which-2-are-invalid problem 
I mentioned, but at the cost of the glibc wrapper behaving differently from 
the kernel syscall itself. For little gain.
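
(If it helps, the "MLOCK_LOCKED defined as 0 in userspace" idea could look
something like the hypothetical shim below; none of these names or values are
taken from the series itself, and __NR_mlock2 is again assumed to come from
the patched headers:)

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical userspace wrapper header, for illustration only. */
#define MLOCK_LOCKED	0x00	/* classic mlock semantics: populate now   */
#define MLOCK_ONFAULT	0x01	/* lock pages only as they are faulted in  */

static inline int mlock2_wrapper(const void *addr, size_t len, int flags)
{
	/*
	 * With MLOCK_LOCKED being 0, MLOCK_LOCKED | MLOCK_ONFAULT is simply
	 * MLOCK_ONFAULT, so the confusing four-state flag space collapses to
	 * two states before the kernel ever sees the value.
	 */
	return syscall(__NR_mlock2, addr, len, flags);
}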



We do actually have an MCL_LOCKED, we just call it MCL_CURRENT.  Would
you prefer that I match the name in mlock2() (add MLOCK_CURRENT
instead)?


Hm it's similar but not exactly the same, because MCL_FUTURE is not the 
same as MLOCK_ONFAULT :) So MLOCK_CURRENT would be even more confusing. 
Especially if mlockall(MCL_CURRENT | MCL_FUTURE) is OK, but 
mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) is invalid.



Finally, on the question of MAP_LOCKONFAULT, do you just dislike
MAP_LOCKED and do not want to see it extended, or is this a NAK on the
set if that patch is included?  I ask because I have to spin a V6 to get
the MLOCK flag declarations right, but I would prefer not to do a V7+.
If this is a NAK with it included, I can drop that patch and rework the tests to
cover the functionality without the mmap flag.  Otherwise I want to keep it, as I have an
internal user that would like to see it added.


I don't want to NAK that patch if you think it's useful.



Re: [PATCH V5 0/7] Allow user to request memory to be locked on page fault

2015-07-27 Thread Vlastimil Babka

On 07/24/2015 11:28 PM, Eric B Munson wrote:

...


Changes from V4:
Drop all architectures for new sys call entries except x86[_64] and MIPS
Drop munlock2 and munlockall2
Make VM_LOCKONFAULT a modifier to VM_LOCKED only to simplify book keeping
Adjust tests to match


Hi, thanks for considering my suggestions. Well, I do hope they were 
correct, as APIs are hard and I'm no API expert. But since APIs are 
also impossible to change after merging, I'm sorry but I'll keep 
pestering for one last thing. Thanks again for persisting, I do believe 
it's for the good!


The thing is that I still don't like that one has to call 
mlock2(MLOCK_LOCKED) to get the equivalent of the old mlock(). Why is 
that flag needed? We have two modes of locking now, and v5 no longer 
treats them separately in vma flags. But having two flags gives us four 
possible combinations, so two of them would serve nothing but to confuse 
the programmer IMHO. What will mlock2() without flags do? What will 
mlock2(MLOCK_LOCKED | MLOCK_ONFAULT) do? (Note I haven't studied the 
code yet, as having agreed on the API should come first. But I did 
suggest documenting these things more thoroughly too...)

OK I checked now and both cases above seem to return EINVAL.

So about the only point I see in MLOCK_LOCKED flag is parity with 
MAP_LOCKED for mmap(). But as Kirill said (and me before as well) 
MAP_LOCKED is broken anyway, so we shouldn't twist the rest of the 
API just to keep the poor thing happier in its misery.


Also note that AFAICS you don't have MCL_LOCKED for mlockall() so 
there's no full parity anyway. But please don't fix that by adding 
MCL_LOCKED :)


Thanks!

Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-24 Thread Vlastimil Babka

On 07/23/2015 10:27 PM, David Rientjes wrote:

On Thu, 23 Jul 2015, Christoph Lameter wrote:


The only possible downside would be existing users of
alloc_pages_node() that are calling it with an offline node.  Since it's a
VM_BUG_ON() that would catch that, I think it should be changed to a
VM_WARN_ON() and eventually fixed up because it's nonsensical.
VM_BUG_ON() here should be avoided.


The offline node thing could be addressed by using numa_mem_id()?



I was concerned about any callers that were passing an offline node, not
NUMA_NO_NODE, today.  One of the alloc-node functions has a VM_BUG_ON()
for it, the other silently calls node_zonelist() on it.

I suppose the final alloc_pages_node() implementation could be

	if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid)))
		nid = numa_mem_id();

	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));

though.


I've posted v2 based on David's and Christoph's suggestions (thanks) but 
to avoid spamming everyone until we agree on the final interface, it's 
marked as RFC and excludes the arch people from CC:


http://marc.info/?l=linux-kernel&m=143774920902608&w=2
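
For illustration, the caller-side contract being discussed might end up looking
like this (a sketch of a hypothetical caller, not code from the patch or the v2):

/* Hypothetical caller, illustrating the two cases discussed above. */
static struct page *alloc_buffer_page(int nid, unsigned int order)
{
	/*
	 * nid may be NUMA_NO_NODE ("don't care", resolved to numa_mem_id())
	 * or an explicit node id.  With the VM_WARN_ON() variant quoted
	 * above, an offline node id warns and falls back instead of either
	 * tripping VM_BUG_ON() or silently building a zonelist for an
	 * offline node.
	 */
	return alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO, order);
}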
