[PATCH resend] x86,kvm: Add a kernel parameter to disable PV spinlock

2017-09-04 Thread Oscar Salvador
This is just a resend of Waiman Long's patch.
I could not find out why it was not merged upstream, so I thought
I would give it another chance.
What follows is what Waiman Long wrote.

Xen has a kernel command line argument "xen_nopvspin" to disable
paravirtual spinlocks. This patch adds a similar "kvm_nopvspin"
argument to disable paravirtual spinlocks for KVM. This can be useful
for testing as well as for allowing administrators to choose unfair
locks for their KVM guests if they want to.

Signed-off-by: Waiman Long 
Signed-off-by: Oscar Salvador 
---
 Documentation/admin-guide/kernel-parameters.txt |  6 +++++-
 arch/x86/kernel/kvm.c                           | 14 +++++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index d9c171ce4190..56c6e3acdf8e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1899,6 +1899,10 @@
feature (tagged TLBs) on capable Intel chips.
Default is 1 (enabled)
 
+   kvm_nopvspin[X86,KVM]
+   Disables the paravirtualized spinlock slowpath
+   optimizations for KVM.
+
l2cr=   [PPC]
 
l3cr=   [PPC]
@@ -4533,7 +4537,7 @@
never -- do not unplug even if version check succeeds
 
xen_nopvspin[X86,XEN]
-   Disables the ticketlock slowpath using Xen PV
+   Disables the spinlock slowpath using Xen PV
optimizations.
 
xen_nopv[X86]
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d04e30e3c0ff..51addf874fc1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -568,6 +568,18 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+static bool kvm_pvspin = true;
+
+/*
+ * Allow disabling of PV spinlock in kernel command line
+ */
+static __init int kvm_parse_nopvspin(char *arg)
+{
+   kvm_pvspin = false;
+   return 0;
+}
+early_param("kvm_nopvspin", kvm_parse_nopvspin);
+
 #include <asm/qspinlock.h>
 
 static void kvm_wait(u8 *ptr, u8 val)
@@ -633,7 +645,7 @@ asm(
  */
 void __init kvm_spinlock_init(void)
 {
-   if (!kvm_para_available())
+   if (!kvm_para_available() || !kvm_pvspin)
return;
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
-- 
2.13.5



Re: [PATCH resend] x86,kvm: Add a kernel parameter to disable PV spinlock

2017-09-05 Thread Oscar Salvador



On 09/05/2017 08:28 AM, Juergen Gross wrote:

On 05/09/17 00:21, Davidlohr Bueso wrote:

On Mon, 04 Sep 2017, Peter Zijlstra wrote:


For testing its trivial to hack your kernel and I don't feel this is
something an Admin can make reasonable decisions about.

So why? In general less knobs is better.

+1.

Also, note how b8fa70b51aa (xen, pvticketlocks: Add xen_nopvspin parameter
to disable xen pv ticketlocks) has no justification as to why its wanted
in the first place. The only thing I could find was from 15a3eac0784
(xen/spinlock: Document the xen_nopvspin parameter):

"Useful for diagnosing issues and comparing benchmarks in over-commit
CPU scenarios."

Hmm, I think I should clarify the Xen knob, as I was the one requesting
it:

In my previous employment we had a configuration where dom0 ran
exclusively on a dedicated set of physical cpus. We experienced
scalability problems when doing I/O performance tests: with a decent
number of dom0 cpus we achieved throughput of 700 MB/s with only 20%
cpu load in dom0. A higher dom0 cpu count let the throughput drop to
about 150 MB/s and cpu load was up to 100%. Reason was the additional
load due to hypervisor interactions on a high frequency lock.

So in special configurations at least for Xen the knob is useful for
production environment.


It may be that the original patch was just to keep consistency between 
Xen and KVM, and also only for testing purposes.
But we found a case where a customer of ours is running some workloads 
with a 1:1 mapping between physical and virtual cores, and we 
realized that with PV spinlocks disabled there is a 4-5% 
performance gain.


A perf analysis showed that the application was very lock intensive with 
a lot of time spent in __raw_callee_save___pv_queued_spin_unlock.




Re: [RFC PATCH 2/3] hugetlb: convert page_huge_active() to HPageMigratable flag

2021-01-15 Thread Oscar Salvador
On Fri, Jan 15, 2021 at 09:43:36AM -0800, Mike Kravetz wrote:
> > Before the page_huge_active() in scan_movable_pages() we have the
> > if (!PageHuge(page)) check, but could it be that between that check and
> > the page_huge_active(), the page gets dissolved, and so we are checking
> > a wrong page[1]? Am I making sense? 
> 
> Yes, you are making sense.
> 
> The reason I decided to drop the check is because it does not eliminate the
> race.  Even with that check in page_huge_active, the page could be dissolved
> between that check and check of page[1].  There really is no way to eliminate
> the race without holding a reference to the page (or hugetlb_lock).  That
> check in page_huge_active just shortens the race window.

Yeah, you are right, the race already exists.
Anyway, do_migrate_range() should take care of validating what it is
handling, so I think we are good.

-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH 2/3] hugetlb: convert page_huge_active() to HPageMigratable flag

2021-01-15 Thread Oscar Salvador
On Fri, Jan 15, 2021 at 12:05:29PM -0800, Mike Kravetz wrote:
> I went back and took a closer look.  Migration is the reason the existing
> page_huge_active interfaces were introduced.  And, the only use of the
> page_huge_active check is to determine if a page can be migrated.  So,
> I think 'Migratable' may be the most suitable name.

Ok, I did not know that. Let us stick with 'Migratable' then.

> To address the concern about not all hugetlb sizes are migratable, we can
> just make a check before setting the flag.  This should even help in the
> migration/offline paths as we will know sooner if the page can be
> migrated or not.

This sounds like a good idea to me.
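
Something along these lines is what I have in mind (just a sketch on top of
the new flag macros, reusing the existing hugepage_migration_supported() and
page_hstate() helpers; the real check in the series may of course end up
looking different):

        /*
         * Sketch: only mark the page as migratable when the architecture
         * can actually migrate pages of this hstate's size.
         */
        static inline void SetHPageMigratableIfSupported(struct page *page)
        {
                if (hugepage_migration_supported(page_hstate(page)))
                        SetHPageMigratable(page);
        }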

> We can address naming in the 'migrating free hugetlb pages' issue when
> that code is written.

Sure, it was just a suggestion; when I thought about it, something
like 'InUse' or 'Active' made more sense to me, but your point is valid.

Sorry for the confusion.

About that alloc_contig_range topic, I would like to take a look unless
someone is already on it or about to be.

Thanks Mike for the time ;-)


-- 
Oscar Salvador
SUSE L3


Re: [PATCH V3 0/3] mm/memory_hotplug: Pre-validate the address range with platform

2021-01-19 Thread Oscar Salvador
On Tue, Jan 19, 2021 at 02:33:03PM +0100, David Hildenbrand wrote:
> Minor thing, we should make up our mind if we want to call stuff
> internally "memhp_" or "mhp". I prefer the latter, because it is shorter.

I would rather use the latter as well. I used that in [1].
MEMHP_MERGE_RESOURCE should be renamed if we agree on that.

[1] 
https://patchwork.kernel.org/project/linux-mm/cover/20201217130758.11565-1-osalva...@suse.de/


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-21 Thread Oscar Salvador
On Tue, Jan 19, 2021 at 05:30:46PM -0800, Mike Kravetz wrote:
> Use the new hugetlb page specific flag HPageMigratable to replace the
> page_huge_active interfaces.  By it's name, page_huge_active implied
> that a huge page was on the active list.  However, that is not really
> what code checking the flag wanted to know.  It really wanted to determine
> if the huge page could be migrated.  This happens when the page is actually
> added the page cache and/or task page table.  This is the reasoning behind
> the name change.
> 
> The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not
> really necessary as we KNOW the page is a hugetlb page.  Therefore, they
> are removed.
> 
> The routine page_huge_active checked for PageHeadHuge before testing the
> active bit.  This is unnecessary in the case where we hold a reference or
> lock and know it is a hugetlb head page.  page_huge_active is also called
> without holding a reference or lock (scan_movable_pages), and can race with
> code freeing the page.  The extra check in page_huge_active shortened the
> race window, but did not prevent the race.  Offline code calling
> scan_movable_pages already deals with these races, so removing the check
> is acceptable.  Add comment to racy code.
> 
> Signed-off-by: Mike Kravetz 

Reviewed-by: Oscar Salvador 

> -/*
> - * Test to determine whether the hugepage is "active/in-use" (i.e. being 
> linked
> - * to hstate->hugepage_activelist.)
> - *
> - * This function can be called for tail pages, but never returns true for 
> them.
> - */
> -bool page_huge_active(struct page *page)
> -{
> - return PageHeadHuge(page) && PagePrivate(&page[1]);

This made me think once again.
I wonder if we could ever see a scenario where page[0] is a rightful page while
page[1] is poisoned/uninitialized (poison_pages()).
A lot of things would have to happen between the two checks, so I do not see it
as possible, and as you mentioned earlier, the race is already there.

Just wanted to speak my mind.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH V3 1/3] mm/memory_hotplug: Prevalidate the address range being added with platform

2021-01-21 Thread Oscar Salvador
On Wed, Jan 20, 2021 at 02:03:45PM +0530, Anshuman Khandual wrote:
> Just to be sure, will the following change achieve what you are
> suggesting here. pagemap_range() after this change, will again
> be the same like the V1 series.

With below diff on top it looks good to me:

Reviewed-by: Oscar Salvador 

The only nit I would have is whether the declaration of arch_get_mappable_range()
should be in include/linux/memory_hotplug.h.
As you pointed out, arch_get_mappable_range() might be used by the platform
for other purposes, and you are defining it outside of CONFIG_MEMORY_HOTPLUG
anyway.
Would include/linux/memory.h be a better fit?

As I said, nothing to bikeshed about, just my thoughts.
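
Just so it is clear what I mean, the move would be something like this (a
sketch; whatever signature the series ends up defining applies, not this one):

        /* include/linux/memory.h rather than include/linux/memory_hotplug.h? */
        struct range arch_get_mappable_range(void);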

> ---
>  mm/memory_hotplug.c |  3 +--
>  mm/memremap.c   | 12 +---
>  2 files changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 46faa914aa25..10d4ec8f349c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -304,8 +304,7 @@ int __ref __add_pages(int nid, unsigned long pfn, 
> unsigned long nr_pages,
>   if (WARN_ON_ONCE(!params->pgprot.pgprot))
>   return -EINVAL;
>  
> - if(!memhp_range_allowed(PFN_PHYS(pfn), nr_pages * PAGE_SIZE, false))
> - return -E2BIG;
> + VM_BUG_ON(!memhp_range_allowed(PFN_PHYS(pfn), nr_pages * PAGE_SIZE, 
> false));
>  
>   if (altmap) {
>   /*
> diff --git a/mm/memremap.c b/mm/memremap.c
> index e15b13736f6a..26c1825756cc 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -185,6 +185,7 @@ static void dev_pagemap_percpu_release(struct percpu_ref 
> *ref)
>  static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params 
> *params,
>   int range_id, int nid)
>  {
> + const bool is_private = pgmap->type == MEMORY_DEVICE_PRIVATE;
>   struct range *range = &pgmap->ranges[range_id];
>   struct dev_pagemap *conflict_pgmap;
>   int error, is_ram;
> @@ -230,6 +231,9 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> struct mhp_params *params,
>   if (error)
>   goto err_pfn_remap;
>  
> + if (!memhp_range_allowed(range->start, range_len(range), !is_private))
> + goto err_pfn_remap;
> +
>   mem_hotplug_begin();
>  
>   /*
> @@ -243,7 +247,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> struct mhp_params *params,
>* the CPU, we do want the linear mapping and thus use
>* arch_add_memory().
>*/
> - if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> + if (is_private) {
>   error = add_pages(nid, PHYS_PFN(range->start),
>   PHYS_PFN(range_len(range)), params);
>   } else {
> @@ -253,12 +257,6 @@ static int pagemap_range(struct dev_pagemap *pgmap, 
> struct mhp_params *params,
>   goto err_kasan;
>   }
>  
> - if (!memhp_range_allowed(range->start, range_len(range), true)) 
> {
> - error = -ERANGE;
> - mem_hotplug_done();
> - goto err_add_memory;
> - }
> -
>   error = arch_add_memory(nid, range->start, range_len(range),
>   params);
>   }
> -- 
> 2.20.1
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v4 4/5] mm: Fix page reference leak in soft_offline_page()

2021-01-13 Thread Oscar Salvador

On 2021-01-14 07:18, Dan Williams wrote:

To be honest I dislike the entire flags based scheme for communicating
the fact that page reference obtained by madvise needs to be dropped
later. I'd rather pass a non-NULL 'struct page *' than redo
pfn_to_page() conversions in the leaf functions, but that's a much
larger change.


We tried to remove that flag in the past but for different reasons.
I will have another look to see what can be done.

Thanks

--
Oscar Salvador
SUSE L3


Re: [External] Re: [PATCH v12 04/13] mm/hugetlb: Free the vmemmap pages associated with each HugeTLB page

2021-01-14 Thread Oscar Salvador
On Thu, Jan 14, 2021 at 06:54:30PM +0800, Muchun Song wrote:
> I think this approach may be only suitable for generic huge page only.
> So we can implement it only for huge page.
> 
> Hi Oscar,
> 
> What's your opinion about this?

I tried something like:

static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
  unsigned long end,
  struct vmemmap_remap_walk *walk)
{
pte_t *pte;

pte = pte_offset_kernel(pmd, addr);

if (!walk->reuse_page) {
BUG_ON(pte_none(*pte));

walk->reuse_page = pte_page(*pte++);
addr = walk->remap_start;
}

for (; addr != end; addr += PAGE_SIZE, pte++) {
BUG_ON(pte_none(*pte));

walk->remap_pte(pte, addr, walk);
}
}

void vmemmap_remap_free(unsigned long start, unsigned long end,
unsigned long reuse)
{
LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
.remap_pte  = vmemmap_remap_pte,
.reuse_addr = reuse,
.remap_start = start,
.vmemmap_pages  = &vmemmap_pages,
};

BUG_ON(start != reuse + PAGE_SIZE);

vmemmap_remap_range(reuse, end, &walk);
free_vmemmap_page_list(&vmemmap_pages);
}

but it might overcomplicate things and I am not sure it is any better.
So I am fine with keeping it as is.
Should another user come along in the future, we can always revisit.
Maybe just add a little comment in vmemmap_pte_range() explaining why we
advance the address by PAGE_SIZE, and I would like to see a comment in 
vmemmap_remap_free() on why the BUG_ON is there and, more importantly, what it is checking.
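
To be more concrete, the comments I have in mind would be something along
these lines (my wording, to be adapted to the final code):

        /*
         * vmemmap_pte_range(): the page at @reuse_addr is found first in the
         * walk and is only recorded in walk->reuse_page; that is why the
         * address gets bumped by PAGE_SIZE before any remapping is done.
         */

        /*
         * vmemmap_remap_free(): the walk expects the reuse page to be mapped
         * right before @start, i.e. @start must be exactly @reuse + PAGE_SIZE.
         */
        BUG_ON(start != reuse + PAGE_SIZE);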

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 2/5] hugetlb: convert page_huge_active() to HP_Migratable flag

2021-01-16 Thread Oscar Salvador
On Sat, Jan 16, 2021 at 04:24:16AM +, Matthew Wilcox wrote:
> and name these HPG_restore_reserve and HPG_migratable
> 
> and generate the calls to hugetlb_set_page_flag etc from macros, eg:
> 
> #define TESTHPAGEFLAG(uname, lname)   \
> static __always_inline bool HPage##uname(struct page *page)   \
> { return test_bit(HPG_##lname, &page->private); }
> ...
> #define HPAGEFLAG(uname, lname)   
> \
>   TESTHPAGEFLAG(uname, lname) \
>   SETHPAGEFLAG(uname, lname)  \
>   CLEARHPAGEFLAG(uname, lname)
> 
> HPAGEFLAG(RestoreReserve, restore_reserve)
> HPAGEFLAG(Migratable, migratable)
> 
> just to mirror page-flags.h more closely.

That is on me.
I thought that, given the low number of flags, we could get away with:

hugetlb_{set,test,clear}_page_flag(page, flag)

and call it from the code.
But some of the flags need to be set/tested outside hugetlb code, so
it indeed looks nicer and more consistent to follow the page-flags.h convention.

Sorry for the noise.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-21 Thread Oscar Salvador
On Wed, Jan 20, 2021 at 01:48:05PM -0800, Mike Kravetz wrote:
> >> This comment addresses both this patch and the next one.
> >>
> >> Instead of putting the SetHPageMigratable flag spread over the
> >> allocation paths, would it make more sense to place it in
> >> alloc_huge_page before returning the page?
> >> Then we could opencode SetHPageMigratableIfSupported right there.
> > 
> > and in putback_active_hugepage.
> 
> 
> Hi Oscar,
> 
> In Muchun's series of hugetlb bug fixes, Michal asked the same question.
> 
> https://lore.kernel.org/linux-mm/7e69a55c-d501-6b42-8225-a677f09fb...@oracle.com/
> 
> The 'short answer' is that the this would allow a page to be migrated
> after allocation but before the page fault code adds it to the page
> cache or page tables.  This actually caused bugs in the past.

Oh, I see. I jumped late into that patchset so I missed some early messages.
Thanks for explaining this again.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-01-18 Thread Oscar Salvador
On Thu, Dec 17, 2020 at 02:07:55PM +0100, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
> which has performance drawbacks.
>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
> populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]   = Initialized and sent to buddy
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does the right
>  thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador 

Let us refloat this one before it sinks deeper :-)
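
To put some numbers on the split described above (my arithmetic, assuming
x86_64, 4K base pages, 64-byte struct page and a 128MB memory block):

        128MB block     = 32768 base pages
        memmap          = 32768 * 64 bytes = 2MB = 512 pages, so nr_vmemmap_pages = 512
        buddy_start_pfn = start_pfn + 512

        pfns start_pfn .. buddy_start_pfn - 1 : hold the memmap itself, PageReserved
        pfns buddy_start_pfn .. end_pfn - 1   : handed over to the buddy allocator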


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 1/5] hugetlb: use page.private for hugetlb specific page flags

2021-01-20 Thread Oscar Salvador
On Tue, Jan 19, 2021 at 05:30:45PM -0800, Mike Kravetz wrote:
> + * Macros to create test, set and clear function definitions for
> + * hugetlb specific page flags.
> + */
> +#ifdef CONFIG_HUGETLB_PAGE
> +#define TESTHPAGEFLAG(uname, flname) \
> +static inline int HPage##uname(struct page *page)\
> + { BUILD_BUG_ON(sizeof_field(struct page, private) * \
> + BITS_PER_BYTE < __NR_HPAGEFLAGS);   \
> + return test_bit(HPG_##flname, &(page->private)); }
> +
> +#define SETHPAGEFLAG(uname, flname)  \
> +static inline void SetHPage##uname(struct page *page)\
> + { set_bit(HPG_##flname, &(page->private)); }
> +
> +#define CLEARHPAGEFLAG(uname, flname)\
> +static inline void ClearHPage##uname(struct page *page)  \
> + { clear_bit(HPG_##flname, &(page->private)); }
> +#else
> +#define TESTHPAGEFLAG(uname, flname) \
> +static inline int HPage##uname(struct page *page)\
> + { BUILD_BUG_ON(sizeof_field(struct page, private) * \
> + BITS_PER_BYTE < __NR_HPAGEFLAGS);   \
> + return 0 }

You missed a ";" right there.

I might be missing something, but I do not think we need the BUILD_BUG_ON there
when CONFIG_HUGETLB_PAGE is not set?
Actually, would it make more sense to move the BUILD_BUG_ON from above to
hugetlb_init()?

Other than that, looks good to me; I think it is a great improvement
towards readability and maintainability.
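
Roughly what I mean (only a sketch, not a proper patch):

        static int __init hugetlb_init(void)
        {
                /*
                 * Do the flag-count sanity check once at init time instead
                 * of inside every generated test function.
                 */
                BUILD_BUG_ON(sizeof_field(struct page, private) *
                             BITS_PER_BYTE < __NR_HPAGEFLAGS);

                /* ... the existing hugetlb_init() body stays unchanged ... */

                return 0;
        }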

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 5/5] hugetlb: convert PageHugeFreed to HPageFreed flag

2021-01-20 Thread Oscar Salvador
On Tue, Jan 19, 2021 at 05:30:49PM -0800, Mike Kravetz wrote:
> Use new hugetlb specific HPageFreed flag to replace the
> PageHugeFreed interfaces.
> 
> Signed-off-by: Mike Kravetz 

Reviewed-by: Oscar Salvador 

> ---
>  include/linux/hugetlb.h |  3 +++
>  mm/hugetlb.c| 23 ---
>  2 files changed, 7 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ec329b9cc0fc..8fd0970cefdb 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -487,11 +487,13 @@ unsigned long hugetlb_get_unmapped_area(struct file 
> *file, unsigned long addr,
>   *   allocator.  Typically used for migration target pages when no pages
>   *   are available in the pool.  The hugetlb free page path will
>   *   immediately free pages with this flag set to the buddy allocator.
> + * HPG_freed - Set when page is on the free lists.
>   */
>  enum hugetlb_page_flags {
>   HPG_restore_reserve = 0,
>   HPG_migratable,
>   HPG_temporary,
> + HPG_freed,
>   __NR_HPAGEFLAGS,
>  };
>  
> @@ -540,6 +542,7 @@ static inline void ClearHPage##uname(struct page *page)   
> \
>  HPAGEFLAG(RestoreReserve, restore_reserve)
>  HPAGEFLAG(Migratable, migratable)
>  HPAGEFLAG(Temporary, temporary)
> +HPAGEFLAG(Freed, freed)
>  
>  #ifdef CONFIG_HUGETLB_PAGE
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0d2bfc2b6adc..d5a78aedbfda 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -79,21 +79,6 @@ DEFINE_SPINLOCK(hugetlb_lock);
>  static int num_fault_mutexes;
>  struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
>  
> -static inline bool PageHugeFreed(struct page *head)
> -{
> - return page_private(head + 4) == -1UL;
> -}
> -
> -static inline void SetPageHugeFreed(struct page *head)
> -{
> - set_page_private(head + 4, -1UL);
> -}
> -
> -static inline void ClearPageHugeFreed(struct page *head)
> -{
> - set_page_private(head + 4, 0);
> -}
> -
>  /* Forward declaration */
>  static int hugetlb_acct_memory(struct hstate *h, long delta);
>  
> @@ -1043,7 +1028,7 @@ static void enqueue_huge_page(struct hstate *h, struct 
> page *page)
>   list_move(&page->lru, &h->hugepage_freelists[nid]);
>   h->free_huge_pages++;
>   h->free_huge_pages_node[nid]++;
> - SetPageHugeFreed(page);
> + SetHPageFreed(page);
>  }
>  
>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> @@ -1060,7 +1045,7 @@ static struct page *dequeue_huge_page_node_exact(struct 
> hstate *h, int nid)
>  
>   list_move(&page->lru, &h->hugepage_activelist);
>   set_page_refcounted(page);
> - ClearPageHugeFreed(page);
> + ClearHPageFreed(page);
>   h->free_huge_pages--;
>   h->free_huge_pages_node[nid]--;
>   return page;
> @@ -1474,7 +1459,7 @@ static void prep_new_huge_page(struct hstate *h, struct 
> page *page, int nid)
>   spin_lock(&hugetlb_lock);
>   h->nr_huge_pages++;
>   h->nr_huge_pages_node[nid]++;
> - ClearPageHugeFreed(page);
> + ClearHPageFreed(page);
>   spin_unlock(&hugetlb_lock);
>  }
>  
> @@ -1747,7 +1732,7 @@ int dissolve_free_huge_page(struct page *page)
>* We should make sure that the page is already on the free list
>* when it is dissolved.
>*/
> - if (unlikely(!PageHugeFreed(head))) {
> + if (unlikely(!HPageFreed(head))) {
>   rc = -EAGAIN;
>   goto out;
>   }
> -- 
> 2.29.2
> 
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 4/5] hugetlb: convert PageHugeTemporary() to HPageTemporary flag

2021-01-20 Thread Oscar Salvador
On Tue, Jan 19, 2021 at 05:30:48PM -0800, Mike Kravetz wrote:
> Use new hugetlb specific HPageTemporary flag to replace the
> PageHugeTemporary() interfaces.
> 
> Signed-off-by: Mike Kravetz 

I would have added a brief comment explaining why it is ok to drop
the PageHuge() check in PageHugeTemporary.
AFAICS, the paths checking it already know they are handling a 
hugetlb page, but still it is better to mention it in the changelog
in case someone wonders.

Other than that looks good to me:

Reviewed-by: Oscar Salvador 

> ---
>  include/linux/hugetlb.h |  6 ++
>  mm/hugetlb.c| 36 +++-
>  2 files changed, 13 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1e17529c8b81..ec329b9cc0fc 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -483,10 +483,15 @@ unsigned long hugetlb_get_unmapped_area(struct file 
> *file, unsigned long addr,
>   * HPG_migratable  - Set after a newly allocated page is added to the page
>   *   cache and/or page tables.  Indicates the page is a candidate for
>   *   migration.
> + * HPG_temporary - - Set on a page that is temporarily allocated from the 
> buddy
> + *   allocator.  Typically used for migration target pages when no pages
> + *   are available in the pool.  The hugetlb free page path will
> + *   immediately free pages with this flag set to the buddy allocator.
>   */
>  enum hugetlb_page_flags {
>   HPG_restore_reserve = 0,
>   HPG_migratable,
> + HPG_temporary,
>   __NR_HPAGEFLAGS,
>  };
>  
> @@ -534,6 +539,7 @@ static inline void ClearHPage##uname(struct page *page)   
> \
>   */
>  HPAGEFLAG(RestoreReserve, restore_reserve)
>  HPAGEFLAG(Migratable, migratable)
> +HPAGEFLAG(Temporary, temporary)
>  
>  #ifdef CONFIG_HUGETLB_PAGE
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6e32751489e8..0d2bfc2b6adc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1353,28 +1353,6 @@ struct hstate *size_to_hstate(unsigned long size)
>   return NULL;
>  }
>  
> -/*
> - * Internal hugetlb specific page flag. Do not use outside of the hugetlb
> - * code
> - */
> -static inline bool PageHugeTemporary(struct page *page)
> -{
> - if (!PageHuge(page))
> - return false;
> -
> - return (unsigned long)page[2].mapping == -1U;
> -}
> -
> -static inline void SetPageHugeTemporary(struct page *page)
> -{
> - page[2].mapping = (void *)-1U;
> -}
> -
> -static inline void ClearPageHugeTemporary(struct page *page)
> -{
> - page[2].mapping = NULL;
> -}
> -
>  static void __free_huge_page(struct page *page)
>  {
>   /*
> @@ -1422,9 +1400,9 @@ static void __free_huge_page(struct page *page)
>   if (restore_reserve)
>   h->resv_huge_pages++;
>  
> - if (PageHugeTemporary(page)) {
> + if (HPageTemporary(page)) {
>   list_del(&page->lru);
> - ClearPageHugeTemporary(page);
> + ClearHPageTemporary(page);
>   update_and_free_page(h, page);
>   } else if (h->surplus_huge_pages_node[nid]) {
>   /* remove the page from active list */
> @@ -1863,7 +1841,7 @@ static struct page *alloc_surplus_huge_page(struct 
> hstate *h, gfp_t gfp_mask,
>* codeflow
>*/
>   if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> - SetPageHugeTemporary(page);
> + SetHPageTemporary(page);
>   spin_unlock(&hugetlb_lock);
>   put_page(page);
>   return NULL;
> @@ -1894,7 +1872,7 @@ static struct page *alloc_migrate_huge_page(struct 
> hstate *h, gfp_t gfp_mask,
>* We do not account these pages as surplus because they are only
>* temporary and will be released properly on the last reference
>*/
> - SetPageHugeTemporary(page);
> + SetHPageTemporary(page);
>  
>   return page;
>  }
> @@ -5607,12 +5585,12 @@ void move_hugetlb_state(struct page *oldpage, struct 
> page *newpage, int reason)
>* here as well otherwise the global surplus count will not match
>* the per-node's.
>*/
> - if (PageHugeTemporary(newpage)) {
> + if (HPageTemporary(newpage)) {
>   int old_nid = page_to_nid(oldpage);
>   int new_nid = page_to_nid(newpage);
>  
> - SetPageHugeTemporary(oldpage);
> - ClearPageHugeTemporary(newpage);
> + SetHPageTemporary(oldpage);
> + ClearHPageTemporary(newpage);
>  
>   spin_lock(&hugetlb_lock);
>   if (h->surplus_huge_pages_node[old_nid]) {
> -- 
> 2.29.2
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-20 Thread Oscar Salvador
On Wed, Jan 20, 2021 at 10:59:05AM +0100, Oscar Salvador wrote:
> On Tue, Jan 19, 2021 at 05:30:46PM -0800, Mike Kravetz wrote:
> > Use the new hugetlb page specific flag HPageMigratable to replace the
> > page_huge_active interfaces.  By it's name, page_huge_active implied
> > that a huge page was on the active list.  However, that is not really
> > what code checking the flag wanted to know.  It really wanted to determine
> > if the huge page could be migrated.  This happens when the page is actually
> > added the page cache and/or task page table.  This is the reasoning behind
> > the name change.
> > 
> > The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not
> > really necessary as we KNOW the page is a hugetlb page.  Therefore, they
> > are removed.
> > 
> > The routine page_huge_active checked for PageHeadHuge before testing the
> > active bit.  This is unnecessary in the case where we hold a reference or
> > lock and know it is a hugetlb head page.  page_huge_active is also called
> > without holding a reference or lock (scan_movable_pages), and can race with
> > code freeing the page.  The extra check in page_huge_active shortened the
> > race window, but did not prevent the race.  Offline code calling
> > scan_movable_pages already deals with these races, so removing the check
> > is acceptable.  Add comment to racy code.
> > 
> > Signed-off-by: Mike Kravetz 
> 
> Hi Mike,
> 
> This comment addresses both this patch and the next one.
> 
> Instead of putting the SetHPageMigratable flag spread over the
> allocation paths, would it make more sense to place it in
> alloc_huge_page before returning the page?
> Then we could opencode SetHPageMigratableIfSupported right there.

and in putback_active_hugepage.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 2/5] hugetlb: convert page_huge_active() HPageMigratable flag

2021-01-20 Thread Oscar Salvador
On Tue, Jan 19, 2021 at 05:30:46PM -0800, Mike Kravetz wrote:
> Use the new hugetlb page specific flag HPageMigratable to replace the
> page_huge_active interfaces.  By it's name, page_huge_active implied
> that a huge page was on the active list.  However, that is not really
> what code checking the flag wanted to know.  It really wanted to determine
> if the huge page could be migrated.  This happens when the page is actually
> added the page cache and/or task page table.  This is the reasoning behind
> the name change.
> 
> The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not
> really necessary as we KNOW the page is a hugetlb page.  Therefore, they
> are removed.
> 
> The routine page_huge_active checked for PageHeadHuge before testing the
> active bit.  This is unnecessary in the case where we hold a reference or
> lock and know it is a hugetlb head page.  page_huge_active is also called
> without holding a reference or lock (scan_movable_pages), and can race with
> code freeing the page.  The extra check in page_huge_active shortened the
> race window, but did not prevent the race.  Offline code calling
> scan_movable_pages already deals with these races, so removing the check
> is acceptable.  Add comment to racy code.
> 
> Signed-off-by: Mike Kravetz 

Hi Mike,

This comment addresses both this patch and the next one.

Instead of putting the SetHPageMigratable flag spread over the
allocation paths, would it make more sense to place it in
alloc_huge_page before returning the page?
Then we could opencode SetHPageMigratableIfSupported right there.

I might be missing something and this might not be possible, but if it
is, it would look cleaner and more logical to me.


-- 
Oscar Salvador
SUSE L3


Re: [PATCH V3 1/3] mm/memory_hotplug: Prevalidate the address range being added with platform

2021-01-20 Thread Oscar Salvador
On Wed, Jan 20, 2021 at 11:41:53AM +0100, David Hildenbrand wrote:
> On 20.01.21 09:33, Anshuman Khandual wrote:
> > 
> > 
> > On 1/19/21 5:51 PM, David Hildenbrand wrote:
> >> On 18.01.21 14:12, Anshuman Khandual wrote:
> >>> This introduces memhp_range_allowed() which can be called in various 
> >>> memory
> >>> hotplug paths to prevalidate the address range which is being added, with
> >>> the platform. Then memhp_range_allowed() calls memhp_get_pluggable_range()
> >>> which provides applicable address range depending on whether linear 
> >>> mapping
> >>> is required or not. For ranges that require linear mapping, it calls a new
> >>> arch callback arch_get_mappable_range() which the platform can override. 
> >>> So
> >>> the new callback, in turn provides the platform an opportunity to 
> >>> configure
> >>> acceptable memory hotplug address ranges in case there are constraints.
> >>>
> >>> This mechanism will help prevent platform specific errors deep down during
> >>> hotplug calls. This drops now redundant check_hotplug_memory_addressable()
> >>> check in __add_pages() but instead adds a VM_BUG_ON() check which would
> >>
> >> In this patch, you keep the __add_pages() checks. But as discussed, we
> >> could perform it in mm/memremap.c:pagemap_range() insted and convert it
> >> to a VM_BUG_ON().
> > 
> > Just to be sure, will the following change achieve what you are
> > suggesting here. pagemap_range() after this change, will again
> > be the same like the V1 series.
> 
> Yeah, as we used to have in v1. Maybe other reviewers (@Oscar?) have a
> different opinion.

No, I think that placing the check in pagemap_range() outside of the if-else
makes much more sense.
Actually, unless my memory fails me, that is what I suggested in v2.

I plan to have a look at the series later this week as I am fairly busy
atm.

Thanks


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v13 00/12] Free some vmemmap pages of HugeTLB page

2021-01-20 Thread Oscar Salvador
6/mm/init_64.c   |  13 +-
> >  fs/Kconfig  |  18 ++
> >  include/linux/bootmem_info.h|  65 ++
> >  include/linux/hugetlb.h |  37 
> >  include/linux/hugetlb_cgroup.h  |  15 +-
> >  include/linux/memory_hotplug.h  |  27 ---
> >  include/linux/mm.h  |   5 +
> >  mm/Makefile |   2 +
> >  mm/bootmem_info.c   | 124 +++
> >  mm/hugetlb.c| 218 +--
> >  mm/hugetlb_vmemmap.c| 278 
> > 
> >  mm/hugetlb_vmemmap.h|  45 
> >  mm/memory_hotplug.c | 116 --
> >  mm/sparse-vmemmap.c | 273 
> > +++
> >  mm/sparse.c |   1 +
> >  17 files changed, 1082 insertions(+), 172 deletions(-)
> >  create mode 100644 include/linux/bootmem_info.h
> >  create mode 100644 mm/bootmem_info.c
> >  create mode 100644 mm/hugetlb_vmemmap.c
> >  create mode 100644 mm/hugetlb_vmemmap.h
> >
> > --
> > 2.11.0
> >
> 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 1/2] x86/setup: don't remove E820_TYPE_RAM for pfn 0

2021-01-13 Thread Oscar Salvador
On Mon, Jan 11, 2021 at 09:40:16PM +0200, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The first 4Kb of memory is a BIOS owned area and to avoid its allocation
> for the kernel it was not listed in e820 tables as memory. As the result,
> pfn 0 was never recognised by the generic memory management and it is not a
> part of neither node 0 nor ZONE_DMA.

So, since it never was added to memblock.memory structs, it was not
initialized by init_unavailable_mem, right?

-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH 2/3] hugetlb: convert page_huge_active() to HPageMigratable flag

2021-01-15 Thread Oscar Salvador
On Mon, Jan 11, 2021 at 01:01:51PM -0800, Mike Kravetz wrote:
> Use the new hugetlb page specific flag to replace the page_huge_active
> interfaces.  By it's name, page_huge_active implied that a huge page
> was on the active list.  However, that is not really what code checking
> the flag wanted to know.  It really wanted to determine if the huge
> page could be migrated.  This happens when the page is actually added
> the page cache and/or task page table.  This is the reasoning behind the
> name change.
> 
> The VM_BUG_ON_PAGE() calls in the interfaces were not really necessary
> as in all case but one we KNOW the page is a hugetlb page.  Therefore,
> they are removed.  In one call to HPageMigratable() is it possible for
> the page to not be a hugetlb page due to a race.  However, the code
> making the call (scan_movable_pages) is inherently racy, and page state
> will be validated later in the migration process.
> 
> Note:  Since HPageMigratable is used outside hugetlb.c, it can not be
> static.  Therefore, a new set of hugetlb page flag macros is added for
> non-static flag functions.

Two things about this one:

I am not sure about the name of this one.
It is true that page_huge_active() was only called by memory-hotplug and all
it wanted to know was whether the page was in-use and so if it made sense
to migrate it, so I see some value in the new PageMigratable flag.

However, not all in-use hugetlb pages can be migrated, e.g.: we might have constraints
when it comes to migrating certain hugetlb sizes, right?
So setting HPageMigratable to all active hugetlb pages might be a bit 
misleading?
HPageActive maybe? (Sorry, don't have a replacement)

The other thing is that you are right that scan_movable_pages is racy, but
page_huge_active() was checking if the page had the Head flag set before
retrieving page[1].

Before the page_huge_active() in scan_movable_pages() we have the
if (!PageHuge(page)) check, but could it be that between that check and
the page_huge_active(), the page gets dissolved, and so we are checking
a wrong page[1]? Am I making sense? 


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2] mm/hugetlb: avoid unnecessary hugetlb_acct_memory() call

2021-01-15 Thread Oscar Salvador
On Fri, Jan 15, 2021 at 04:20:13AM -0500, Miaohe Lin wrote:
> When gbl_reserve is 0, hugetlb_acct_memory() will do nothing except holding
> and releasing hugetlb_lock. We should avoid this unnecessary hugetlb_lock
> lock/unlock cycle which is happening on 'most' hugetlb munmap operations by
> check delta against 0 at the beginning of hugetlb_acct_memory.
> 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Miaohe Lin 

I would avoid mentioning gbl_reserve as not all callers use it, and focus
on what delta means:

"When reservation accounting remains unchanged..", but anyway:

Reviewed-by: Oscar Salvador 
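
FWIW, the change effectively boils down to something like this (my sketch,
not the exact diff):

        static int hugetlb_acct_memory(struct hstate *h, long delta)
        {
                int ret = 0;

                /* Nothing to account: skip the hugetlb_lock round trip. */
                if (!delta)
                        return 0;

                spin_lock(&hugetlb_lock);
                /* ... existing accounting logic unchanged, may set ret ... */
                spin_unlock(&hugetlb_lock);

                return ret;
        }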


-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH 3/3] hugetlb: convert PageHugeTemporary() to HPageTempSurplus

2021-01-15 Thread Oscar Salvador
On Mon, Jan 11, 2021 at 01:01:52PM -0800, Mike Kravetz wrote:
> Use new hugetlb specific flag HPageTempSurplus to replace the
> PageHugeTemporary() interfaces.
> 
> Signed-off-by: Mike Kravetz 
> ---
>  mm/hugetlb.c | 38 +-
>  1 file changed, 9 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 34ce82f4823c..949e1f987319 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -58,6 +58,7 @@ static unsigned long hugetlb_cma_size __initdata;
>  enum htlb_page_flags {
>   HPAGE_RestoreReserve = 0,
>   HPAGE_Migratable,
> + HPAGE_TempSurplus,
>  };
>  
>  /*
> @@ -99,6 +100,7 @@ void ClearHPage##flname(struct page *page) \
>  
>  HPAGEFLAG(RestoreReserve)
>  EXT_HPAGEFLAG(Migratable)
> +HPAGEFLAG(TempSurplus)

Would HPAGE_Temporary/Temporary not be a better fit?
The point about the current PageHugeTemporary is that these pages are temporary as
they do not belong to the pool and will vanish once the last reference gets 
dropped.
We do not really care that much about the surplus part?

Besides, alloc_migrate_huge_page seems to not want to treat these pages as 
surplus.

Also, I would add a comment either next to each flag or above
the enum htlb_page_flags (probably the latter) with a brief explanation
of each flag.

Besides that, it looks fine to me.
Here I do not see the same problem with
stripping the PageHuge check from PageHugeTemporary as I did in the previous patch,
because all callers of it make sure they operate on a hugetlb page.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 3/5] mm: hugetlb: fix a race between freeing and dissolving the page

2021-01-15 Thread Oscar Salvador
On Fri, Jan 15, 2021 at 08:49:40PM +0800, Muchun Song wrote:
> There is a race condition between __free_huge_page()
> and dissolve_free_huge_page().
> 
> CPU0: CPU1:
> 
> // page_count(page) == 1
> put_page(page)
>   __free_huge_page(page)
>   dissolve_free_huge_page(page)
> spin_lock(&hugetlb_lock)
> // PageHuge(page) && !page_count(page)
> update_and_free_page(page)
> // page is freed to the buddy
> spin_unlock(&hugetlb_lock)
> spin_lock(&hugetlb_lock)
> clear_page_huge_active(page)
> enqueue_huge_page(page)
> // It is wrong, the page is already freed
> spin_unlock(&hugetlb_lock)
> 
> The race windows is between put_page() and dissolve_free_huge_page().
> 
> We should make sure that the page is already on the free list
> when it is dissolved.
> 
> As a result __free_huge_page would corrupt page(s) already in the buddy
> allocator.
> 
> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle 
> hugepage")
> Signed-off-by: Muchun Song 
> Reviewed-by: Mike Kravetz 
> Cc: sta...@vger.kernel.org

LGTM,

Reviewed-by: Oscar Salvador 


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 4/5] mm: hugetlb: fix a race between isolating and freeing page

2021-01-15 Thread Oscar Salvador
On Fri, Jan 15, 2021 at 08:49:41PM +0800, Muchun Song wrote:
> There is a race between isolate_huge_page() and __free_huge_page().
> 
> CPU0:   CPU1:
> 
> if (PageHuge(page))
> put_page(page)
>   __free_huge_page(page)
>   spin_lock(&hugetlb_lock)
>   update_and_free_page(page)
> 
> set_compound_page_dtor(page,
>   NULL_COMPOUND_DTOR)
>   spin_unlock(&hugetlb_lock)
>   isolate_huge_page(page)
> // trigger BUG_ON
> VM_BUG_ON_PAGE(!PageHead(page), page)
> spin_lock(&hugetlb_lock)
> page_huge_active(page)
>   // trigger BUG_ON
>   VM_BUG_ON_PAGE(!PageHuge(page), page)
> spin_unlock(&hugetlb_lock)
> 
> When we isolate a HugeTLB page on CPU0. Meanwhile, we free it to the
> buddy allocator on CPU1. Then, we can trigger a BUG_ON on CPU0. Because
> it is already freed to the buddy allocator.
> 
> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle 
> hugepage")
> Signed-off-by: Muchun Song 
> Reviewed-by: Mike Kravetz 
> Acked-by: Michal Hocko 
> Cc: sta...@vger.kernel.org

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 5/5] mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active

2021-01-15 Thread Oscar Salvador
On Fri, Jan 15, 2021 at 08:49:42PM +0800, Muchun Song wrote:
> The page_huge_active() can be called from scan_movable_pages() which
> do not hold a reference count to the HugeTLB page. So when we call
> page_huge_active() from scan_movable_pages(), the HugeTLB page can
> be freed parallel. Then we will trigger a BUG_ON which is in the
> page_huge_active() when CONFIG_DEBUG_VM is enabled. Just remove the
> VM_BUG_ON_PAGE.
> 
> Fixes: 7e1f049efb86 ("mm: hugetlb: cleanup using paeg_huge_active()")
> Signed-off-by: Muchun Song 
> Reviewed-by: Mike Kravetz 
> Acked-by: Michal Hocko 
> Cc: sta...@vger.kernel.org

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v13 03/12] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page

2021-01-23 Thread Oscar Salvador
+ vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> +

I would like to see a comment there explaining why those variables get
the value they do.

> +/**
> + * vmemmap_remap_walk - walk vmemmap page table
> + *
> + * @remap_pte:   called for each non-empty PTE (lowest-level) 
> entry.
> + * @reuse_page:  the page which is reused for the tail vmemmap 
> pages.
> + * @reuse_addr:  the virtual address of the @reuse_page page.
> + * @vmemmap_pages:   the list head of the vmemmap pages that can be freed.

Let us align the tabs there.

> +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> +   unsigned long end,
> +   struct vmemmap_remap_walk *walk)
> +{
> + pte_t *pte;
> +
> + pte = pte_offset_kernel(pmd, addr);
> +
> + /*
> +  * The reuse_page is found 'first' in table walk before we start
> +  * remapping (which is calling @walk->remap_pte).
> +  */
> + if (walk->reuse_addr == addr) {
> + BUG_ON(pte_none(*pte));

If it is found first, would not be

if (!walk->reuse_page) {
        BUG_ON(walk->reuse_addr != addr);
        ...
}

more intuitive?


> +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> + struct vmemmap_remap_walk *walk)
> +{
> + unsigned long addr = start;
> + unsigned long next;
> + pgd_t *pgd;
> +
> + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> +
> + pgd = pgd_offset_k(addr);
> + do {
> + BUG_ON(pgd_none(*pgd));
> +
> + next = pgd_addr_end(addr, end);
> + vmemmap_p4d_range(pgd, addr, next, walk);
> + } while (pgd++, addr = next, addr != end);
> +
> + /*
> +  * We do not change the mapping of the vmemmap virtual address range
> +  * [@start, @start + PAGE_SIZE) which is belong to the reuse range.
"which belongs to"

> +  * So we not need to flush the TLB.
> +  */
> + flush_tlb_kernel_range(start - PAGE_SIZE, end);

you already commented on this one.

> +/**
> + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, 
> @end)
> + *   to the page which @reuse is mapped, then free vmemmap
> + *   pages.
> + * @start:   start address of the vmemmap virtual address range.

Well, it is the start address of the range we want to remap.
Reading it made me think that it is really the __start__ address
of the vmemmap range.

> +void vmemmap_remap_free(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + LIST_HEAD(vmemmap_pages);
> + struct vmemmap_remap_walk walk = {
> + .remap_pte  = vmemmap_remap_pte,
> + .reuse_addr = reuse,
> + .vmemmap_pages  = &vmemmap_pages,
> + };
> +
> + /*
> +  * In order to make remapping routine most efficient for the huge pages,
> +  * the routine of vmemmap page table walking has the following rules
> +  * (see more details from the vmemmap_pte_range()):
> +  *
> +  * - The @reuse address is part of the range that we are walking.
> +  * - The @reuse address is the first in the complete range.
> +  *
> +  * So we need to make sure that @start and @reuse meet the above rules.

You say that "reuse" and "start" need to meet some rules, but in the
paragraph above you only seem to point out rules for "reuse"?


-- 
Oscar Salvador
SUSE L3


Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-01-25 Thread Oscar Salvador
On Mon, Jan 25, 2021 at 11:39:55AM +0100, Oscar Salvador wrote:
> > Interresting, so we automatically support differeing sizeof(struct
> > page). I guess it will be problematic in case of sizeof(struct page) !=
> > 64, because then, we might not have multiples of 2MB for the memmap of a
> > memory block.
> > 
> > IIRC, it could happen that if we add two consecutive memory blocks, that
> > the second one might reuse parts of the vmemmap residing on the first
> > memory block. If you remove the first one, you might be in trouble.
> > 
> > E.g., on x86-64
> >  
> > vmemmap_populate()->vmemmap_populate_hugepages()->vmemmap_alloc_block_buf():
> > - Populate a huge page
> > 
> > vmemmap_free()->remove_pagetable()...->remove_pmd_table():
> > - memchr_inv() will leave the hugepage populated.
> > 
> > Do we want to fence that off, maybe in mhp_supports_memmap_on_memory()?
> > Or do we somehow want to fix that? We should never populate partial huge
> > pages from an altmap ...
> > 
> > But maybe I am missing something.
> 
> No, you are not missing anything.
> 
> I think that remove_pmd_table() should be fixed.
> vmemmap_populate_hugepages() allocates PMD_SIZE chunk and marks the PMD as
> PAGE_KERNEL_LARGE, so when remove_pmd_table() sees that 1) the PMD
> is large and 2) the chunk is not aligned, the memset() should be writing
> PAGE_INUSE for a PMD chunk.
> 
> I do not really a see a reason why this should not happen.
> Actually, we will be leaving pagetables behind as we never get to free
> the PMD chunk when sizeof(struct page) is not a multiple of 2MB.
> 
> I plan to fix remove_pmd_table(), but for now let us not allow to use this 
> feature
> if the size of a struct page is not 64.
> Let us keep it simply for the time being, shall we?

My first intention was:

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..0c9756a2eb24 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1044,32 +1044,14 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, 
unsigned long end,
continue;
 
if (pmd_large(*pmd)) {
-   if (IS_ALIGNED(addr, PMD_SIZE) &&
-   IS_ALIGNED(next, PMD_SIZE)) {
-   if (!direct)
-   free_hugepage_table(pmd_page(*pmd),
-   altmap);
-
-   spin_lock(&init_mm.page_table_lock);
-   pmd_clear(pmd);
-   spin_unlock(&init_mm.page_table_lock);
-   pages++;
-   } else {
-   /* If here, we are freeing vmemmap pages. */
-   memset((void *)addr, PAGE_INUSE, next - addr);
-
-   page_addr = page_address(pmd_page(*pmd));
-   if (!memchr_inv(page_addr, PAGE_INUSE,
-   PMD_SIZE)) {
-   free_hugepage_table(pmd_page(*pmd),
-   altmap);
-
-   spin_lock(&init_mm.page_table_lock);
-   pmd_clear(pmd);
-   spin_unlock(&init_mm.page_table_lock);
-   }
-   }
+   if (!direct)
+   free_hugepage_table(pmd_page(*pmd),
+   altmap);
 
+   spin_lock(&init_mm.page_table_lock);
+   pmd_clear(pmd);
+   spin_unlock(&init_mm.page_table_lock);
+   pages++;
continue;
}

I did not try it out yet and this might explode badly, but AFAICS, a PMD size
chunk is always allocated even when the section does not span 2MB.
E.g: sizeof(struct page) = 56.

PAGES_PER_SECTION * 64 = 2097152
PAGES_PER_SECTION * 56 = 1835008

Even in the latter case, vmemmap_populate_hugepages will allocate a PMD size 
chunk
to satisfy the unaligned range.
So, unless I am missing something, it should not be a problem to free that 2MB 
in
remove_pmd_table when 1) the PMD is large and 2) the range is not aligned.
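
And the fence-off David mentioned could then be as simple as something like
this (a rough sketch with a made-up helper name; the real check in the series
will likely look different):

        /*
         * Sketch: only allow memmap_on_memory when the memmap of a memory
         * block covers full PMDs, i.e. sizeof(struct page) == 64 on x86_64.
         */
        static bool mhp_memmap_is_pmd_aligned(void)
        {
                unsigned long vmemmap_size;

                vmemmap_size = (memory_block_size_bytes() / PAGE_SIZE) *
                                sizeof(struct page);

                return IS_ALIGNED(vmemmap_size, PMD_SIZE);
        }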


-- 
Oscar Salvador
SUSE L3


Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-01-25 Thread Oscar Salvador
On Mon, Jan 25, 2021 at 12:02:53PM +0100, David Hildenbrand wrote:
> Assume you have two consecutive memory blocks with 56 sizeof(struct page).
> The first one allocates a PMD (2097152) but only consumes 1835008, the second
> one reuses the remaining part and allocates another PMD (1835008),
> only using parts of it.
> 
> Ripping out a memory block, along with the PMD in the vmemmap would
> remove parts of the vmemmap of another memory block.

Bleh, yeah, I was confused, you are right.

> You might want to take a look at:

Thanks a lot for the hints, I will have a look ;-)


-- 
Oscar Salvador
SUSE L3


Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-01-25 Thread Oscar Salvador
On Mon, Jan 25, 2021 at 11:57:20AM +0100, David Hildenbrand wrote:
> I'm confused.
> 
> 1. Assume we hotplug memory, online it to ZONE_MOVABLE. The vmemmap gets
> allocated from altmap space.

The vmemmap could never have been allocated from the altmap when the hugepage
vmemmap feature is enabled.

Have a look at [1].
If is_hugetlb_free_vmemmap_enabled(), vmemmap_populate() ends up calling
vmemmap_populate_basepages().

And since no memory was consumed from the altmap (altmap_alloc_block_buf()
was never called), vmem_altmap->alloc will be 0, and 
memory_block->nr_vmemmap_pages
will be 0 as well.

But on second thought, it is true that we will get in trouble if the hugepage
vmemmap feature ever gets to work with vmemmap_populate_hugepages().
I will queue that up to look at in the near future.

[1] 
https://patchwork.kernel.org/project/linux-mm/patch/20210117151053.24600-10-songmuc...@bytedance.com/


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 2/6] mm: Teach pfn_to_online_page() to consider subsection validity

2021-01-13 Thread Oscar Salvador
On Tue, Jan 12, 2021 at 11:35:27PM -0800, Dan Williams wrote:
> pfn_section_valid() determines pfn validity on subsection granularity
> where pfn_valid() may be limited to coarse section granularity.
> Explicitly validate subsections after pfn_valid() succeeds.
> 
> Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
> Cc: Qian Cai 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Reported-by: David Hildenbrand 
> Signed-off-by: Dan Williams 

Reviewed-by: Oscar Salvador 

> ---
>  mm/memory_hotplug.c |   24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 55a69d4396e7..9f37f8a68da4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,11 +308,27 @@ static int check_hotplug_memory_addressable(unsigned 
> long pfn,
>  struct page *pfn_to_online_page(unsigned long pfn)
>  {
>   unsigned long nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + if (nr >= NR_MEM_SECTIONS)
> + return NULL;
> +
> + ms = __nr_to_section(nr);
> + if (!online_section(ms))
> + return NULL;
> +
> + /*
> +  * Save some code text when online_section() +
> +  * pfn_section_valid() are sufficient.
> +  */
> + if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID))
> + if (!pfn_valid(pfn))
> + return NULL;
> +
> + if (!pfn_section_valid(ms, pfn))
> + return NULL;
>  
> - if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
> - pfn_valid_within(pfn))
> - return pfn_to_page(pfn);
> - return NULL;
> + return pfn_to_page(pfn);
>  }
>  EXPORT_SYMBOL_GPL(pfn_to_online_page);
>  
> 

-- 
Oscar Salvador
SUSE L3


Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-09-30 Thread Oscar Salvador
On Wed, Sep 30, 2020 at 11:30:49AM +0530, Anshuman Khandual wrote:
> - is_thp = PageTransHuge(page) && !PageHuge(page);
> - nr_subpages = thp_nr_pages(page);
> + is_thp = false;
> + is_hugetlb = false;
> + if (PageTransHuge(page)) {
> + if (PageHuge(page))
> + is_hugetlb = true;
> + else
> + is_thp = true;
> + }

Since PageHuge() only returns true for hugetlb pages, I think the following is
simpler:

if (PageHuge(page))
        is_hugetlb = true;
else if (PageTransHuge(page))
        is_thp = true;


Besides that, it looks good to me:

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-09-30 Thread Oscar Salvador
On Wed, Sep 30, 2020 at 03:32:10PM +0530, Anshuman Khandual wrote:
> Right, it would be simple. But as Mike had mentioned before PageHuge()
> check is more expensive than PageTransHuge(). This proposal just tries
> not to call PageHuge() unless the page first clears PageTransHuge(),
> saving some potential CPU cycles on normal pages.

Ah, I remember now.
I missed that, sorry.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 0/5] HWpoison: further fixes and cleanups

2020-09-16 Thread Oscar Salvador
On Wed, Sep 16, 2020 at 09:53:58AM -0400, Aristeu Rozanski wrote:
> Hi Oscar,

Thanks Aristeu,

> 
> On Wed, Sep 16, 2020 at 09:27:02AM +0200, Oscar Salvador wrote:
> > Could you please re-run the tests with the below patch applied, and
> > attached then the logs here?
> 
> here it is:
> (removed previous dump_page() calls for other pages that didn't fail)
> 
> [  109.709342] Soft offlining pfn 0x3fb526 at process virtual address 
> 0x7ffc7a18
> [  109.716969] page:f367dde5 refcount:1 mapcount:0 
> mapping: index:0x7ffc7a18 pfn:0x3fb526
> [  109.716978] anon flags: 
> 0x380008000e(referenced|uptodate|dirty|swapbacked)
> [  109.716988] raw: 00380008000e 5deadbeef100 5deadbeef122 
> c03fcd56d839
> [  109.716997] raw: 7ffc7a18  0001 
> c03fd42f5000
> [  109.717005] page dumped because: page_handle_poison
> [  109.717011] page->mem_cgroup:c03fd42f5000
> [  109.725882] page_handle_poison: hugepage_or_freepage failed
> [  109.725885] __soft_offline_page: page_handle_poison -EBUSY
> [  109.725898] page:f367dde5 refcount:3 mapcount:0 
> mapping:b43d73e6 index:0x58 pfn:0x3fb526
> [  109.725951] aops:xfs_address_space_operations [xfs] ino:49f9c5f dentry 
> name:"messages"
> [  109.725958] flags: 0x382008(dirty|private)
> [  109.725963] raw: 00382008 5deadbeef100 5deadbeef122 
> c03fb8b7eea8
> [  109.725969] raw: 0058 c03fdd94eb20 0003 
> c03fd3c42000
> [  109.725975] page dumped because: __soft_offline_page after migrate
> [  109.725980] page->mem_cgroup:c03fd3c42000

Can you try the other patch I posted in response to Naoya?

thanks

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 0/5] HWpoison: further fixes and cleanups

2020-09-16 Thread Oscar Salvador
On Wed, Sep 16, 2020 at 01:42:15PM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Sep 15, 2020 at 05:22:22PM -0400, Aristeu Rozanski wrote:
 
> I reproduced the similar -EBUSY with small average x86 VM, where it seems to 
> me
> a race between page_take_off_buddy() and page allocation.  Oscar's debug patch
> shows the following kernel messages:
> 
> [  627.357009] Soft offlining pfn 0x235018 at process virtual address 
> 0x7fd11214
> [  627.358747] __get_any_page: 0x235018 free buddy page
> [  627.359875] page:038b52c9 refcount:0 mapcount:-128 
> mapping: index:0x1 pfn:0x235018
> [  627.362002] flags: 0x57ffe0()
> [  627.362841] raw: 0057ffe0 f84648d12688 955abffd1dd0 
> 
> [  627.364555] raw: 0001  ff7f 
> 
> [  627.366258] page dumped because: page_handle_poison
> [  627.367357] page->mem_cgroup:9559b6912000
> [  627.368342] page_handle_poison: hugepage_or_freepage failed\xb8n
> [  627.368344] soft_offline_free_page: page_handle_poison -EBUSY
> [  627.370901] page:038b52c9 refcount:6 mapcount:3 
> mapping:1226bf89 index:0x2710 pfn:0x235018
> [  627.373048] aops:ext4_da_aops ino:c63f3 dentry name:"system.journal"
> [  627.374526] flags: 0x57ffe0201c(uptodate|dirty|lru|private)
> [  627.375865] raw: 0057ffe0201c f84648d300c8 955ab8c3f020 
> 955aba5f4ee0
> [  627.377586] raw: 2710 9559b811fc98 00050002 
> 9559b6912000
> [  627.379308] page dumped because: soft_offline_free_page
> [  627.380480] page->mem_cgroup:9559b6912000
> 
> CPU 0                                       CPU 1
> 
> get_any_page // returns 0 (free buddy path)
>   soft_offline_free_page
>                                              the page is allocated
>     page_handle_poison -> fail
>       return -EBUSY
> 
> I'm still not sure why this issue is invisible before rework patch,
> but setting migrate type to MIGRATE_ISOLATE during offlining could affect
> the behavior sensitively.

Well, this is very timing dependent.
AFAICS, before the rework patchset, we could still race with an allocation
as the page could have been allocated between the get_any_page()
and the call to set_hwpoison_free_buddy_page() which takes the zone->lock
to prevent that.

Maybe we just happen to take longer now to reach take_page_off_buddy, so the
race window is bigger.

AFAICS, this has nothing to do with MIGRATE_ISOLATE, because here we are
dealing with pages that are already free (part of the buddy system).

The only thing that comes to my mind right off the bat is to do
a "retry" in soft_offline_page in case soft_offline_free_page returns -EBUSY,
so we can call get_any_page again and try to handle the new type of page.
Something like (untested):

@@ -1923,6 +1977,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 {
int ret;
struct page *page;
+   bool try_again = true;
 
if (!pfn_valid(pfn))
return -ENXIO;
@@ -1938,6 +1993,7 @@ int soft_offline_page(unsigned long pfn, int flags)
return 0;
}
 
+retry:
get_online_mems();
ret = get_any_page(page, pfn, flags);
put_online_mems();
@@ -1945,7 +2001,10 @@ int soft_offline_page(unsigned long pfn, int flags)
if (ret > 0)
ret = soft_offline_in_use_page(page);
else if (ret == 0)
-   ret = soft_offline_free_page(page);
+   if (soft_offline_free_page(page) && try_again) {
+   try_again = false;
+   goto retry;
+   }
 
return ret;


-- 
Oscar Salvador
SUSE L3


Re: [PATCH -next] mm/madvise: Remove set but not used variable 'zone'

2020-09-17 Thread Oscar Salvador
On Thu, Sep 17, 2020 at 03:17:57PM +0800, Liu Shixin wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> mm/madvise.c: In function 'madvise_inject_error':
> mm/madvise.c:882:15: warning: unused variable 'zone' [-Wunused-variable]
> 
> After 4b63fdbe2b25 ("mm: remove the now-unnecessary mmget_still_valid()
> hack"), variable 'zone' is never used. Remove it to avoid build warning.

Hi Liu,

Andrew already fixed that in the mmotm tree.

Thanks anyway

> 
> Signed-off-by: Liu Shixin 
> ---
>  mm/madvise.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 460e19d60ba3..94b9d17331b9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -879,7 +879,6 @@ static long madvise_remove(struct vm_area_struct *vma,
>  static int madvise_inject_error(int behavior,
>   unsigned long start, unsigned long end)
>  {
> - struct zone *zone;
>   unsigned long size;
>  
>   if (!capable(CAP_SYS_ADMIN))
> -- 
> 2.25.1
> 
> 

-- 
Oscar Salvador
SUSE L3


[PATCH v4 3/7] mm,hwpoison: Try to narrow window race for free pages

2020-09-17 Thread Oscar Salvador
Aristeu Rozanski reported that a customer test case started
to report -EBUSY after the hwpoison rework patchset.

There is a race window between spotting a free page and taking it off
its buddy freelist, so it might be that by the time we try to take it off,
the page has already been allocated.

This patch handles such a race window by trying to handle the new type of
page again if the page was allocated under us.

After this patch, Aristeu said the test cases work properly.

Signed-off-by: Oscar Salvador 
Reported-by: Aristeu Rozanski 
---
 mm/memory-failure.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index db61bdee9734..a2ccd3ba4015 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1917,6 +1917,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 {
int ret;
struct page *page;
+   bool try_again = true;
 
if (!pfn_valid(pfn))
return -ENXIO;
@@ -1932,6 +1933,7 @@ int soft_offline_page(unsigned long pfn, int flags)
return 0;
}
 
+retry:
get_online_mems();
ret = get_any_page(page, pfn, flags);
put_online_mems();
@@ -1939,7 +1941,10 @@ int soft_offline_page(unsigned long pfn, int flags)
if (ret > 0)
ret = soft_offline_in_use_page(page);
else if (ret == 0)
-   ret = soft_offline_free_page(page);
+   if (soft_offline_free_page(page) && try_again) {
+   try_again = false;
+   goto retry;
+   }
 
return ret;
 }
-- 
2.26.2



[PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists

2020-09-17 Thread Oscar Salvador
The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that act as a gatekeeper for poisoned pages.  Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.

As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy systems, checks in place or not.

Let us fix this the same way we did for soft_offline [2], and take the page
off the buddy freelist, so it is completely unreachable.

Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then run some sort of memory stress
workload.

Just for reference, I put a dump_page in compaction_alloc to catch
HWPoison pages:

kernel: page:12b2982b refcount:1 mapcount:0 mapping: 
index:0x1 pfn:0x1d5db
kernel: flags: 0xfc080(hwpoison)
kernel: raw: 000fc080 ea7573c8 c9857de0 
kernel: raw: 0001  0001 
kernel: page dumped because: compaction_alloc

kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: GE 
5.9.0-rc2-mm1-1-default+ #5
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
kernel: Call Trace:
kernel:  dump_stack+0x6d/0x8b
kernel:  compaction_alloc+0xb2/0xc0
kernel:  migrate_pages+0x2a6/0x12a0
kernel:  ? isolate_freepages+0xc80/0xc80
kernel:  ? __ClearPageMovable+0xb0/0xb0
kernel:  compact_zone+0x5eb/0x11c0
kernel:  ? finish_task_switch+0x74/0x300
kernel:  ? lock_timer_base+0xa8/0x170
kernel:  proactive_compact_node+0x89/0xf0
kernel:  ? kcompactd+0x2d0/0x3a0
kernel:  kcompactd+0x2d0/0x3a0
kernel:  ? finish_wait+0x80/0x80
kernel:  ? kcompactd_do_work+0x350/0x350
kernel:  kthread+0x118/0x130
kernel:  ? kthread_associate_blkcg+0xa0/0xa0
kernel:  ret_from_fork+0x22/0x30

After that, if e.g: someone faults in the page, that someone will get
killed unexpectedly.

While we are at it, I also changed the action result for such cases.
I think that MF_DELAYED is a bit misleading, because in case we could
contain the page and take it off the buddy, such a page is not to be used
again unless it is unpoisoned, so we fixed the situation.

So unless I am missing something, I strongly think that we should report
MF_RECOVERED.

[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/patch/11694847/

Signed-off-by: Oscar Salvador 
---
 mm/memory-failure.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4eb3c42ffe35..74a53881f94b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1323,8 +1323,9 @@ int memory_failure(unsigned long pfn, int flags)
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
-   int res;
unsigned long page_flags;
+   int res = 0;
+   bool retry = true;
 
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1364,10 +1365,21 @@ int memory_failure(unsigned long pfn, int flags)
 * In fact it's dangerous to directly bump up page count from 0,
 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 */
+try_again:
if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
if (is_free_buddy_page(p)) {
-   action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
-   return 0;
+   if (take_page_off_buddy(p)) {
+   action_result(pfn, MF_MSG_BUDDY, MF_RECOVERED);
+   return 0;
+   } else {
+   /* We lost the race, try again */
+   if (retry) {
+   retry = false;
+   goto try_again;
+   }
+   action_result(pfn, MF_MSG_BUDDY, MF_FAILED);
+   return -EBUSY;
+   }
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, 
MF_IGNORED);
return -EBUSY;
@@ -1393,11 +1405,15 @@ int memory_failure(unsigned long pfn, int flags)
shake_page(p, 0);
/* shake_page could have turned it free. */
if (!PageLRU(p) && is_free_buddy_page(p)) {
+   if (!take_page_off_buddy(p))
+   res = -EBUSY;
+
if (flags & MF_COUNT_INCREASED)
-   action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+   ac

[PATCH v4 0/7] HWpoison: further fixes and cleanups

2020-09-17 Thread Oscar Salvador
This patchset includes some fixups (patch#1,patch#2 and patch#3)
and some cleanups (patch#4-7).

Patch#1 is a fix to take HWPoison pages off a buddy freelist, since
otherwise we can end up having HWPoison pages back in the game without
anyone noticing it.
So fix it (we did that already for soft_offline_page [1]).

Patch#2 fixes a rebasing problem that made the call
to page_handle_poison from __soft_offline_page set the
wrong value for hugepage_or_freepage. [2]

Patch#3 is not really a fixup, but tries to re-handle a page
in case it was allocated under us.

[1] https://patchwork.kernel.org/cover/11704083/
[2] https://patchwork.kernel.org/comment/23619775/

Thanks

Oscar Salvador (7):
  mm,hwpoison: take free pages off the buddy freelists
  mm,hwpoison: Do not set hugepage_or_freepage unconditionally
  mm,hwpoison: Try to narrow window race for free pages
  mm,hwpoison: refactor madvise_inject_error
  mm,hwpoison: drain pcplists before bailing out for non-buddy
zero-refcount page
  mm,hwpoison: drop unneeded pcplist draining
  mm,hwpoison: remove stale code

 mm/madvise.c| 36 +++
 mm/memory-failure.c | 59 +
 2 files changed, 58 insertions(+), 37 deletions(-)

-- 
2.26.2



[PATCH v4 2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally

2020-09-17 Thread Oscar Salvador
Aristeu Rozanski reported that some hwpoison tests
were returning -EBUSY, while before the rework they succeeded [1] [2].

The root cause is that during a rebase, the call to page_handle_poison was
setting hugepage_or_freepage = true unconditionally from __soft_offline_page.

Aristeu said that after this fix, everything works.

[1] https://patchwork.kernel.org/comment/23617301/
[2] https://patchwork.kernel.org/comment/23619535/

Signed-off-by: Oscar Salvador 
Reported-by: Aristeu Rozanski 
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 74a53881f94b..db61bdee9734 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1852,7 +1852,7 @@ static int __soft_offline_page(struct page *page)
if (!ret) {
bool release = !huge;
 
-   if (!page_handle_poison(page, true, release))
+   if (!page_handle_poison(page, huge, release))
ret = -EBUSY;
} else {
if (!list_empty(&pagelist))
-- 
2.26.2



[PATCH v4 6/7] mm,hwpoison: drop unneeded pcplist draining

2020-09-17 Thread Oscar Salvador
The memory_failure and soft_offline paths now drain pcplists by calling
get_hwpoison_page.

memory_failure flags the page as HWPoison beforehand, so that page can no
longer go into a pcplist, and soft_offline_page only flags a page as
HWPoison if 1) we took the page off a buddy freelist, 2) the page was
in-use and we migrated it, or 3) it was a clean pagecache page.

Because of that, a page can no longer be both poisoned and sitting in a pcplist.

Signed-off-by: Oscar Salvador 
---
 mm/madvise.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4a48f7215195..302f3a84d17c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -914,10 +914,6 @@ static int madvise_inject_error(int behavior,
return ret;
}
 
-   /* Ensure that all poisoned pages are removed from per-cpu lists */
-   for_each_populated_zone(zone)
-   drain_all_pages(zone);
-
return 0;
 }
 #endif
-- 
2.26.2



[PATCH v4 7/7] mm,hwpoison: remove stale code

2020-09-17 Thread Oscar Salvador
Currently we call shake_page and then check whether the page is Buddy
because shake_page calls drain_all_pages, which sends pcp-pages back to
the buddy freelists, so we could have a chance to handle free pages.

get_hwpoison_page already calls drain_all_pages, and we do call
get_hwpoison_page right before coming here, so we should be on the safe
side.

Signed-off-by: Oscar Salvador 
---
 mm/memory-failure.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7fba4ba201d5..f68cb5e3b320 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1423,18 +1423,6 @@ int memory_failure(unsigned long pfn, int flags)
 * walked by the page reclaim code, however that's not a big loss.
 */
shake_page(p, 0);
-   /* shake_page could have turned it free. */
-   if (!PageLRU(p) && is_free_buddy_page(p)) {
-   if (!take_page_off_buddy(p))
-   res = -EBUSY;
-
-   if (flags & MF_COUNT_INCREASED)
-   action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : 
MF_RECOVERED);
-   else
-   action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : 
MF_RECOVERED);
-
-   return res;
-   }
 
lock_page(p);
 
-- 
2.26.2



[PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page

2020-09-17 Thread Oscar Salvador
A page with 0-refcount and !PageBuddy could perfectly well be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we handle pcppages neither from the hard-offline nor from the
soft-offline path.

Fix this by draining pcplists whenever we find this kind of page and retry
the check again.  It might be that pcplists have been spilled into the
buddy allocator and so we can handle it.

Signed-off-by: Oscar Salvador 
---
 mm/memory-failure.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a2ccd3ba4015..7fba4ba201d5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -950,13 +950,13 @@ static int page_action(struct page_state *ps, struct page 
*p,
 }
 
 /**
- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
  * @page:  raw error page (hit by memory error)
  *
  * Return: return 0 if failed to grab the refcount, otherwise true (some
  * non-zero value.)
  */
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
 {
struct page *head = compound_head(page);
 
@@ -986,6 +986,26 @@ static int get_hwpoison_page(struct page *page)
return 0;
 }
 
+static int get_hwpoison_page(struct page *p)
+{
+   int ret;
+   bool drained = false;
+
+retry:
+   ret = __get_hwpoison_page(p);
+   if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
+   /*
+* The page might be in a pcplist, so try to drain those
+* and see if we are lucky.
+*/
+   drain_all_pages(page_zone(p));
+   drained = true;
+   goto retry;
+   }
+
+   return ret;
+}
+
 /*
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
-- 
2.26.2



[PATCH v4 4/7] mm,hwpoison: refactor madvise_inject_error

2020-09-17 Thread Oscar Salvador
Make a proper if-else condition for {hard,soft}-offline.

[akpm: remove zone variable and refactor comment]
Signed-off-by: Oscar Salvador 
---
 mm/madvise.c | 32 ++--
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d87b31a1576..4a48f7215195 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -869,8 +869,6 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
 {
-   struct page *page;
-   struct zone *zone;
unsigned long size;
 
if (!capable(CAP_SYS_ADMIN))
@@ -879,6 +877,7 @@ static int madvise_inject_error(int behavior,
 
for (; start < end; start += size) {
unsigned long pfn;
+   struct page *page;
int ret;
 
ret = get_user_pages_fast(start, 1, 0, &page);
@@ -895,25 +894,22 @@ static int madvise_inject_error(int behavior,
 
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual 
address %#lx\n",
-   pfn, start);
-
+pfn, start);
ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
-   if (ret)
-   return ret;
-   continue;
+   } else {
+   pr_info("Injecting memory failure for pfn %#lx at 
process virtual address %#lx\n",
+pfn, start);
+   /*
+* Drop the page reference taken by
+* get_user_pages_fast(). In the absence of
+* MF_COUNT_INCREASED the memory_failure() routine is
+* responsible for pinning the page to prevent it
+* from being released back to the page allocator.
+*/
+   put_page(page);
+   ret = memory_failure(pfn, 0);
}
 
-   pr_info("Injecting memory failure for pfn %#lx at process 
virtual address %#lx\n",
-   pfn, start);
-
-   /*
-* Drop the page reference taken by get_user_pages_fast(). In
-* the absence of MF_COUNT_INCREASED the memory_failure()
-* routine is responsible for pinning the page to prevent it
-* from being released back to the page allocator.
-*/
-   put_page(page);
-   ret = memory_failure(pfn, 0);
if (ret)
return ret;
}
-- 
2.26.2



Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups

2020-09-17 Thread Oscar Salvador
On Thu, Sep 17, 2020 at 11:39:21AM +, HORIGUCHI NAOYA(堀口 直也) wrote:
> Thanks for the update.
> This patchset triggers the following BUG_ON() with Aristeu's workload:

I just took a look, but I found more oddities.
The patchset you sent seems to be a bit buggy and it is missing some things
my patchset contains, e.g:

static __always_inline bool free_pages_prepare(struct page *page,
unsigned int order, bool check_free)
{
...
if (unlikely(PageHWPoison(page)) && !order) {
...
return false;
}

Moreover, in page_handle_poison you got the ordering wrong:

static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, 
bool release)
{
if (release) {
put_page(page);
drain_all_pages(page_zone(page));
}

...
SetPageHWPoison(page);
page_ref_inc(page);

1) You are freeing the page first, which means it goes to buddy
2) Then you set it as poisoned and you update its refcount.

Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that 
is quite wrong.
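
In other words, the HWPoison flag has to be set before the page is released,
so that the free path can keep it away from the freelists, and only then can
the extra reference be taken. Roughly (just a sketch of the ordering; the
actual diff is in my follow-up below):

        SetPageHWPoison(page);                  /* flag it before it can reach buddy */

        if (release) {
                put_page(page);                 /* free_pages_prepare() sees the flag now */
                drain_all_pages(page_zone(page));
        }

        page_ref_inc(page);                     /* keep it pinned so nobody can grab it */
        num_poisoned_pages_inc();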

Honestly, I do not know how your patchset diverged so much from mine, but it
is not right.

I will go over my patchset and yours and compare/fix things.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups

2020-09-17 Thread Oscar Salvador
On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote:
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, 
> bool release)
> {
> if (release) {
> put_page(page);
> drain_all_pages(page_zone(page));
> }
> 
>   ...
> SetPageHWPoison(page);
> page_ref_inc(page);
> 
> 1) You are freeing the page first, which means it goes to buddy
> 2) Then you set it as poisoned and you update its refcount.
> 
> Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and 
> that is quite wrong.

Hi Naoya,

Ok, I tested it and with the following changes on top I cannot reproduce the 
issue:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f68cb5e3b320..4ffaaa5c2603 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = 
ATOMIC_LONG_INIT(0);
 
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, 
bool release)
 {
-   if (release) {
-   put_page(page);
-   drain_all_pages(page_zone(page));
-   }
-
if (hugepage_or_freepage) {
/*
 * Doing this check for free pages is also fine since 
dissolve_free_huge_page
@@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool 
hugepage_or_freepage, boo
}
 
SetPageHWPoison(page);
+
+   if (release) {
+           put_page(page);
+           drain_all_pages(page_zone(page));
+   }
+
page_ref_inc(page);
num_poisoned_pages_inc();
return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..8a6ab05f074c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct 
page *page,
 
trace_mm_page_free(page, order);
 
+   if (unlikely(PageHWPoison(page)) && !order) {
+   /*
+* Untie memcg state and reset page's owner
+*/
+   if (memcg_kmem_enabled() && PageKmemcg(page))
+   __memcg_kmem_uncharge_page(page, order);
+   reset_page_owner(page, order);
+   return false;
+   }
+
/*
 * Check tail pages before head page information is cleared to
 * avoid checking PageCompound for order-0 pages.


# sh tmp_run_ksm_madv.sh 
p1 0x7f6b6b08e000
p2 0x7f6b529ee000
madvise(p1) 0
madvise(p2) 0
writing p1 ... done
writing p2 ... done
soft offline
soft offline returns 0
OK


Can you try to re-run your tests and see if they come clean?
If they do, I will try to see if Andrew can squeeze the above changes into [1],
where they belong.

Otherwise I will craft a v5 containing all fixups (pretty unfortunate).

[1] https://patchwork.kernel.org/patch/11704099/

-- 
Oscar Salvador
SUSE L3


Re: [RFC 1/5] mm, page_alloc: clean up pageset high and batch update

2020-09-10 Thread Oscar Salvador
On Mon, Sep 07, 2020 at 06:36:24PM +0200, Vlastimil Babka wrote:
> Signed-off-by: Vlastimil Babka 

>   for_each_possible_cpu(cpu)
> - setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> + setup_pageset(&per_cpu(boot_pageset, cpu));

This is not really anything important, but I realized we have like 7 functions
messing with pcp lists, and every time I try to follow them my head spins.

Since setup_pageset is only being called here, could we replace it with
pageset_init and pageset_update?

(As I said, not important and probably a matter of taste. I just think that
having so many mini functions around is not always cool,
e.g: setup_zone_pageset->zone_pageset_init)

> -/*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> - */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> - unsigned long high)
> -{
> - unsigned long batch = max(1UL, high / 4);
> - if ((high / 4) > (PAGE_SHIFT * 8))
> - batch = PAGE_SHIFT * 8;
> -
> - pageset_update(&p->pcp, high, batch);
> + pageset_update(&p->pcp, 0, 1);
>  }

Could we restore the comment we had in pageset_set_high, and maybe
update it to match this new function? I think it would be useful.
>  
>  static void pageset_set_high_and_batch(struct zone *zone,
> -struct per_cpu_pageset *pcp)
> +struct per_cpu_pageset *p)
>  {
> - if (percpu_pagelist_fraction)
> - pageset_set_high(pcp,
> - (zone_managed_pages(zone) /
> - percpu_pagelist_fraction));
> - else
> - pageset_set_batch(pcp, zone_batchsize(zone));
> + unsigned long new_high;
> + unsigned long new_batch;
> + int fraction = READ_ONCE(percpu_pagelist_fraction);

Why the READ_ONCE? In case there is a parallel update, so things do not get
messed up?
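
My guess at the intent (just a sketch of my reading, not necessarily yours):

        unsigned long new_high;
        int fraction = READ_ONCE(percpu_pagelist_fraction);

        /*
         * Sketch: a single load means the zero-check and the division below
         * cannot disagree, even if userspace rewrites the sysctl in between.
         */
        if (fraction)
                new_high = zone_managed_pages(zone) / fraction;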

As I said, I'd appreciate it if the comment in pageset_set_high_and_batch were
restored and updated; otherwise:

Reviewed-by: Oscar Salvador  

Thanks

-- 
Oscar Salvador
SUSE L3


Re: [RFC 1/5] mm, page_alloc: clean up pageset high and batch update

2020-09-10 Thread Oscar Salvador
On Thu, Sep 10, 2020 at 10:31:20AM +0200, Oscar Salvador wrote:
> On Mon, Sep 07, 2020 at 06:36:24PM +0200, Vlastimil Babka wrote:
> > Signed-off-by: Vlastimil Babka 
> 
> > for_each_possible_cpu(cpu)
> > -   setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> > +   setup_pageset(&per_cpu(boot_pageset, cpu));
> 
> This is not really anything important but I realized we have like 7 functions
> messing with pcp lists, and everytime I try to follow them my head spins.
> 
> Since setup_pageset is only being called here, could we replace it by the
> pageset_init and pageset_update?
> 
> (As I said, not important and probably a matter of taste. I just think that
> having so many mini functions around is not always cool,
> e.g: setup_zone_pageset->zone_pageset_init)

Sorry, I did not see that you just did that in Patch#3, bleh.


-- 
Oscar Salvador
SUSE L3


Re: [RFC 2/5] mm, page_alloc: calculate pageset high and batch once per zone

2020-09-10 Thread Oscar Salvador
On Mon, Sep 07, 2020 at 06:36:25PM +0200, Vlastimil Babka wrote:
> We currently call pageset_set_high_and_batch() for each possible cpu,
> which repeats the same calculations of high and batch values.
> 
> Instead call it once per zone, and it applies the calculated values
> to all per-cpu pagesets of the zone.
> 
> This also allows removing zone_pageset_init() and __zone_pcp_update() 
> wrappers.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka 

I like this, it simplifies things.

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [RFC 3/5] mm, page_alloc(): remove setup_pageset()

2020-09-10 Thread Oscar Salvador
On Mon, Sep 07, 2020 at 06:36:26PM +0200, Vlastimil Babka wrote:
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
> 
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to specific
> values.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Oscar Salvador 

Just one question below:

> -static void setup_pageset(struct per_cpu_pageset *p)
> -{
> - pageset_init(p);
> - pageset_update(&p->pcp, 0, 1);
> + /*
> +  * Set batch and high values safe for a boot pageset. Proper pageset's
> +  * initialization will update them.
> +  */
> + pcp->high = 0;
> + pcp->batch  = 1;

pageset_update was manipulating these values with barriers in between.
I guess we do not care here because we are not really updating but
initializing them, right?
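
For reference, this is roughly how I remember the update path working, and why
I think plain assignments are fine at init time (illustrative sketch from
memory, not the exact kernel code):

        /*
         * Live pageset: other CPUs may be freeing pages concurrently, so
         * publish a safe intermediate state before the final values.
         */
        pcp->batch = 1;                 /* smallest batch is always safe */
        smp_wmb();
        pcp->high = new_high;
        smp_wmb();
        pcp->batch = new_batch;

        /*
         * Boot pageset: nobody else can see it yet, so
         * pcp->high = 0; pcp->batch = 1; without barriers is enough.
         */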

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 01/10] mm/memory_hotplug: inline __offline_pages() into offline_pages()

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:48PM +0200, David Hildenbrand wrote:
> There is only a single user, offline_pages(). Let's inline, to make
> it look more similar to online_pages().
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 02/10] mm/memory_hotplug: enforce section granularity when onlining/offlining

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:49PM +0200, David Hildenbrand wrote:
> Already two people (including me) tried to offline subsections, because
> the function looks like it can deal with it. But we really can only
> online/offline full sections that are properly aligned (e.g., we can only
> mark full sections online/offline via SECTION_IS_ONLINE).
> 
> Add a simple safety net to document the restriction now. Current users
> (core and powernv/memtrace) respect these restrictions.

It's been a while since I looked at the sub-section handling stuff, so sorry to
ask this, but is it true that we can hot-{add,remove} at sub-section
granularity, while we can only online/offline at section granularity?

> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 03/10] mm/memory_hotplug: simplify page offlining

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:50PM +0200, David Hildenbrand wrote:
> We make sure that we cannot have any memory holes right at the beginning
> of offline_pages(). We no longer need walk_system_ram_range() and can
> call test_pages_isolated() and __offline_isolated_pages() directly.
> 
> offlined_pages always corresponds to nr_pages, so we can simplify that.
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 04/10] mm/page_alloc: simplify __offline_isolated_pages()

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:51PM +0200, David Hildenbrand wrote:
> offline_pages() is the only user. __offline_isolated_pages() never gets
> called with ranges that contain memory holes and we no longer care about
> the return value. Drop the return value handling and all pfn_valid()
> checks.
> 
> Update the documentation.
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 

pretty nice

Reviewed-by: Oscar Salvador 


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 05/10] mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages()

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:52PM +0200, David Hildenbrand wrote:
> We make sure that we cannot have any memory holes right at the beginning
> of offline_pages() and we only support to online/offline full sections.
> Both, sections and pageblocks are a power of two in size, and sections
> always span full pageblocks.
> 
> We can directly calculate the number of isolated pageblocks from nr_pages.
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 06/10] mm/page_isolation: simplify return value of start_isolate_page_range()

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:53PM +0200, David Hildenbrand wrote:
> Callers no longer need the number of isolated pageblocks. Let's
> simplify.
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 07/10] mm/memory_hotplug: simplify page onlining

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:54PM +0200, David Hildenbrand wrote:
> We don't allow to offline memory with holes, all boot memory is online,
> and all hotplugged memory cannot have holes.
> 
> We can now simplify onlining of pages. As we only allow to online/offline
> full sections and sections always span full MAX_ORDER_NR_PAGES, we can just
> process MAX_ORDER - 1 pages without further special handling.
> 
> The number of onlined pages simply corresponds to the number of pages we
> were requested to online.
> 
> While at it, refine the comment regarding the callback not exposing all
> pages to the buddy.
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 08/10] mm/page_alloc: drop stale pageblock comment in memmap_init_zone*()

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:55PM +0200, David Hildenbrand wrote:
> Commit ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are
> set during struct page init") moved the actual zone range check, leaving
> only the alignment check for pageblocks.
> 
> Let's drop the stale comment and make the pageblock check easier to read.
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Cc: Mel Gorman 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 09/10] mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone()

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:56PM +0200, David Hildenbrand wrote:
> On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
> un-isolate the pages after memory onlining is complete. Let's allow
> passing in the migratetype.
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: Logan Gunthorpe 
> Cc: Dan Williams 
> Cc: Mike Rapoport 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: Michel Lespinasse 
> Cc: linux-i...@vger.kernel.org
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 10/10] mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory

2020-08-24 Thread Oscar Salvador
On Wed, Aug 19, 2020 at 07:59:57PM +0200, David Hildenbrand wrote:
> Currently, it can happen that pages are allocated (and freed) via the buddy
> before we finished basic memory onlining.
> 
> For example, pages are exposed to the buddy and can be allocated before
> we actually mark the sections online. Allocated pages could suddenly
> fail pfn_to_online_page() checks. We had similar issues with pcp
> handling, when pages are allocated+freed before we reach
> zone_pcp_update() in online_pages() [1].
> 
> Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
> impossible. Once done with the heavy lifting, use
> undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
> freelist, marking them ready for allocation. Similar to offline_pages(),
> we have to manually adjust zone->nr_isolate_pageblock.
> 
> [1] 
> https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-chara...@codeaurora.org
> 
> Acked-by: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Baoquan He 
> Cc: Pankaj Gupta 
> Cc: Oscar Salvador 
> Cc: Charan Teja Reddy 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 12/12] mm,hwpoison: double-check page count in __get_any_page()

2020-08-24 Thread Oscar Salvador
On Thu, Aug 06, 2020 at 06:49:23PM +, nao.horigu...@gmail.com wrote:
> From: Naoya Horiguchi 
> 
> Soft offlining could fail with EIO due to the race condition with
> hugepage migration. This issuse became visible due to the change by
> previous patch that makes soft offline handler take page refcount
> by its own.  We have no way to directly pin zero refcount page, and
> the page considered as a zero refcount page could be allocated just
> after the first check.
> 
> This patch adds the second check to find the race and gives us
> chance to handle it more reliably.
> 
> Reported-by: Qian Cai 
> Signed-off-by: Naoya Horiguchi 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 2/3] mm: don't rely on system state to detect hot-plug operations

2020-09-14 Thread Oscar Salvador
On Mon, Sep 14, 2020 at 09:57:46AM +0200, David Hildenbrand wrote:
> >  /* register memory section under specified node if it spans that node */
> > +struct rmsun_args {
> > +   int nid;
> > +   enum memplug_context context;
> > +};

Uhmf, that is not a very descriptive name.

> Instead of handling this in register_mem_sect_under_node(), I
> think it would be better two have two separate
> register_mem_sect_under_node() implementations.
> 
> static int register_mem_sect_under_node_hotplug(struct memory_block *mem_blk,
>   void *arg)
> {
>   const int nid = *(int *)arg;
>   int ret;
> 
>   /* Hotplugged memory has no holes and belongs to a single node. */
>   mem_blk->nid = nid;
>   ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
>  &mem_blk->dev.kobj,
>  kobject_name(&mem_blk->dev.kobj));
>   if (ret)
>   return ret;
>   return sysfs_create_link_nowarn(&mem_blk->dev.kobj,
>   &node_devices[nid]->dev.kobj,
>   kobject_name(&node_devices[nid]->dev.kobj));
> 
> }
> 
> Cleaner, right? :) No unnecessary checks.

I tend to agree here; I like a simpler version for hotplug better.

> One could argue if link_mem_section_hotplug() would be better than passing 
> around the context.

I am not sure I would duplicate the code there.
We could just pass a pointer to the function we want to call to
link_mem_sections, either register_mem_sect_under_node_hotplug or
register_mem_sect_under_node_early.
Would not that be clean and clear enough?
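
Something like the below is what I have in mind (hypothetical sketch; the
callback typedef and the exact signature are made up for illustration):

/* Hypothetical: let the caller pick the per-memory-block callback. */
typedef int (*register_mem_cb_t)(struct memory_block *mem_blk, void *arg);

int link_mem_sections(int nid, unsigned long start_pfn,
                      unsigned long end_pfn, register_mem_cb_t func)
{
        return walk_memory_blocks(PFN_PHYS(start_pfn),
                                  PFN_PHYS(end_pfn - start_pfn),
                                  (void *)&nid, func);
}

/* boot path */
link_mem_sections(nid, start_pfn, end_pfn, register_mem_sect_under_node_early);
/* hotplug path */
link_mem_sections(nid, start_pfn, end_pfn, register_mem_sect_under_node_hotplug);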

-- 
Oscar Salvador
SUSE L3


Re: [PATCH] mm/madvise.c: Remove the unused variable in function madvise_inject_error

2020-09-14 Thread Oscar Salvador
On Mon, Sep 14, 2020 at 05:22:16PM +0800, Qi Liu wrote:
> Variable zone is unused in function madvise_inject_error, let's remove it.
> 
> Signed-off-by: Qi Liu 

Andrew already fixed that up in my patch.

Thanks anyway

-- 
Oscar Salvador
SUSE L3


[PATCH v3 0/5] HWpoison: further fixes and cleanups

2020-09-14 Thread Oscar Salvador
The important bit of this patchset is patch#1, which is a fix to take
HWPoison pages off a buddy freelist, since otherwise we can end up having
HWPoison pages back in the game without anyone noticing it.
So fix it (we did that already for soft_offline_page [1]).

The other patches are clean-ups and not that important, so if anything,
consider patch#1 for inclusion.

[1] https://patchwork.kernel.org/cover/11704083/

Thanks

Oscar Salvador (5):
  mm,hwpoison: take free pages off the buddy freelists
  mm,hwpoison: refactor madvise_inject_error
  mm,hwpoison: drain pcplists before bailing out for non-buddy
zero-refcount page
  mm,hwpoison: drop unneeded pcplist draining
  mm,hwpoison: remove stale code

 mm/madvise.c| 36 +---
 mm/memory-failure.c | 50 +
 2 files changed, 51 insertions(+), 35 deletions(-)

-- 
2.26.2



[PATCH v3 4/5] mm,hwpoison: drop unneeded pcplist draining

2020-09-14 Thread Oscar Salvador
The memory_failure and soft_offline paths now drain pcplists by calling
get_hwpoison_page.

memory_failure flags the page as HWPoison beforehand, so that page can no
longer go into a pcplist, and soft_offline_page only flags a page as
HWPoison if 1) we took the page off a buddy freelist, 2) the page was
in-use and we migrated it, or 3) it was a clean pagecache page.

Because of that, a page can no longer be both poisoned and sitting in a pcplist.

Link: https://lkml.kernel.org/r/20200908075626.11976-5-osalva...@suse.de
Signed-off-by: Oscar Salvador 
Cc: Michal Hocko 
Cc: Naoya Horiguchi 
Cc: Oscar Salvador 
Cc: Qian Cai 
Cc: Tony Luck 
Signed-off-by: Andrew Morton 
Signed-off-by: Stephen Rothwell 
---
 mm/madvise.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4ce66bab53dd..4bac8e85497a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -917,10 +917,6 @@ static int madvise_inject_error(int behavior,
return ret;
}
 
-   /* Ensure that all poisoned pages are removed from per-cpu lists */
-   for_each_populated_zone(zone)
-   drain_all_pages(zone);
-
return 0;
 }
 #endif
-- 
2.26.2



[PATCH v3 3/5] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page

2020-09-14 Thread Oscar Salvador
A page with 0-refcount and !PageBuddy could perfectly well be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we handle pcppages neither from the hard-offline nor from the
soft-offline path.

Fix this by draining pcplists whenever we find this kind of page and retry
the check again.  It might be that pcplists have been spilled into the
buddy allocator and so we can handle it.

Link: https://lkml.kernel.org/r/20200908075626.11976-4-osalva...@suse.de
Signed-off-by: Oscar Salvador 
Cc: Michal Hocko 
Cc: Naoya Horiguchi 
Cc: Oscar Salvador 
Cc: Qian Cai 
Cc: Tony Luck 
Signed-off-by: Andrew Morton 
Signed-off-by: Stephen Rothwell 
---
 mm/memory-failure.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 989fb3efdca6..4468c1eb5027 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -948,13 +948,13 @@ static int page_action(struct page_state *ps, struct page 
*p,
 }
 
 /**
- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
  * @page:  raw error page (hit by memory error)
  *
  * Return: return 0 if failed to grab the refcount, otherwise true (some
  * non-zero value.)
  */
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
 {
struct page *head = compound_head(page);
 
@@ -984,6 +984,26 @@ static int get_hwpoison_page(struct page *page)
return 0;
 }
 
+static int get_hwpoison_page(struct page *p)
+{
+   int ret;
+   bool drained = false;
+
+retry:
+   ret = __get_hwpoison_page(p);
+   if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
+   /*
+* The page might be in a pcplist, so try to drain those
+* and see if we are lucky.
+*/
+   drain_all_pages(page_zone(p));
+   drained = true;
+   goto retry;
+   }
+
+   return ret;
+}
+
 /*
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
-- 
2.26.2



[PATCH v3 5/5] mm,hwpoison: remove stale code

2020-09-14 Thread Oscar Salvador
Currently we call shake_page and then check whether the page is Buddy
because shake_page calls drain_all_pages, which sends pcp-pages back to
the buddy freelists, so we could have a chance to handle free pages.

get_hwpoison_page already calls drain_all_pages, and we do call
get_hwpoison_page right before coming here, so we should be on the safe
side.

Link: https://lkml.kernel.org/r/20200908075626.11976-6-osalva...@suse.de
Signed-off-by: Oscar Salvador 
Cc: Michal Hocko 
Cc: Naoya Horiguchi 
Cc: Oscar Salvador 
Cc: Qian Cai 
Cc: Tony Luck 
Signed-off-by: Andrew Morton 
Signed-off-by: Stephen Rothwell 
---
 mm/memory-failure.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4468c1eb5027..fbe174d54fad 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1421,18 +1421,6 @@ int memory_failure(unsigned long pfn, int flags)
 * walked by the page reclaim code, however that's not a big loss.
 */
shake_page(p, 0);
-   /* shake_page could have turned it free. */
-   if (!PageLRU(p) && is_free_buddy_page(p)) {
-   if (!take_page_off_buddy(p))
-   res = -EBUSY;
-
-   if (flags & MF_COUNT_INCREASED)
-   action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : 
MF_RECOVERED);
-   else
-   action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : 
MF_RECOVERED);
-
-   return res;
-   }
 
lock_page(p);
 
-- 
2.26.2



[PATCH v3 2/5] mm,hwpoison: refactor madvise_inject_error

2020-09-14 Thread Oscar Salvador
Make a proper if-else condition for {hard,soft}-offline.

[akpm: remove zone variable and refactor comment]
Link: https://lkml.kernel.org/r/20200908075626.11976-3-osalva...@suse.de
Signed-off-by: Oscar Salvador 
Cc: Michal Hocko 
Cc: Naoya Horiguchi 
Cc: Qian Cai 
Cc: Tony Luck 
Signed-off-by: Andrew Morton 
Signed-off-by: Stephen Rothwell 
---
 mm/madvise.c | 32 ++--
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index a8f18420efeb..4ce66bab53dd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -872,8 +872,6 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
 {
-   struct page *page;
-   struct zone *zone;
unsigned long size;
 
if (!capable(CAP_SYS_ADMIN))
@@ -882,6 +880,7 @@ static int madvise_inject_error(int behavior,
 
for (; start < end; start += size) {
unsigned long pfn;
+   struct page *page;
int ret;
 
ret = get_user_pages_fast(start, 1, 0, &page);
@@ -898,25 +897,22 @@ static int madvise_inject_error(int behavior,
 
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual 
address %#lx\n",
-   pfn, start);
-
+pfn, start);
ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
-   if (ret)
-   return ret;
-   continue;
+   } else {
+   pr_info("Injecting memory failure for pfn %#lx at 
process virtual address %#lx\n",
+pfn, start);
+   /*
+* Drop the page reference taken by
+* get_user_pages_fast(). In the absence of
+* MF_COUNT_INCREASED the memory_failure() routine is
+* responsible for pinning the page to prevent it
+* from being released back to the page allocator.
+*/
+   put_page(page);
+   ret = memory_failure(pfn, 0);
}
 
-   pr_info("Injecting memory failure for pfn %#lx at process 
virtual address %#lx\n",
-   pfn, start);
-
-   /*
-* Drop the page reference taken by get_user_pages_fast(). In
-* the absence of MF_COUNT_INCREASED the memory_failure()
-* routine is responsible for pinning the page to prevent it
-* from being released back to the page allocator.
-*/
-   put_page(page);
-   ret = memory_failure(pfn, 0);
if (ret)
return ret;
}
-- 
2.26.2



[PATCH v3 1/5] mm,hwpoison: take free pages off the buddy freelists

2020-09-14 Thread Oscar Salvador
Patch series "HWpoison: further fixes and cleanups", v2.

The important bit of this patchset is patch#1, which is a fix to take
HWPoison pages off a buddy freelist, since otherwise we can end up having
HWPoison pages back in the game without anyone noticing it.  So fix it (we
did that already for soft_offline_page [1]).

The other patches are clean-ups and not that important.

This patch (of 5):

The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that act as a gatekeeper for poisoned pages.  Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.

As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy systems, checks in place or not.

Let us fix this the same way we did for soft_offline [2], and take the page
off the buddy freelist, so it is completely unreachable.

Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then run some sort of memory stress
workload.

Just for reference, I put a dump_page in compaction_alloc to catch
HWPoison pages:

kernel: page:12b2982b refcount:1 mapcount:0 mapping: 
index:0x1 pfn:0x1d5db
kernel: flags: 0xfc080(hwpoison)
kernel: raw: 000fc080 ea7573c8 c9857de0 
kernel: raw: 0001  0001 
kernel: page dumped because: compaction_alloc

kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: GE 
5.9.0-rc2-mm1-1-default+ #5
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
kernel: Call Trace:
kernel:  dump_stack+0x6d/0x8b
kernel:  compaction_alloc+0xb2/0xc0
kernel:  migrate_pages+0x2a6/0x12a0
kernel:  ? isolate_freepages+0xc80/0xc80
kernel:  ? __ClearPageMovable+0xb0/0xb0
kernel:  compact_zone+0x5eb/0x11c0
kernel:  ? finish_task_switch+0x74/0x300
kernel:  ? lock_timer_base+0xa8/0x170
kernel:  proactive_compact_node+0x89/0xf0
kernel:  ? kcompactd+0x2d0/0x3a0
kernel:  kcompactd+0x2d0/0x3a0
kernel:  ? finish_wait+0x80/0x80
kernel:  ? kcompactd_do_work+0x350/0x350
kernel:  kthread+0x118/0x130
kernel:  ? kthread_associate_blkcg+0xa0/0xa0
kernel:  ret_from_fork+0x22/0x30

After that, if e.g: someone faults in the page, that someone will get
killed unexpectedly.

While we are at it, I also changed the action result for such cases.
I think that MF_DELAYED is a bit misleading, because in case we could
contain the page and take it off the buddy, such a page is not to be used
again unless it is unpoisoned, so we fixed the situation.

So unless I am missing something, I strongly think that we should report
MF_RECOVERED.

[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/patch/11694847/

Link: https://lkml.kernel.org/r/20200908075626.11976-1-osalva...@suse.de
Link: https://lkml.kernel.org/r/20200908075626.11976-2-osalva...@suse.de
Signed-off-by: Oscar Salvador 
Cc: Naoya Horiguchi 
Cc: Michal Hocko 
Cc: Tony Luck 
Cc: Qian Cai 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton 
Signed-off-by: Stephen Rothwell 
---
 mm/memory-failure.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 68f736add7cc..989fb3efdca6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1321,8 +1321,9 @@ int memory_failure(unsigned long pfn, int flags)
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
-   int res;
unsigned long page_flags;
+   int res = 0;
+   bool retry = true;
 
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1362,10 +1363,21 @@ int memory_failure(unsigned long pfn, int flags)
 * In fact it's dangerous to directly bump up page count from 0,
 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 */
+try_again:
if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
if (is_free_buddy_page(p)) {
-   action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
-   return 0;
+   if (take_page_off_buddy(p)) {
+   action_result(pfn, MF_MSG_BUDDY, MF_RECOVERED);
+   return 0;
+   } else {
+   /* We lost the race, try again */
+   if (retry) {
+   retry = false;
+   goto try_again;
+   }
+   action_res

Re: [RFC 3/5] mm, page_alloc(): remove setup_pageset()

2020-09-10 Thread Oscar Salvador
On Thu, Sep 10, 2020 at 11:23:07AM +0200, Oscar Salvador wrote:
> On Mon, Sep 07, 2020 at 06:36:26PM +0200, Vlastimil Babka wrote:
> > We initialize boot-time pagesets with setup_pageset(), which sets high and
> > batch values that effectively disable pcplists.
> > 
> > We can remove this wrapper if we just set these values for all pagesets in
> > pageset_init(). Non-boot pagesets then subsequently update them to specific
> > values.
> > 
> > Signed-off-by: Vlastimil Babka 
> 
> Reviewed-by: Oscar Salvador 

You would need to squash the removal of setup_pageset's declaration in here,
and then drop it from patch#4.

-- 
Oscar Salvador
SUSE L3


Re: [RFC 4/5] mm, page_alloc: cache pageset high and batch in struct zone

2020-09-10 Thread Oscar Salvador
On Mon, Sep 07, 2020 at 06:36:27PM +0200, Vlastimil Babka wrote:
   */
> -static void setup_pageset(struct per_cpu_pageset *p);
> +static void pageset_init(struct per_cpu_pageset *p);

This belongs in the respective patches.

> -static void zone_set_pageset_high_and_batch(struct zone *zone)
> +static void zone_set_pageset_high_and_batch(struct zone *zone, bool 
> force_update)
>  {
>   unsigned long new_high;
>   unsigned long new_batch;
> @@ -6256,6 +6256,14 @@ static void zone_set_pageset_high_and_batch(struct 
> zone *zone)
>   new_batch = max(1UL, 1 * new_batch);
>   }
>  
> + if (zone->pageset_high != new_high ||
> + zone->pageset_batch != new_batch) {
> + zone->pageset_high = new_high;
> + zone->pageset_batch = new_batch;
> + } else if (!force_update) {
> + return;
> + }

I am probably missing something obvious, so sorry, but why do we need
force_update here?
AFAICS, we only want to call pageset_update() in case zone->pageset_high/batch
and the newly computed high/batch differ, so if everything is equal, why do we
want to call it anyway?
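
FWIW, what I was expecting was something closer to this (sketch only):

        int cpu;

        /* Skip the per-cpu update entirely if nothing changed. */
        if (zone->pageset_high == new_high &&
            zone->pageset_batch == new_batch)
                return;

        zone->pageset_high = new_high;
        zone->pageset_batch = new_batch;

        for_each_possible_cpu(cpu)
                pageset_update(&per_cpu_ptr(zone->pageset, cpu)->pcp,
                               new_high, new_batch);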

-- 
Oscar Salvador
SUSE L3


Re: [PATCH] mm: don't rely on system state to detect hot-plug operations

2020-09-10 Thread Oscar Salvador
On Thu, Sep 10, 2020 at 01:35:32PM +0200, Laurent Dufour wrote:
 
> That points has been raised by David, quoting him here:
> 
> > IIRC, ACPI can hotadd memory while SCHEDULING, this patch would break that.
> > 
> > Ccing Oscar, I think he mentioned recently that this is the case with ACPI.
> 
> Oscar told that he need to investigate further on that.

I think my reply got lost.

We can see ACPI hotplug operations during SYSTEM_SCHEDULING:

$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host 
-monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k  \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device 
pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device 
pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device 
pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728 -device 
pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
-object memory-backend-ram,id=memdimm4,size=134217728 -device 
pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
-object memory-backend-ram,id=memdimm5,size=134217728 -device 
pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
-object memory-backend-ram,id=memdimm6,size=134217728 -device 
pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

kernel: [0.753643] __add_memory: nid: 0 start: 01 - 010800 
(size: 134217728)
kernel: [0.756950] register_mem_sect_under_node: system_state= 1

kernel: [0.760811]  register_mem_sect_under_node+0x4f/0x230
kernel: [0.760811]  walk_memory_blocks+0x80/0xc0
kernel: [0.760811]  link_mem_sections+0x32/0x40
kernel: [0.760811]  add_memory_resource+0x148/0x250
kernel: [0.760811]  __add_memory+0x5b/0x90
kernel: [0.760811]  acpi_memory_device_add+0x130/0x300
kernel: [0.760811]  acpi_bus_attach+0x13c/0x1c0
kernel: [0.760811]  acpi_bus_attach+0x60/0x1c0
kernel: [0.760811]  acpi_bus_scan+0x33/0x70
kernel: [0.760811]  acpi_scan_init+0xea/0x21b
kernel: [0.760811]  acpi_init+0x2f1/0x33c
kernel: [0.760811]  do_one_initcall+0x46/0x1f4



-- 
Oscar Salvador
SUSE L3


Re: [PATCH] mm: don't rely on system state to detect hot-plug operations

2020-09-10 Thread Oscar Salvador
On Thu, Sep 10, 2020 at 02:48:47PM +0200, Michal Hocko wrote:
> > Is there any actual usecase for a configuration like this? What is the
> > point to statically define additional memory like this when the same can
> > be achieved on the same command line?

Well, for qemu I am not sure, but if David is right, it seems you can face
the same thing if you reboot a VM with hotplugged memory.
Moreover, it seems that the problem we spotted in [1] was on a VM running on
Proxmox (KVM).
The hypervisor probably said at boot time "Hey, I do have these ACPI devices,
care to enable them now?".

As always, there are all sorts of configurations/scenarios out there in the 
wild.

> Forgot to ask one more thing. Who is going to online that memory when
> userspace is not running yet?

It depends: if you have CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE set, or you
specify memhp_default_online_type=[online,online_*], memory will get onlined
right after the hot-add stage:

/* online pages if requested */
if (memhp_default_online_type != MMOP_OFFLINE)
walk_memory_blocks(start, size, NULL, online_memory_block);

If not, systemd-udev will do the magic once the system is up.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 1/3] mm: replace memmap_context by memplug_context

2020-09-15 Thread Oscar Salvador
On Tue, Sep 15, 2020 at 11:41:41AM +0200, Laurent Dufour wrote:
> The memmap_context is used to detect whether a memory operation is due to a
> hot-add operation or happening at boot time.
> 
> Make it general to the hotplug operation and rename it at memplug_context.
> 
> There is no functional change introduced by this patch
> 
> Suggested-by: David Hildenbrand 
> Signed-off-by: Laurent Dufour 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 2/3] mm: don't rely on system state to detect hot-plug operations

2020-09-15 Thread Oscar Salvador
On Tue, Sep 15, 2020 at 11:41:42AM +0200, Laurent Dufour wrote:
> [1] According to Oscar Salvador, using this qemu command line, ACPI memory
> hotplug operations are raised at SYSTEM_SCHEDULING state:

I would like to stress that this is not the only way we can end up
hotplugging memory while state == SYSTEM_SCHEDULING.
According to David, we can end up doing this if we reboot a VM
with hotplugged memory.
(And I have seen other virtualization technologies do the same)

 
> Fixes: 4fbce633910e ("mm/memory_hotplug.c: make 
> register_mem_sect_under_node() a callback of walk_memory_range()")
> Signed-off-by: Laurent Dufour 
> Reviewed-by: David Hildenbrand 
> Cc: sta...@vger.kernel.org
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 3/3] mm: don't panic when links can't be created in sysfs

2020-09-15 Thread Oscar Salvador
On Tue, Sep 15, 2020 at 11:41:43AM +0200, Laurent Dufour wrote:
> At boot time, or when doing memory hot-add operations, if the links in
> sysfs can't be created, the system is still able to run, so just report the
> error in the kernel log rather than BUG_ON and potentially make system
> unusable because the callpath can be called with locks held.
> 
> Since the number of memory blocks managed could be high, the messages are
> rate limited.
> 
> As a consequence, link_mem_sections() has no status to report anymore.
> 
> Signed-off-by: Laurent Dufour 
> Acked-by: Michal Hocko 
> Acked-by: David Hildenbrand 
> Cc: Greg Kroah-Hartman 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v3 0/5] HWpoison: further fixes and cleanups

2020-09-16 Thread Oscar Salvador
On Tue, Sep 15, 2020 at 05:22:22PM -0400, Aristeu Rozanski wrote:
> Hi Oscar, Naoya,

Hi Aristeu,

thanks for reporting this.

> I've run these tests using mmotm and mmotm with this patchset on top.

Could you please re-run the tests with the below patch applied, and
attach the logs here?

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 84a7f228af36..d7b6e7724e47 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -67,6 +67,7 @@ atomic_long_t num_poisoned_pages __read_mostly = 
ATOMIC_LONG_INIT(0);
 
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, 
bool release)
 {
+   dump_page(page, "page_handle_poison");
if (release) {
put_page(page);
drain_all_pages(page_zone(page));
@@ -77,7 +78,7 @@ static bool page_handle_poison(struct page *page, bool 
hugepage_or_freepage, boo
 * Doing this check for free pages is also fine since 
dissolve_free_huge_page
 * returns 0 for non-hugetlb pages as well.
 */
-   if (dissolve_free_huge_page(page) || !take_page_off_buddy(page))
+   if (dissolve_free_huge_page(page) || 
!take_page_off_buddy(page)) {
/*
 * We could fail to take off the target page from buddy
 * for example due to racy page allocaiton, but that's
@@ -85,7 +86,9 @@ static bool page_handle_poison(struct page *page, bool 
hugepage_or_freepage, boo
 * and if someone really want to use it, they should
 * take it.
 */
+   pr_info("%s: hugepage_or_freepage failed¸n", __func__);
return false;
+   }
}
 
SetPageHWPoison(page);
@@ -1858,8 +1861,11 @@ static int __soft_offline_page(struct page *page)
if (!ret) {
bool release = !huge;
 
-   if (!page_handle_poison(page, true, release))
+   if (!page_handle_poison(page, true, release)) {
+   pr_info("%s: page_handle_poison -EBUSY\n", 
__func__);
+   dump_page(page, "__soft_offline_page after 
migrate");
ret = -EBUSY;
+   }
} else {
if (!list_empty(&pagelist))
putback_movable_pages(&pagelist);
@@ -1872,6 +1878,7 @@ static int __soft_offline_page(struct page *page)
} else {
pr_info("soft offline: %#lx: %s isolation failed: %d, page 
count %d, type %lx (%pGp)\n",
pfn, msg_page[huge], ret, page_count(page), 
page->flags, &page->flags);
+   dump_page(page, "__soft_offline_page isolation failed");
ret = -EBUSY;
}
return ret;
@@ -1882,8 +1889,11 @@ static int soft_offline_in_use_page(struct page *page)
struct page *hpage = compound_head(page);
 
if (!PageHuge(page) && PageTransHuge(hpage))
-   if (try_to_split_thp_page(page, "soft offline") < 0)
+   if (try_to_split_thp_page(page, "soft offline") < 0) {
+   pr_info("%s: try_to_split_thp_page -EBUSY\n", __func__);
+   dump_page(page, "try_to_split_thp_page");
return -EBUSY;
+   }
return __soft_offline_page(page);
 }
 
@@ -1891,8 +1901,11 @@ static int soft_offline_free_page(struct page *page)
 {
int rc = 0;
 
-   if (!page_handle_poison(page, true, false))
+   if (!page_handle_poison(page, true, false)) {
+   pr_info("%s: page_handle_poison -EBUSY\n", __func__);
+       dump_page(page, "soft_offline_free_page");
rc = -EBUSY;
+   }
 
return rc;
 }

Thanks

-- 
Oscar Salvador
SUSE L3


Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim

2020-10-27 Thread Oscar Salvador
On Wed, Oct 07, 2020 at 09:17:45AM -0700, Dave Hansen wrote:
> Signed-off-by: Dave Hansen 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 

I am still going through all the details, but just my thoughts on things
that caught my eye:

> --- a/include/linux/migrate.h~demote-with-migrate_pages   2020-10-07 
> 09:15:31.028642442 -0700
> +++ b/include/linux/migrate.h 2020-10-07 09:15:31.034642442 -0700
> @@ -27,6 +27,7 @@ enum migrate_reason {
>   MR_MEMPOLICY_MBIND,
>   MR_NUMA_MISPLACED,
>   MR_CONTIG_RANGE,
> + MR_DEMOTION,
>   MR_TYPES

I think you also need to add it under include/trace/events/migrate.h, so
mm_migrate_pages event can know about it.
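
IOW, the tail of the MIGRATE_REASON list in there would end up looking
something like this (just a sketch, pick whatever string you prefer):

        EM( MR_CONTIG_RANGE,    "contig_range")         \
        EMe(MR_DEMOTION,        "demotion")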

> +bool migrate_demote_page_ok(struct page *page, struct scan_control *sc)

Make it static?
Also, scan_control seems to be unused here.

> +{
> + int next_nid = next_demotion_node(page_to_nid(page));
> +
> + VM_BUG_ON_PAGE(!PageLocked(page), page);

Right after the call to migrate_demote_page_ok, we call unlock_page
which already has this check in place.
I know that this is only to be on the safe side and we do not lose anything,
but just my thoughts.

> +static struct page *alloc_demote_page(struct page *page, unsigned long node)
> +{
> + /*
> +  * Try to fail quickly if memory on the target node is not
> +  * available.  Leaving out __GFP_IO and __GFP_FS helps with
> +  * this.  If the desintation node is full, we want kswapd to
> +  * run there so that its pages will get reclaimed and future
> +  * migration attempts may succeed.
> +  */
> + gfp_t flags = (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY |
> +__GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE |
> +__GFP_KSWAPD_RECLAIM);

I think it would be nicer to have this as a real GFP_ thingy defined.
e.g: GFP_DEMOTION
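
Something along these lines (just a sketch, reusing the exact flags from your
snippet above; name and placement are of course up to you):

        #define GFP_DEMOTION    (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY | \
                                 __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE | \
                                 __GFP_KSWAPD_RECLAIM)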

> + /* HugeTLB pages should not be on the LRU */
> + WARN_ON_ONCE(PageHuge(page));

I am not sure about this one.
This could only happen if the page, which is now on another list, ends up in
the buddy system. That is quite unlikely tbh.
And nevertheless, this is only a warning, which means that if this scenario does
happen, we will be allocating a single page to satisfy a higher-order page, and
I am not sure about the situation we will end up with.

> +
> + if (PageTransHuge(page)) {
> + struct page *thp;
> +
> + flags |= __GFP_COMP;
> +
> + thp = alloc_pages_node(node, flags, HPAGE_PMD_ORDER);
> + if (!thp)
> + return NULL;
> + prep_transhuge_page(thp);
> + return thp;
> + }
> +
> + return __alloc_pages_node(node, flags, 0);

Would it make sense to transform this into some sort of new_demotion_page,
which actually calls alloc_migration_target with the right stuff in place?
And then pass a struct migration_target_control so alloc_migration_target
does the right thing.
alloc_migration_target also takes care of calling prep_transhuge_page
when needed.
e.g:

static struct page *new_demotion_node(struct page *page, unsigned long private)
{
struct migration_target_control mtc = {
.nid = private,
.gfp_mask = GFP_DEMOTION,
};

if (PageTransHuge(page))
mtc.gfp_mask |= __GFP_COMP;

return alloc_migration_target(page, (unsigned long)&mtc);
}

The only thing I see is that alloc_migration_target seems to "override"
the gfp_mask and ORs in GFP_TRANSHUGE for THP pages, which includes
__GFP_DIRECT_RECLAIM (not appreciated in this case).
But maybe this can be worked around by checking if gfp_mask == GFP_DEMOTION,
and if so, just keep the mask as it is.
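
Roughly something like this in the THP branch of alloc_migration_target
(only a sketch, assuming the GFP_DEMOTION define suggested above and that the
caller then leaves __GFP_COMP to this path instead of ORing it in itself):

        if (PageTransHuge(page)) {
                /*
                 * Do not let GFP_TRANSHUGE sneak __GFP_DIRECT_RECLAIM in
                 * for demotion; only add what a THP allocation needs.
                 */
                if (gfp_mask == GFP_DEMOTION)
                        gfp_mask |= __GFP_COMP;
                else
                        gfp_mask |= GFP_TRANSHUGE;
                order = HPAGE_PMD_ORDER;
        }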

> +
> + if (list_empty(demote_pages))
> + return 0;
> +
> + /* Demotion ignores all cpuset and mempolicy settings */
> + err = migrate_pages(demote_pages, alloc_demote_page, NULL,
> + target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> +     &nr_succeeded);

As I said, instead of alloc_demote_page, use a new_demotion_page and let
alloc_migration_target handle the allocations and the THP prepping.


-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH 0/3] Allocate memmap from hotadded memory (per device)

2020-10-27 Thread Oscar Salvador
On Thu, Oct 22, 2020 at 03:01:44PM +0200, David Hildenbrand wrote:
> > This does not go without saying that the patchset is not 100% complete.
> > It is missing:
> > 
> >  - a way to disable memmap_on_memory (either sysfs or boot_time cmd)
> >  - atm, arch_add_memory for s390 screams if an altmap is passed.
> >I am still thinking of a way to nicely drop handle that.
> >Maybe a function in s390 that sets memmap_on_memory false and
> >stuff that check in support_memmap_on_memory function.
> 
> Or simply implement altmap support ... shouldn't be too hard :)

Yeah, I guess so, but first I would like to have everything else settled.
So, gentle ping :-)


-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH 0/3] Allocate memmap from hotadded memory (per device)

2020-10-27 Thread Oscar Salvador
On Tue, Oct 27, 2020 at 04:44:33PM +0100, David Hildenbrand wrote:
> I'm planning on looking into patch #2/3 later this or next week (this week
> is open source summit / KVM Forum).

Sure, appreciated the time ;-)

> 
> One thing to look into right now is how to make this fly this with vmemmap
> optimizations for hugetlb pages.
> 
> https://lkml.kernel.org/r/20201026145114.59424-1-songmuc...@bytedance.com

I was about to have a look at that series either way, but good that you mentioned it.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 06/12] mm/hugetlb: Allocate the vmemmap pages associated with each HugeTLB page

2020-12-11 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 11:55:20AM +0800, Muchun Song wrote:
> When we free a HugeTLB page to the buddy allocator, we should allocate the
> vmemmap pages associated with it. We can do that in the __free_hugepage()
"vmemmap pages that describe the range" would look better to me, but it is ok.

> +#define GFP_VMEMMAP_PAGE \
> + (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH | __GFP_NOWARN)
>  
>  #ifndef VMEMMAP_HPAGE_SHIFT
>  #define VMEMMAP_HPAGE_SHIFT  HPAGE_SHIFT
> @@ -197,6 +200,11 @@
>   (__boundary - 1 < (end) - 1) ? __boundary : (end);   \
>  })
>  
> +typedef void (*vmemmap_remap_pte_func_t)(struct page *reuse, pte_t *pte,
> +  unsigned long start, unsigned long end,
> +  void *priv);

Any reason not to have defined GFP_VMEMMAP_PAGE and the new typedef in
hugetlb_vmemmap.h?

  
> +static void vmemmap_restore_pte_range(struct page *reuse, pte_t *pte,
> +   unsigned long start, unsigned long end,
> +   void *priv)
> +{
> + pgprot_t pgprot = PAGE_KERNEL;
> + void *from = page_to_virt(reuse);
> + unsigned long addr;
> + struct list_head *pages = priv;
[...]
> +
> + /*
> +  * Make sure that any data that writes to the @to is made
> +  * visible to the physical page.
> +  */
> + flush_kernel_vmap_range(to, PAGE_SIZE);

Correct me if I am wrong, but flush_kernel_vmap_range is a NOOP under arches 
which
do not have ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE.
Since we only enable support for x86_64, and x86_64 is one of those arches,
could we remove this and introduce it later on in case we enable this feature
on an arch that needs it?

I am not sure if you need to flush the range somehow, as you did in
vmemmap_remap_range.

> +retry:
> + page = alloc_page(GFP_VMEMMAP_PAGE);
> + if (unlikely(!page)) {
> + msleep(100);
> + /*
> +  * We should retry infinitely, because we cannot
> +  * handle allocation failures. Once we allocate
> +  * vmemmap pages successfully, then we can free
> +  * a HugeTLB page.
> +  */
> + goto retry;

I think this is the trickiest part.
With 2MB HugeTLB pages we only need 6 pages, but with 1GB, the number of pages
we need to allocate increases significantly (4088 pages IIRC).
And you are using __GFP_HIGH, which will allow us to use more memory (by
cutting down the watermark), but it might lead to putting the system
on its knees wrt. memory.
And yes, I know that once we allocate the 4088 pages, 1GB gets freed, but
still.

I would like to hear Michal's thoughts on this one, but I wonder if it makes
sense to not let 1GB-HugeTLB pages be freed.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 07/12] mm/hugetlb: Set the PageHWPoison to the raw error page

2020-12-11 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 11:55:21AM +0800, Muchun Song wrote:
> +static inline void subpage_hwpoison_deliver(struct hstate *h, struct page 
> *head)
> +{
> + struct page *page = head;
> +
> + if (!free_vmemmap_pages_per_hpage(h))
> + return;
> +
> + if (PageHWPoison(head))
> + page = head + page_private(head + 4);
> +
> + /*
> +  * Move PageHWPoison flag from head page to the raw error page,
> +  * which makes any subpages rather than the error page reusable.
> +  */
> + if (page != head) {
> + SetPageHWPoison(page);
> + ClearPageHWPoison(head);
> + }
> +}

I would make the names coherent.
I am definitely not good at names, but something like
hwpoison_subpage_{foo,bar} looks better.

Also, could not subpage_hwpoison_deliver be rewritten like:

  static inline void subpage_hwpoison_deliver(struct hstate *h, struct page 
*head)
  {
   struct page *page;
  
   if (!PageHWPoison(head) || !free_vmemmap_pages_per_hpage(h))
   return;
  
   page = head + page_private(head + 4);
   /*
* Move PageHWPoison flag from head page to the raw error page,
* which makes any subpages rather than the error page reusable.
*/
   if (page != head) {
   SetPageHWPoison(page);
   ClearPageHWPoison(head);
   }
  }

I think it is better code-wise.

> +  * Move PageHWPoison flag from head page to the raw error page,
> +  * which makes any subpages rather than the error page reusable.
> +  */
> + if (page != head) {
> + SetPageHWPoison(page);
> + ClearPageHWPoison(head);
> + }

I would put this in an else-if above:

if (free_vmemmap_pages_per_hpage(h)) {
set_page_private(head + 4, page - head);
return;
} else if (page != head) {
SetPageHWPoison(page);
ClearPageHWPoison(head);
}

or will we lose the optimization in case free_vmemmap_pages_per_hpage gets 
compiled out?


-- 
Oscar Salvador
SUSE L3


[PATCH] mm,hwpoison: Return -EBUSY when migration fails

2020-12-09 Thread Oscar Salvador
Currently, we return -EIO when we fail to migrate the page.

Migrations' failures are rather transient as they can happen due to
several reasons, e.g: high page refcount bump, mapping->migrate_page
failing etc.
All meaning that at that time the page could not be migrated, but
that has nothing to do with an EIO error.

Let us return -EBUSY instead, as we do in case we failed to isolate
the page.

While at it, let us remove the "ret" print as its value does not change.

Signed-off-by: Oscar Salvador 
---
 mm/memory-failure.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 428991e297e2..1942fb83ac64 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1849,11 +1849,11 @@ static int __soft_offline_page(struct page *page)
pr_info("soft offline: %#lx: %s migration failed %d, 
type %lx (%pGp)\n",
pfn, msg_page[huge], ret, page->flags, 
&page->flags);
if (ret > 0)
-   ret = -EIO;
+   ret = -EBUSY;
}
} else {
-   pr_info("soft offline: %#lx: %s isolation failed: %d, page 
count %d, type %lx (%pGp)\n",
-   pfn, msg_page[huge], ret, page_count(page), 
page->flags, &page->flags);
+   pr_info("soft offline: %#lx: %s isolation failed, page count 
%d, type %lx (%pGp)\n",
+   pfn, msg_page[huge], page_count(page), page->flags, 
&page->flags);
ret = -EBUSY;
}
return ret;
-- 
2.26.2



Re: [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory

2020-12-09 Thread Oscar Salvador
On Wed, Dec 02, 2020 at 10:36:54AM +0100, David Hildenbrand wrote:
> Instead of adding these arch callbacks, what about a config option
> 
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> 
> that gets selected by the archs with CONFIG_SPARSEMEM_VMEMMAP ?
> 
> The mhp_supports_memmap_on_memory() becomes even more trivial.

I think that would not be enough.
E.g: s390x supports CONFIG_SPARSEMEM_VMEMMAP but it does not support
altmap (and maybe other arches I did not check too).
That is why I was careful in choosing the ones that a) support
CONFIG_SPARSEMEM_VMEMMAP and b) support altmap.

> > Note that mhp_memmap_on_memory kernel boot option will be added in
> > a coming patch.
> 
> I think it makes sense to
> 
> a) separate off the arch changes into separate patches, clarifying why
> it can be used. Move this patches to the end of the series.
> 
> b) Squashing the remainings into patch #2

Ok, I can do that.

> > +/*
> > + * We want memmap (struct page array) to be self contained.
> > + * To do so, we will use the beginning of the hot-added range to build
> > + * the page tables for the memmap array that describes the entire range.
> > + * Only selected architectures support it with SPARSE_VMEMMAP.
> 
> You might want to add how the caller can calculate the necessary size
> and that that this calculated piece of memory to be added will be
> accessed before onlining these pages. This is e.g., relevant if
> virtio-mem, the hyper-v balloon, or xen balloon would want to use this
> mechanism. Also, it's somewhat incompatible with standby memory where
> memory cannot be accessed prior to onlining. So pointing that access out
> might be valuable.

Sure, I will be more verbose.

> You can simplify to
> 
> return arch_support_memmap_on_memory() &&
>    IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) &&
>size == memory_block_size_bytes();

Yeah, thanks ;-)

-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH v3 3/4] mm,memory_hotplug: Enable MHP_MEMMAP_ON_MEMORY when supported

2020-12-09 Thread Oscar Salvador
On Wed, Dec 02, 2020 at 10:37:23AM +0100, David Hildenbrand wrote:
 
> Please split that patch into two parts, one for each subsystem.

I did not feel the need but if it eases the review, why not :-)


-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH v3 1/4] mm,memory_hotplug: Add mhp_supports_memmap_on_memory

2020-12-09 Thread Oscar Salvador
On Wed, Dec 09, 2020 at 10:40:13AM +0100, David Hildenbrand wrote:
> Sorry if I was unclear, s390x will simply not set
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.

Bleh, that makes sense now.
I'm in a monday.. 

Thanks David

-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH v3 4/4] mm,memory_hotplug: Add mhp_memmap_on_memory boot option

2020-12-09 Thread Oscar Salvador
On Wed, Dec 02, 2020 at 10:42:18AM +0100, David Hildenbrand wrote:
> I have another memhp tunable in the works. I suggest doing it like
> page_shuffling and using, module parameters instead. Makes this
> a bit nicer IMHO.

Does that have any impact?

> diff --git a/mm/Makefile b/mm/Makefile
> index 069f216e109e..ba7714b5eaa1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -58,9 +58,13 @@ obj-y:= filemap.o mempool.o 
> oom_kill.o fadvise.o \
>  page-alloc-y := page_alloc.o
>  page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>  
> +# Give "memory_hotplug" its own module-parameter namespace
> +memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) := memory_hotplug.o
> +
>  obj-y += page-alloc.o
>  obj-y += init-mm.o
>  obj-y += memblock.o
> +obj-y += $(memory-hotplug-y)
>  
>  ifdef CONFIG_MMU
> obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
> @@ -82,7 +86,6 @@ obj-$(CONFIG_SLAB) += slab.o
>  obj-$(CONFIG_SLUB) += slub.o
>  obj-$(CONFIG_KASAN)+= kasan/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
> -obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
>  obj-$(CONFIG_MEMTEST)  += memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> 
> 
> The you can just use module_param/MODULE_PARM_DESC and set the parameter via
> 
> "memory_hotplug.memmap_on_memory"

I have to confess that I was not aware of this trick, but looks cleaner
overall.
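
So IIUC, something like the below in mm/memory_hotplug.c (just a sketch) would
then show up as "memory_hotplug.memmap_on_memory":

        static bool memmap_on_memory __ro_after_init;
        module_param(memmap_on_memory, bool, 0444);
        MODULE_PARM_DESC(memmap_on_memory,
                         "Enable memmap on memory for memory hotplug");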

Thanks

-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range

2020-12-09 Thread Oscar Salvador
_cb);
> > +   if (nr_vmemmap_pages) {
> > +   mhp_altmap.alloc = nr_vmemmap_pages;
> > +   altmap = &mhp_altmap;
> > +   }
> > +
> 
> If someone would remove_memory() in a different granularity than
> add_memory(), this would no longer work. How can we catch that
> efficiently? Or at least document that this is not supported with
> memmap_on_memory).

Well, we can check whether the size spans more than a single memory block.
And if it does, we can print a warning with pr_warn and refuse to remove
the memory.
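
Something like the below, somewhere in the removal path (only a sketch, and
assuming we know by then whether the range was added with its vmemmap on it):

        if (nr_vmemmap_pages && size != memory_block_size_bytes()) {
                pr_warn("memory_hotplug: memmap_on_memory ranges must be removed in memory-block granularity\n");
                return -EINVAL;
        }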

We could document that "memory_hotplug.memmap_on_memory" is meant to operate
with ranges that span a single memory block.
And I guess that we could place that documentation under [1].

[1] Documentation/admin-guide/kernel-parameters.txt

Thanks for the review David

-- 
Oscar Salvador
SUSE L3


Re: [PATCH] mm,hwpoison: Return -EBUSY when migration fails

2020-12-09 Thread Oscar Salvador
On Wed, Dec 09, 2020 at 10:59:04AM +0100, David Hildenbrand wrote:
> On 09.12.20 10:28, Oscar Salvador wrote:
> Do we expect callers to retry immediately? -EAGAIN might make also
> sense. But -EBUSY is an obvious improvement. Do we have callers relying
> on this behavior?

Not really, unless something like LTP takes a look at the error code and retries
in case of EBUSY.

Take into account that most of the callers do not even really check the
return code (GHES, RAS/CEC, etc.)

-- 
Oscar Salvador
SUSE L3


Re: [PATCH] mm,hwpoison: Return -EBUSY when migration fails

2020-12-09 Thread Oscar Salvador
On Wed, Dec 09, 2020 at 11:25:31AM +0100, Vlastimil Babka wrote:
> On 12/9/20 10:28 AM, Oscar Salvador wrote:
> > Currently, we return -EIO when we fail to migrate the page.
> > 
> > Migrations' failures are rather transient as they can happen due to
> > several reasons, e.g: high page refcount bump, mapping->migrate_page
> > failing etc.
> > All meaning that at that time the page could not be migrated, but
> > that has nothing to do with an EIO error.
> > 
> > Let us return -EBUSY instead, as we do in case we failed to isolate
> > the page.
> > 
> > While are it, let us remove the "ret" print as its value does not change.
> > 
> > Signed-off-by: Oscar Salvador 
> 
> Acked-by: Vlastimil Babka 
> 
> Technically this affects madvise(2) so let's cc linux-api. The manpage doesn't
> document error codes of MADV_HWPOISON and MADV_SOFT_OFFLINE (besides EPERM)
> though so nothing to adjust there. It's meant only for the hwpoison testing
> suite anyway.

Well, not only for the hwpoison testing suite.

RAS/CEC and GHES also use soft_offline_page by means of memory_failure_queue
in case a page cec count goes beyond a certain threshold, but they do not
really check the return code.

Only madvise does.


-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range

2020-12-09 Thread Oscar Salvador
On Wed, Dec 09, 2020 at 11:32:26AM +0100, Oscar Salvador wrote:
> On Wed, Dec 02, 2020 at 11:05:53AM +0100, David Hildenbrand wrote:
> > If you take a look at generic_online_page() there are some things that
> > won't be done for our vmemmap pages
> > 
> > 1. kernel_map_pages(page, 1 << order, 1);
> > 
> > We're accessing these pages already when initializing the memmap. We
> > might have to explicitly map these vmemmap pages at some point. Might
> > require some thought. Did you test with debug pagealloc?
> 
> I always try to run with all debug stuff enabled, but I definitely
> did not enable debug_pagealloc.
> I will have a look at it.
> 
> > 2. totalram_pages_add(1UL << order);
> > 
> > We should add/remove the vmemmap pages manually from totalram I guess.
> 
> Yes, we should. That was a clear oversight.

Looking closer, I do not think we have to account those into totalram.
I might be mistaken, but looking at memblock_free_all, it seems we only
account to totalram_pages those ranges lying in memblock.memory, filtering
out memblock.reserved.

And it seems that the pages we use for pglist_data structs (the ones we handle
in register_page_bootmem_info_node) fall in memblock.reserved.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 12/12] mm/hugetlb: Optimize the code with the help of the compiler

2020-12-10 Thread Oscar Salvador

On 2020-12-10 04:55, Muchun Song wrote:

We cannot optimize if a "struct page" crosses page boundaries. If
it is true, we can optimize the code with the help of a compiler.
When free_vmemmap_pages_per_hpage() returns zero, most functions are
optimized by the compiler.


As I said earlier, I would squash this patch with patch#10 and
remove the !is_power_of_2 check in hugetlb_vmemmap_init and leave
only the check for the boot parameter.
That should be enough.


 static inline bool is_hugetlb_free_vmemmap_enabled(void)
 {
-   return hugetlb_free_vmemmap_enabled;
+   return hugetlb_free_vmemmap_enabled &&
+  is_power_of_2(sizeof(struct page));


Why? hugetlb_free_vmemmap_enabled can only become true
if the is_power_of_2 check succeeds in early_hugetlb_free_vmemmap_param.
The "is_power_of_2" check here can go.


diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 0a1c0d33a316..5f5e90c81cd2 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -21,7 +21,7 @@ void free_huge_page_vmemmap(struct hstate *h, struct
page *head);
  */
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate 
*h)

 {
-   return h->nr_free_vmemmap_pages;
+	return h->nr_free_vmemmap_pages && is_power_of_2(sizeof(struct 
page));


If hugetlb_free_vmemmap_enabled is false, hugetlb_vmemmap_init() leaves
h->nr_free_vmemmap_pages at 0, so there is no need for the is_power_of_2
check here.



--
Oscar Salvador
SUSE L3


Re: [PATCH v8 10/12] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate

2020-12-10 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 11:55:24AM +0800, Muchun Song wrote:
> +void __init hugetlb_vmemmap_init(struct hstate *h)
> +{
> + unsigned int nr_pages = pages_per_huge_page(h);
> + unsigned int vmemmap_pages;
> +
> + /* We cannot optimize if a "struct page" crosses page boundaries. */
> + if (!is_power_of_2(sizeof(struct page)))
> + return;
> +
> + if (!hugetlb_free_vmemmap_enabled)
> + return;

I think it would make sense to squash the last patch and this one.
As per the last patch, if "struct page" is not power of 2,
early_hugetlb_free_vmemmap_param() does not set
hugetlb_free_vmemmap_enabled, so the "!is_power_of_2" check from above
would become useless here.
We know that in order for hugetlb_free_vmemmap_enabled to become true,
the is_power_of_2 check must have succeeded early on when calling the early_
function.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 09/12] mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap

2020-12-10 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 11:55:23AM +0800, Muchun Song wrote:
> +hugetlb_free_vmemmap
> + When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
> + unused vmemmap pages associated each HugeTLB page.
  ^^^ with

> - if (end - start < PAGES_PER_SECTION * sizeof(struct page))
> + if (is_hugetlb_free_vmemmap_enabled())
> + err = vmemmap_populate_basepages(start, end, node, NULL);
> + else if (end - start < PAGES_PER_SECTION * sizeof(struct page))
>   err = vmemmap_populate_basepages(start, end, node, NULL);

Not sure if joining those in an OR makes sense.

>   else if (boot_cpu_has(X86_FEATURE_PSE))
>   err = vmemmap_populate_hugepages(start, end, node, altmap);
> @@ -1610,7 +1613,8 @@ void register_page_bootmem_memmap(unsigned long 
> section_nr,
>   }
>   get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
>  
> - if (!boot_cpu_has(X86_FEATURE_PSE)) {
> + if (!boot_cpu_has(X86_FEATURE_PSE) ||
> + is_hugetlb_free_vmemmap_enabled()) {

I would add a variable at the beginning called "basepages_populated"
that holds the result of those two conditions.
I am not sure if it improves the generated code much, as the conditions would
not need to be rechecked, but it improves the readability a bit.
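
E.g. (just a sketch to show what I mean):

        bool basepages_populated = !boot_cpu_has(X86_FEATURE_PSE) ||
                                   is_hugetlb_free_vmemmap_enabled();

and then test basepages_populated wherever the combined condition is needed.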

> +bool hugetlb_free_vmemmap_enabled;
> +
> +static int __init early_hugetlb_free_vmemmap_param(char *buf)
> +{
> + if (!buf)
> + return -EINVAL;
> +
> + /* We cannot optimize if a "struct page" crosses page boundaries. */

I think this comment belongs to the last patch.


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 00/12] Free some vmemmap pages of HugeTLB page

2020-12-10 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 11:55:14AM +0800, Muchun Song wrote:
> Muchun Song (12):
>   mm/memory_hotplug: Factor out bootmem core functions to bootmem_info.c
>   mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
>   mm/bootmem_info: Introduce free_bootmem_page helper
>   mm/hugetlb: Free the vmemmap pages associated with each HugeTLB page
>   mm/hugetlb: Defer freeing of HugeTLB pages
>   mm/hugetlb: Allocate the vmemmap pages associated with each HugeTLB
> page
>   mm/hugetlb: Set the PageHWPoison to the raw error page
>   mm/hugetlb: Flush work when dissolving hugetlb page
>   mm/hugetlb: Add a kernel parameter hugetlb_free_vmemmap
>   mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate
>   mm/hugetlb: Gather discrete indexes of tail page
>   mm/hugetlb: Optimize the code with the help of the compiler

Well, we went from 24 patches down to 12 patches.
Not bad at all :-)

I will have a look later

Thanks


-- 
Oscar Salvador
SUSE L3


Re: [External] Re: [PATCH v8 12/12] mm/hugetlb: Optimize the code with the help of the compiler

2020-12-10 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 08:14:18PM +0800, Muchun Song wrote:
> Yeah, you are right. But if we do this check can make the code simple.
> 
> For example, here is a code snippet.
> 
> void func(void)
> {
> if (free_vmemmap_pages_per_hpage())
> return;
> /* Do something */
> }
> 
> With this patch, the func will be optimized to null when is_power_of_2
> returns false.
> 
> void func(void)
> {
> }
> 
> Without this patch, the compiler cannot do this optimization.

Ok, I misread the changelog.

So, then is_hugetlb_free_vmemmap_enabled, free_huge_page_vmemmap, 
free_vmemmap_pages_per_hpage and hugetlb_vmemmap_init are optimized
out, right?

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 03/12] mm/bootmem_info: Introduce free_bootmem_page helper

2020-12-10 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 11:55:17AM +0800, Muchun Song wrote:
> Any memory allocated via the memblock allocator and not via the buddy
> will be makred reserved already in the memmap. For those pages, we can
 marked
> call free_bootmem_page() to free it to buddy allocator.
> 
> Becasue we wan to free some vmemmap pages of the HugeTLB to the buddy
Because want
> allocator, we can use this helper to do that in the later patchs.
   patches

To be honest, I think it would be best to introduce this along with
patch#4, so we get to see where it gets used.

> Signed-off-by: Muchun Song 
> ---
>  include/linux/bootmem_info.h | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
> index 4ed6dee1adc9..20a8b0df0c39 100644
> --- a/include/linux/bootmem_info.h
> +++ b/include/linux/bootmem_info.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_BOOTMEM_INFO_H
>  
>  #include 
> +#include 

 already includes 

> +static inline void free_bootmem_page(struct page *page)
> +{
> + unsigned long magic = (unsigned long)page->freelist;
> +
> + /* bootmem page has reserved flag in the reserve_bootmem_region */
reserve_bootmem_region sets the reserved flag on bootmem pages?

> + VM_WARN_ON(!PageReserved(page) || page_ref_count(page) != 2);

We do check for PageReserved in patch#4 before calling in here.
Do we need yet another check here? IOW, do we need to be this paranoid?

> + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
> + put_page_bootmem(page);
> + else
> + WARN_ON(1);

Lately, some people have been complaining about using WARN_ON as some
systems come with panic_on_warn set.

I would say that in this case it does not matter much, because if the vmemmap
pages are neither SECTION_INFO nor MIX_SECTION_INFO, it means that a
larger corruption happened elsewhere.

But I think I would align the checks here.
It does not make sense to me to only scream under DEBUG_VM if page's
refcount differs from 2, and have a WARN_ON if the page we are trying
to free was not used for the memmap array.
Both things imply a corruption, so I would set the checks under the same
configurations.
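
IOW, something like this (just to illustrate; whether these end up being
WARN_ONs or VM_WARN_ONs is secondary, as long as they are aligned):

        VM_WARN_ON(!PageReserved(page) || page_ref_count(page) != 2);
        VM_WARN_ON(magic != SECTION_INFO && magic != MIX_SECTION_INFO);

        if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
                put_page_bootmem(page);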

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 04/12] mm/hugetlb: Free the vmemmap pages associated with each HugeTLB page

2020-12-10 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 03:42:56PM +0100, Oscar Salvador wrote:
> On Thu, Dec 10, 2020 at 11:55:18AM +0800, Muchun Song wrote:
> > The free_vmemmap_pages_per_hpage() which indicate that how many vmemmap
> > pages associated with a HugeTLB page that can be freed to the buddy
> > allocator just returns zero now, because all infrastructure is not
> > ready. Once all the infrastructure is ready, we will rework this
> > function to support the feature.
> 
> I would reword the above to:
> 
> "free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
>  pages associated with a HugeTLB page can be freed, returns zero for
>  now, which means the feature is disabled.
>  We will enable it once all the infrastructure is there."
> 
>  Or something along those lines.
> 
> > Signed-off-by: Muchun Song 
> 
> Overall this looks good to me, and it has seen a considerable
> simplification, which is good.
> Some nits/questions below:

And as I said, I would merge patch#3 with this one.


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 04/12] mm/hugetlb: Free the vmemmap pages associated with each HugeTLB page

2020-12-10 Thread Oscar Salvador
On Thu, Dec 10, 2020 at 11:55:18AM +0800, Muchun Song wrote:
> The free_vmemmap_pages_per_hpage() which indicate that how many vmemmap
> pages associated with a HugeTLB page that can be freed to the buddy
> allocator just returns zero now, because all infrastructure is not
> ready. Once all the infrastructure is ready, we will rework this
> function to support the feature.

I would reword the above to:

"free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
 pages associated with a HugeTLB page can be freed, returns zero for
 now, which means the feature is disabled.
 We will enable it once all the infrastructure is there."

 Or something along those lines.

> Signed-off-by: Muchun Song 

Overall this looks good to me, and it has seen a considerable
simplification, which is good.
Some nits/questions below:


> +#define vmemmap_hpage_addr_end(addr, end) \
> +({\
> + unsigned long __boundary;\
> + __boundary = ((addr) + VMEMMAP_HPAGE_SIZE) & VMEMMAP_HPAGE_MASK; \
> + (__boundary - 1 < (end) - 1) ? __boundary : (end);   \
> +})

Maybe add a little comment explaining what you are trying to get here.

> +/*
> + * Walk a vmemmap address to the pmd it maps.
> + */
> +static pmd_t *vmemmap_to_pmd(unsigned long addr)
> +{
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> +
> + pgd = pgd_offset_k(addr);
> + if (pgd_none(*pgd))
> + return NULL;
> +
> + p4d = p4d_offset(pgd, addr);
> + if (p4d_none(*p4d))
> + return NULL;
> +
> + pud = pud_offset(p4d, addr);
> + if (pud_none(*pud))
> + return NULL;
> +
> + pmd = pmd_offset(pud, addr);
> + if (pmd_none(*pmd))
> + return NULL;
> +
> + return pmd;
> +}

I saw that some people suggested putting all the non-hugetlb vmemmap
functions under sparse-vmemmap.c, which makes some sense if some
feature is going to re-use this code somehow. (I am not sure if the
recent patches that take advantage of this feature for ZONE_DEVICE need
something like this).

I do not have a strong opinion on this though.

> +static void vmemmap_reuse_pte_range(struct page *reuse, pte_t *pte,
> + unsigned long start, unsigned long end,
> + struct list_head *vmemmap_pages)
> +{
> + /*
> +  * Make the tail pages are mapped with read-only to catch
> +  * illegal write operation to the tail pages.
> +  */
> + pgprot_t pgprot = PAGE_KERNEL_RO;
> + pte_t entry = mk_pte(reuse, pgprot);
> + unsigned long addr;
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE, pte++) {
> + struct page *page;
> +
> + VM_BUG_ON(pte_none(*pte));

If it is none, page will be NULL and we will crash in the list_add
below?

> +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> + struct list_head *vmemmap_pages)
> +{
> + pmd_t *pmd;
> + unsigned long next, addr = start;
> + struct page *reuse = NULL;
> +
> + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> + VM_BUG_ON((start >> PUD_SHIFT) != (end >> PUD_SHIFT));
This last VM_BUG_ON is there to check that both fall under the same PUD table?

> +
> + pmd = vmemmap_to_pmd(addr);
> + BUG_ON(!pmd);

What is the criterion you followed to make this a BUG_ON, while the check
in vmemmap_reuse_pte_range is a VM_BUG_ON?

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v11 10/11] mm/hugetlb: Gather discrete indexes of tail page

2020-12-22 Thread Oscar Salvador
On Tue, Dec 22, 2020 at 10:24:39PM +0800, Muchun Song wrote:
> +#else
> +static inline void hwpoison_subpage_deliver(struct hstate *h, struct page 
> *head)
> +{
> +}
> +
> +static inline void hwpoison_subpage_set(struct hstate *h, struct page *head,
> + struct page *page)
> +{
> + if (PageHWPoison(head) && page != head) {
> + /*
> +  * Move PageHWPoison flag from head page to the raw error page,
> +  * which makes any subpages rather than the error page reusable.
> +  */
> + SetPageHWPoison(page);
> + ClearPageHWPoison(head);
> + }
> +}
> +#endif

Sorry, I guess I should have made a clearer statement.
This patch should really be about changing the numeric indexes to their symbolic
names, so the #ifdef handling of hwpoison_subpage_* should have gone into
patch#6.

I will have a closer look later on though.

-- 
Oscar Salvador
SUSE L3


  1   2   3   4   5   6   7   8   9   10   >