[PATCH 1/1] powerpc: fix off by one in max_zone_pfn initialization for ZONE_DMA

2019-06-25 Thread Andrea Arcangeli
Commit 25078dc1f74be16b858e914f52cc8f4d03c2271a first introduced an
off-by-one error in the ZONE_DMA initialization for PPC_BOOK3E_64=y,
and since 9739ab7eda459f0669ec9807e0d9be5020bab88c the off-by-one
applies to PPC32=y too. This simply corrects the off-by-one and should
resolve crashes like the one below:

[   65.179101] page 0x7fff outside node 0 zone DMA [ 0x0 - 0x7fff ]

Unfortunately in various MM places "max" means a non-inclusive end of
range: the max_zone_pfn parameter of free_area_init_nodes() is one
case, and MAX_ORDER is another (unrelated) one that comes to mind.

Reported-by: Zorro Lang 
Fixes: 25078dc1f74b ("powerpc: use mm zones more sensibly")
Fixes: 9739ab7eda45 ("powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac")
Signed-off-by: Andrea Arcangeli 
---
 arch/powerpc/mm/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 2540d3b2588c..2eda1ec36f55 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -249,7 +249,7 @@ void __init paging_init(void)
 
 #ifdef CONFIG_ZONE_DMA
max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
-   ((1UL << ARCH_ZONE_DMA_BITS) - 1) >> PAGE_SHIFT);
+ 1UL << (ARCH_ZONE_DMA_BITS - PAGE_SHIFT));
 #endif
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
 #ifdef CONFIG_HIGHMEM
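
For reference, the off-by-one can be seen with a standalone arithmetic
check of the two expressions (a sketch only: it assumes
ARCH_ZONE_DMA_BITS == 31 and PAGE_SHIFT == 12 purely for illustration,
the real values depend on the config):

#include <stdio.h>

/* illustrative values, not taken from any particular config */
#define ARCH_ZONE_DMA_BITS 31
#define PAGE_SHIFT 12

int main(void)
{
        unsigned long old_pfn = ((1UL << ARCH_ZONE_DMA_BITS) - 1) >> PAGE_SHIFT;
        unsigned long new_pfn = 1UL << (ARCH_ZONE_DMA_BITS - PAGE_SHIFT);

        /* old_pfn = 0x7ffff: the last valid pfn, i.e. an inclusive end */
        /* new_pfn = 0x80000: one past the last pfn, what max_zone_pfn expects */
        printf("old=%#lx new=%#lx\n", old_pfn, new_pfn);
        return 0;
}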


Re: [PATCH] powerpc/powernv/npu: Remove redundant change_pte() hook

2019-01-31 Thread Andrea Arcangeli
On Thu, Jan 31, 2019 at 06:30:22PM +0800, Peter Xu wrote:
> The change_pte() notifier was designed to use as a quick path to
> update secondary MMU PTEs on write permission changes or PFN changes.
> For KVM, it could reduce the vm-exits when vcpu faults on the pages
> that was touched up by KSM.  It's not used to do cache invalidations,
> for example, if we see the notifier will be called before the real PTE
> update after all (please see set_pte_at_notify that set_pte_at was
> called later).
> 
> All the necessary cache invalidation should all be done in
> invalidate_range() already.
> 
> CC: Benjamin Herrenschmidt 
> CC: Paul Mackerras 
> CC: Michael Ellerman 
> CC: Alistair Popple 
> CC: Alexey Kardashevskiy 
> CC: Mark Hairgrove 
> CC: Balbir Singh 
> CC: David Gibson 
> CC: Andrea Arcangeli 
> CC: Jerome Glisse 
> CC: Jason Wang 
> CC: linuxppc-dev@lists.ozlabs.org
> CC: linux-ker...@vger.kernel.org
> Signed-off-by: Peter Xu 
> ---
>  arch/powerpc/platforms/powernv/npu-dma.c | 10 --
>  1 file changed, 10 deletions(-)

Reviewed-by: Andrea Arcangeli 

It doesn't make sense to implement change_pte as an invalidate:
change_pte is not compulsory to implement, so if one only wants
invalidates, the change_pte method shouldn't be implemented in the
first place and the common code will guarantee to invoke the range
invalidates instead.

Currently the whole change_pte optimization is effectively disabled,
as noted in past discussions with Jerome (because of the range
invalidates that always surround it), so we need to revisit the whole
change_pte logic and decide whether to re-enable it or to drop it
altogether, but in the meantime it's good to clean up spots like the
one below that should leave change_pte alone.

There are several examples of mmu_notifier_ops in the kernel that
don't implement change_pte; in fact it's the majority. Of all mmu
notifier users, only nv_nmmu_notifier_ops, intel_mmuops_change and
kvm_mmu_notifier_ops implement change_pte, and as Peter found out by
source review, nv_nmmu_notifier_ops and intel_mmuops_change are wrong
about it and should stop implementing it as an invalidate.

In short, change_pte is only implemented correctly by KVM, which can
really update the spte and flush the TLB, but the spte update remains
and could avoid a vmexit if we figure out how to re-enable the
optimization safely (the TLB fill after change_pte in the KVM
EPT/shadow secondary MMU will be looked up by the CPU in hardware).

If change_pte is implemented, it should update the mapping like KVM
does and not do an invalidate.
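
To illustrate, a minimal sketch of what that looks like (hypothetical
driver, not from this patch): an mmu notifier user that only needs
invalidations simply leaves .change_pte unset and relies on the range
invalidates the core code emits anyway:

#include <linux/mmu_notifier.h>

static void foo_mn_invalidate_range(struct mmu_notifier *mn,
                                    struct mm_struct *mm,
                                    unsigned long start,
                                    unsigned long end)
{
        /* tear down the secondary MMU mappings covering [start, end) */
}

static const struct mmu_notifier_ops foo_mn_ops = {
        .invalidate_range = foo_mn_invalidate_range,
        /*
         * .change_pte deliberately left unset: the common code then
         * guarantees the range invalidates are invoked instead.
         */
};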

Thanks,
Andrea

> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index 3f58c7dbd581..c003b29d870e 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -917,15 +917,6 @@ static void pnv_npu2_mn_release(struct mmu_notifier *mn,
>   mmio_invalidate(npu_context, 0, ~0UL);
>  }
>  
> -static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
> - struct mm_struct *mm,
> - unsigned long address,
> - pte_t pte)
> -{
> - struct npu_context *npu_context = mn_to_npu_context(mn);
> - mmio_invalidate(npu_context, address, PAGE_SIZE);
> -}
> -
>  static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
>   struct mm_struct *mm,
>   unsigned long start, unsigned long end)
> @@ -936,7 +927,6 @@ static void pnv_npu2_mn_invalidate_range(struct 
> mmu_notifier *mn,
>  
>  static const struct mmu_notifier_ops nv_nmmu_notifier_ops = {
>   .release = pnv_npu2_mn_release,
> - .change_pte = pnv_npu2_mn_change_pte,
>   .invalidate_range = pnv_npu2_mn_invalidate_range,
>  };
>  
> -- 
> 2.17.1
> 


Re: [PATCH V6 1/4] mm/cma: Add PF flag to force non cma alloc

2019-01-08 Thread Andrea Arcangeli
On Tue, Jan 08, 2019 at 10:21:07AM +0530, Aneesh Kumar K.V wrote:
> This patch add PF_MEMALLOC_NOCMA which make sure any allocation in that 
> context
> is marked non movable and hence cannot be satisfied by CMA region.
> 
> This is useful with get_user_pages_cma_migrate where we take a page pin by
> migrating pages from CMA region. Marking the section PF_MEMALLOC_NOCMA ensures
> that we avoid uncessary page migration later.
> 
> Suggested-by: Andrea Arcangeli 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Andrea Arcangeli 


Re: [PATCH V6 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get

2019-01-08 Thread Andrea Arcangeli
Hello,

On Tue, Jan 08, 2019 at 10:21:09AM +0530, Aneesh Kumar K.V wrote:
> @@ -187,41 +149,25 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   goto unlock_exit;
>   }
>  
> + ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);

In terms of gup APIs, I've been wondering if this should become
get_user_pages_longterm(FOLL_CMA_MIGRATE), i.e. basically moving this
CMA migrate logic inside get_user_pages_longterm().
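
Roughly, the call site above would then collapse to something like
this (a sketch only: FOLL_CMA_MIGRATE is a hypothetical gup flag, it
doesn't exist today):

/* hypothetical: CMA migration folded into gup itself */
ret = get_user_pages_longterm(ua, entries,
                              FOLL_WRITE | FOLL_CMA_MIGRATE,
                              mem->hpages, NULL);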

It depends on whether powerpc will ever need to bail on dax, and/or
whether other non-powerpc vfio drivers that are already bailing on dax
may also later optionally need to avoid interfering with CMA.

Aside from the API detail above, this CMA page migration logic seems a
good solution for the problem.

Thanks,
Andrea


Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-11-02 Thread Andrea Arcangeli
On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote:
> I think there is some memory barrier missing when the VMA is modified so
> currently the modifications done in the VMA structure may not be written
> down at the time the pte is locked. So doing that change will also requires
> to call smp_wmb() before locking the page tables. In the current patch this
> is ensured by the call to write_seqcount_end().
> Doing so will still require to have a memory barrier when touching the VMA.
> Not sure we get far better performance compared to the sequence count
> change. But I'll give it a try anyway ;)

Luckily smp_wmb is a noop on x86. I would suggest ignoring the above
issue completely if you give it a try, and then if this performs, we
can just embed a smp_wmb() before spin_lock() somewhere in
pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose
spin_lock isn't a smp_wmb() equivalent. I would focus on flushing
writes before every pagetable spin_lock for non-x86 archs, rather than
after all vma modifications. That should be easier to keep under
control and it's going to be more efficient too, as if anything there
are fewer spin locks than vma modifications.

For non-x86 archs we may then need a smp_wmb__before_spin_lock. That
looks more self-contained than surrounding all vma modifications and
it's a noop on x86 anyway.
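
Something along these lines (a sketch, using the hypothetical name
above; it would be invoked right before the pagetable spin_lock, e.g.
from pte_offset_map_lock/pte_lockptr):

/* generic fallback: flush prior vma stores before the PT spin_lock */
#ifndef smp_wmb__before_spin_lock
# define smp_wmb__before_spin_lock()    smp_wmb()
#endif

/* x86 (spin_lock already orders prior stores) would override it: */
/* #define smp_wmb__before_spin_lock()  barrier() */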

I thought about the contention detection logic too yesterday: to
detect contention we could have a mm->mmap_sem_contention_jiffies, and
if down_read_trylock_exclusive() [same as down_read_if_not_hold in the
prev mail] fails (and it'll fail if either the read or write mmap_sem
is held, so also covering mremap/mprotect etc..) we set
mm->mmap_sem_contention_jiffies = jiffies. Then, to know if you must
not touch the mmap_sem at all, you compare jiffies against
mmap_sem_contention_jiffies: if it's equal we go speculative. If
that's not enough we can just keep going speculative for a few more
jiffies with time_before(). The srcu lock is not a concern because the
inc/dec of the fast path is in a per-cpu cacheline of course, no false
sharing possible there or it wouldn't be any better than a normal lock.
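
In code the trigger would look roughly like this (a sketch only:
down_read_trylock_exclusive() and mmap_sem_contention_jiffies are the
hypothetical names from this thread, not existing kernel APIs):

static bool mmap_sem_take_or_mark_contended(struct mm_struct *mm)
{
        if (down_read_trylock_exclusive(&mm->mmap_sem))
                return true;    /* read lock taken: classic locked fault */
        /* somebody (reader or writer) holds it: remember when */
        WRITE_ONCE(mm->mmap_sem_contention_jiffies, jiffies);
        return false;           /* caller goes speculative */
}

static bool mmap_sem_recently_contended(struct mm_struct *mm)
{
        /* the "equal" check, optionally widened with time_before() */
        return READ_ONCE(mm->mmap_sem_contention_jiffies) == jiffies;
}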

The vma revalidation is already done by khugepaged and mm/userfaultfd,
both need to drop the mmap_sem and continue working on the pagetables,
so we already know it's workable and not too slow.

Summarizing: by using a runtime-contention-triggered speculative
design that goes speculative only when contention is runtime-detected
using the above logic (or equivalent), and by having to revalidate the
vma by hand with find_vma without knowing instantly if the vma became
stale, we will run with a substantially slower speculative page fault
than with your current speculative always-on design, but the slower
speculative page fault runtime will still scale 100% in SMP so it
should still be faster on large SMP systems. The pro is that it won't
regress the mmap/brk vma modifications. The whole complexity of
tracking the vma modifications should also go away and the resulting
code should be more maintainable and less risky to break in subtle
ways that are impossible to reproduce.

Thanks!
Andrea


Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-10-26 Thread Andrea Arcangeli
Hello Laurent,

Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea...@intel.com> shows a
significant slowdown even for brk/malloc ops, both single and
multi threaded.

The single threaded case I think is the most important because it has
zero chance of getting back any benefit later during page faults.

Could you check if:

1. it's possible to change vm_write_begin to be a noop if mm->mm_count is
   <= 1? Hint: clone() will run single threaded so there's no way it can run
   in the middle of a begin/end critical section (clone could set an
   MMF flag to possibly keep the sequence counter activated if a child
   thread exits and mm_count drops to 1 while the other cpu is in the
   middle of a critical section in the other thread).

2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
   freeing happened only once a MMF flag is set? That will at least
   reduce the risk of temporary memory waste until the next RCU grace
   period. The read of the MMF will scale fine. Of course to allow
   point 1 and 2 then the page fault should also take the mmap_sem
   until the MMF flag is set.

Could you also investigate a much bigger change: I wonder if it's
possible to drop the sequence number from the vma and stop using
sequence numbers entirely (which is likely the source of the single
threaded regression in point 1 that may explain the report in the
above message-id), and just call the vma rbtree lookup once again and
check that everything is still the same in the vma and that the PT
lock obtained is still a match, to finish the anon page fault and fill
the pte?

Then of course we also need to add a method to the read-write
semaphore so it tells us if there's already one user holding the read
mmap_sem and we're the second one.  If we're the second one (or more
than second) only then we should skip taking the down_read mmap_sem.
Even a multithreaded app won't ever skip taking the mmap_sem until
there's sign of runtime contention, and it won't have to run the way
more expensive sequence number-less revalidation during page faults,
unless we get an immediate scalability payoff because we already know
the mmap_sem is already contended and there are multiple nested
threads in the page fault handler of the same mm.

Perhaps we'd need something more advanced than a
down_read_trylock_if_not_hold() (which has to be guaranteed not to
write to any cacheline) and we'll have to count the per-thread
exponential backoff of mmap_sem frequency, but starting with
down_read_trylock_if_not_hold() would be good I think.
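
As a first approximation it could look like this (a sketch only,
assuming the hypothetical name above; the point is that the contended
path never dirties the semaphore cacheline):

static inline bool down_read_trylock_if_not_hold(struct rw_semaphore *sem)
{
        if (rwsem_is_locked(sem))
                return false;   /* already held: caller skips the mmap_sem */
        return down_read_trylock(sem);
}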

This is not how the current patch works: the current patch uses a
sequence number because it pretends to go lockless always, and in turn
it has to slow down all vma update fast paths, or the revalidation
slows down the page fault performance too much (as it always
revalidates).

I think it would be much better to go speculative only when there's
"detected" runtime contention on the mmap_sem with
down_read_trylock_if_not_hold(), and that will make the revalidation
cost not an issue to worry about because normally we won't have to
revalidate the vma at all during the page fault. In turn, by making
the revalidation more expensive by starting a vma rbtree lookup from
scratch, we can drop the sequence number entirely and that should
simplify the patch tremendously, because all vm_write_begin/end would
disappear from the patch and in turn the mmap/brk slowdown measured by
the message-id above should disappear as well.
   
Thanks,
Andrea


Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Andrea Arcangeli
Hello Jerome,

On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> +Case A is obvious you do not want to take the risk for the device to write to
> +a page that might now be use by some completely different task.

used

> +is true ven if the thread doing the page table update is preempted right 
> after

even

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 90731e3b7e58..5706252b828a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct 
> vm_fault *vmf, pmd_t orig_pmd,
>   goto out_free_pages;
>   VM_BUG_ON_PAGE(!PageHead(page), page);
>  
> + /*
> +  * Leave pmd empty until pte is filled note we must notify here as
> +  * concurrent CPU thread might write to new page before the call to
> +  * mmu_notifier_invalidate_range_end() happen which can lead to a

happens

> +  * device seeing memory write in different order than CPU.
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> - /* leave pmd empty until pte is filled */
>  

Here we can change the following mmu_notifier_invalidate_range_end to
skip calling ->invalidate_range. It could be called
mmu_notifier_invalidate_range_only_end, or other suggestions
welcome. Otherwise we'll repeat the call for nothing.

We need it inside the PT lock for the ordering issue, but we don't
need to run it twice.

Same in do_huge_pmd_wp_page, wp_page_copy and
migrate_vma_insert_page. Every time *clear_flush_notify is used
mmu_notifier_invalidate_range_only_end should be called after it,
instead of mmu_notifier_invalidate_range_end.

I think optimizing that part too fits in the context of this patchset
(if not in the same patch), because the objective is still the same:
to remove unnecessary ->invalidate_range calls.
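
In other words every *clear_flush_notify user would end up looking
roughly like this (a sketch; mmu_notifier_invalidate_range_only_end is
the name proposed above, not an existing helper):

pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
/* ... fill the new ptes under the same PT lock ... */
spin_unlock(vmf->ptl);
/*
 * The _notify variant above already invoked ->invalidate_range
 * under the PT lock, so skip the duplicate call here and only
 * run ->invalidate_range_end:
 */
mmu_notifier_invalidate_range_only_end(vma->vm_mm, mmun_start,
                                       mmun_end);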

> +  * No need to notify as we downgrading page

are

> +  * table protection not changing it to point
> +  * to a new page.
> +  *

> +  * No need to notify as we downgrading page table to read only

are

> +  * No need to notify as we replacing a read only page with another

are

> @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, 
> struct vm_area_struct *vma,
>   if (pte_soft_dirty(pteval))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
>   set_pte_at(mm, address, pvmw.pte, swp_pte);
> - } else
> + } else {
> + /*
> +  * We should not need to notify here as we reach this
> +  * case only from freeze_page() itself only call from
> +  * split_huge_page_to_list() so everything below must
> +  * be true:
> +  *   - page is not anonymous
> +  *   - page is locked
> +  *
> +  * So as it is a shared page and it is locked, it can
> +  * not be remove from the page cache and replace by
> +  * a new page before mmu_notifier_invalidate_range_end
> +  * so no concurrent thread might update its page table
> +  * to point at new page while a device still is using
> +  * this page.
> +  *
> +  * But we can not assume that new user of try_to_unmap
> +  * will have that in mind so just to be safe here call
> +  * mmu_notifier_invalidate_range()
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   dec_mm_counter(mm, mm_counter_file(page));
> + mmu_notifier_invalidate_range(mm, address,
> +   address + PAGE_SIZE);
> + }
>  discard:
> + /*
> +  * No need to call mmu_notifier_invalidate_range() as we are
> +  * either replacing a present pte with non present one (either
> +  * a swap or special one). We handling the clearing pte case
> +  * above.
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   page_remove_rmap(subpage, PageHuge(page));
>   put_page(page);
> - mmu_notifier_invalidate_range(mm, address,
> -   address + PAGE_SIZE);
>   }
>  
>   mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);

That is the path that unmaps filebacked pages (btw, not necessarily
shared unlike the comment says, they can be private but still filebacked).

I'd like some more explanation about the inner working of "that new

Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback v2

2017-09-02 Thread Andrea Arcangeli
On Thu, Aug 31, 2017 at 05:17:25PM -0400, Jerome Glisse wrote:
> Jérôme Glisse (13):
>   dax: update to new mmu_notifier semantic
>   mm/rmap: update to new mmu_notifier semantic
>   powerpc/powernv: update to new mmu_notifier semantic
>   drm/amdgpu: update to new mmu_notifier semantic
>   IB/umem: update to new mmu_notifier semantic
>   IB/hfi1: update to new mmu_notifier semantic
>   iommu/amd: update to new mmu_notifier semantic
>   iommu/intel: update to new mmu_notifier semantic
>   misc/mic/scif: update to new mmu_notifier semantic
>   sgi-gru: update to new mmu_notifier semantic
>   xen/gntdev: update to new mmu_notifier semantic
>   KVM: update to new mmu_notifier semantic
>   mm/mmu_notifier: kill invalidate_page

Reviewed-by: Andrea Arcangeli <aarca...@redhat.com>


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-27 Thread Andrea Arcangeli
Hi Aneesh,

On Mon, Oct 27, 2014 at 11:28:41PM +0530, Aneesh Kumar K.V wrote:
   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
   if (pmd_trans_huge(*pmdp)) {
   pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
   } else {

The only problematic path that needs an IPI is the one below, yes.

   /*
* khugepaged calls this for normal pmd
*/
   pmd = *pmdp;
   pmd_clear(pmdp);
   /*
* Wait for all pending hash_page to finish. This is needed
* in case of subpage collapse. When we collapse normal pages
* to hugepage, we first clear the pmd, then invalidate all
* the PTE entries. The assumption here is that any low level
* page fault will see a none pmd and take the slow path that
* will wait on mmap_sem. But we could very well be in a
* hash_page with local ptep pointer value. Such a hash page
* can result in adding new HPTE entries for normal subpages.
* That means we could be modifying the page content as we
* copy them to a huge page. So wait for parallel hash_page
* to finish before invalidating HPTE entries. We can do this
* by sending an IPI to all the cpus and executing a dummy
* function there.
*/
   kick_all_cpus_sync();

 We already do an IPI for ppc64.

Agreed, ppc64 is already covered.

sparc/arm seem to be using the generic pmdp_clear_flush implementation
instead, which just calls flush_tlb_range, so perhaps they aren't.

As above, the IPIs are only needed if the *pmd is not transhuge.

Thanks,
Andrea

Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-26 Thread Andrea Arcangeli
Hello,

On Mon, Oct 27, 2014 at 07:50:41AM +1100, Benjamin Herrenschmidt wrote:
 On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote:
 
  Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
  to all the CPUs on the inter-CPU bus.  The next instruction isn't
  executed until they respond.
  
  But this is only for our CPU TLB.  There's no other external
  consequence, so removal from the page tables isn't effected by this TLB
  flush, therefore the theory on which Dave bases the change to
  atomic_add() should work for us (of course, atomic_add is lock add
  unlock on our CPU, so it's not going to be of much benefit).
 
 I'm not sure I follow you here.
 
 Do you or do you now perform an IPI to do TLB flushes ? If you don't
 (for example because you have HW broadcast), then you need the
 speculative get_page(). If you do (and can read a PTE atomically), you
 can get away with atomic_add().
 
 The reason is that if you remember how zap_pte_range works, we perform
 the flush before we get rid of the page.
 
 So if your using IPIs for the flush, the fact that gup_fast has
 interrupts disabled will delay the IPI response and thus effectively
 prevent the pages from being actually freed, allowing us to simply do
 the atomic_add() on x86.
 
 But if we don't use IPIs because we have HW broadcast of TLB
 invalidations, then we don't have that synchronization. atomic_add won't
 work, we need get_page_speculative() because the page could be
 concurrently being freed.

I looked at how this works more closely and I agree
get_page_unless_zero is always necessary if the TLB flush doesn't
always wait for IPIs to all CPUs where a gup_fast may be running.

To summarize, the pagetables are freed with RCU (the arch sets
HAVE_RCU_TABLE_FREE) and that allows walking them locklessly with RCU.

After we can walk the pagetables locklessly with RCU, we get to the
page locklessly, but the pages themselves can still be freed at any
time from under us (hence the need for get_page_unless_zero).

The additional trick gup_fast RCU does is to recheck the pte after
elevating the page count with get_page_unless_zero. Rechecking the
pte/hugepmd to be sure it didn't change from under us is critical to
be sure get_page_unless_zero didn't run after the page was freed and
reallocated which would otherwise lead to a security problem too
(i.e. it protects against get_page_unless_zero false positives).
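
The pattern is roughly the following (a sketch of the idea, not the
exact gup code):

pte_t pte = ACCESS_ONCE(*ptep);
struct page *page = pte_page(pte);

if (!get_page_unless_zero(page))
        return 0;               /* page was (being) freed */

if (unlikely(pte_val(*ptep) != pte_val(pte))) {
        /* pte changed under us: the pin may be on a reused page */
        put_page(page);
        return 0;
}
/* the pin is valid: the page can't have been freed and reallocated */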

The last bit required is to still disable irqs like on x86 to
serialize against THP splits, combined with pmdp_splitting_flush
always delivering IPIs (pmdp_splitting_flush must wait for all
gup_fast to complete before proceeding in mangling the page struct of
the compound page).

Preventing the irq disable while taking a gup_fast pin using
compound_lock isn't as easy as it is to do for put_page. put_page
(non-compound) fastest path remains THP agnostic because
collapse_huge_page is inhibited by any existing gup pin, but here
we're exactly taking it, so we can't depend on it to already exist to
avoid the race with collapse_huge_page. It's not just split_huge_page
we need to protect against.

So while thinking through the above summary, I noticed this patch
misses an IPI in mm/huge_memory.c that must be delivered after
pmdp_clear_flush below to be safe against collapse_huge_page, for the
same reasons it sends it within pmdp_splitting_flush. Without this IPI
what can happen is that the GUP pin protection in
__collapse_huge_page_isolate races against gup_fast-RCU.

If gup_fast reads the pte on one CPU before pmdp_clear_flush, and on
the other CPU __collapse_huge_page_isolate succeeds, then gup_fast
could recheck the pte that hasn't been zapped yet by
__collapse_huge_page_copy. gup_fast would succeed because the pte
wasn't zapped yet, but then __collapse_huge_page_copy would run
replacing the pte with a transhuge pmd, making gup_fast return the old
page, while the process got the copy as part of the collapsed hugepage.

/*
 * After this gup_fast can't run anymore. This also removes
                 ^ - invariant broken by gup_fast-RCU
 * any huge TLB entry from the CPU so we won't allow
 * huge and small TLB entries for the same virtual address
 * to avoid the risk of CPU bugs in that area.
 */
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
spin_unlock(pte_ptl);

CPU0                                    CPU1
----                                    ----
gup_fast-RCU
local_irq_disable()
pte = pte_offset_map(pmd, address)
                                        pmdp_clear_flush (not sending IPI - bug)

                                        __collapse_huge_page_isolate - succeeds

                                        (page_count != 1 gup-pin check of

Re: Error in frreing hugepages with preemption enabled

2013-12-03 Thread Andrea Arcangeli
Hi everyone,

On Fri, Nov 29, 2013 at 12:13:03PM +0100, Alexander Graf wrote:
 
 On 29.11.2013, at 05:38, Bharat Bhushan bharat.bhus...@freescale.com wrote:
 
  Hi Alex,
  
  I am running KVM guest with host kernel having CONFIG_PREEMPT enabled. With 
  allocated pages things seems to work fine but I uses hugepages for guest I 
  see below prints when quit from qemu.
  
  (qemu) QEMU waiting for connection on: telnet:0.0.0.0:,server
  qemu-system-ppc64: pci_add_option_rom: failed to find romfile 
  efi-virtio.rom
  q
  debug_smp_processor_id: 15 callbacks suppressed
  BUG: using smp_processor_id() in preemptible [] code: 
  qemu-system-ppc/2504
  caller is .free_hugepd_range+0xb0/0x21c
  CPU: 1 PID: 2504 Comm: qemu-system-ppc Not tainted 
  3.12.0-rc3-07733-gabf4907 #175
  Call Trace:
  [c000fb433400] [c0007d38] .show_stack+0x7c/0x1cc (unreliable)
  [c000fb4334d0] [c05e8ce0] .dump_stack+0x9c/0xf4
  [c000fb433560] [c02de5ec] .debug_smp_processor_id+0x108/0x11c
  [c000fb4335f0] [c0025e10] .free_hugepd_range+0xb0/0x21c
  [c000fb433680] [c00265bc] .hugetlb_free_pgd_range+0x2c8/0x3b0
  [c000fb4337a0] [c00e428c] .free_pgtables+0x14c/0x158
  [c000fb433840] [c00ef320] .exit_mmap+0xec/0x194
  [c000fb433960] [c004d780] .mmput+0x64/0x124
  [c000fb4339e0] [c0051f40] .do_exit+0x29c/0x9c8
  [c000fb433ae0] [c00527c8] .do_group_exit+0x50/0xc4
  [c000fb433b70] [c00606a0] .get_signal_to_deliver+0x21c/0x5d8
  [c000fb433c70] [c0009b08] .do_signal+0x54/0x278
  [c000fb433db0] [c0009e50] .do_notify_resume+0x64/0x78
  [c000fb433e30] [cb44] .ret_from_except_lite+0x70/0x74
  
  
  This mean that free_hugepd_range() must be called with preemption enabled.
 
 with preemption disabled.
 
  I tried below change and this seems to work fine (I am not having expertise 
  in this area so not sure this is correct way)
 
 Not sure - the scope looks odd to me. Let's ask Andrea - I'm sure he knows 
 what to do :).

:) So I had a look at the top of this function (0xb0) in the upstream
kernel and no smp_processor_id() call is apparent; is this stock git
or a ppc tree? The first few calls seem not to call it, but I may have
overlooked something. It's just quicker if somebody with the vmlinux
finds the location of it.

static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp,
                              int pdshift, unsigned long start,
                              unsigned long end, unsigned long floor,
                              unsigned long ceiling)
{
        pte_t *hugepte = hugepd_page(*hpdp);
        int i;

        unsigned long pdmask = ~((1UL << pdshift) - 1);
        unsigned int num_hugepd = 1;

#ifdef CONFIG_PPC_FSL_BOOK3E
        /* Note: On fsl the hpdp may be the first of several */
        num_hugepd = (1 << (hugepd_shift(*hpdp) - pdshift));
#else
        unsigned int shift = hugepd_shift(*hpdp);
#endif

        start &= pdmask;
        if (start < floor)
                return;
        if (ceiling) {
                ceiling &= pdmask;
                if (! ceiling)
                        return;
        }
        if (end - 1 > ceiling - 1)
                return;

        for (i = 0; i < num_hugepd; i++, hpdp++)
                hpdp->pd = 0;

        tlb->need_flush = 1;

#ifdef CONFIG_PPC_FSL_BOOK3E
        hugepd_free(tlb, hugepte);
#else
        pgtable_free_tlb(tlb, hugepte, pdshift - shift);
#endif
}

Generally smp_processor_id should never be used, exactly to avoid
problems like the above with preemption enabled in .config.

Instead it should be replaced with a get_cpu()/put_cpu() pair that is
exactly meant to fix bugs like this and define proper critical
sections around the per-cpu variables.

#define get_cpu()   ({ preempt_disable(); smp_processor_id(); })
#define put_cpu()   preempt_enable()

After you find where that smp_processor_id() is located, you should
simply replace it with a get_cpu() and then insert a put_cpu()
immediately after the cpu info is no longer used. That will define a
proper and strict critical section around the use of the per-cpu
variables.
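
Something like this, in other words (a sketch of the pattern around
whatever per-cpu access is in there):

int cpu = get_cpu();    /* disables preemption */
/* ... use the per-cpu state that needed smp_processor_id() ... */
put_cpu();              /* re-enables preemption */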

With a ppc vmlinux it should be immediate to find the location of
smp_processor_id but I don't have the ppc vmlinux here.

Thanks!
Andrea

 
 
 Alex
 
  
  diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
  index d67db4b..6bf8459 100644
  --- a/arch/powerpc/mm/hugetlbpage.c
  +++ b/arch/powerpc/mm/hugetlbpage.c
  @@ -563,8 +563,10 @@ static void hugetlb_free_pmd_range(struct mmu_gather 
  *tlb, pud_t *pud,
  */
  next = addr + (1 << hugepd_shift(*(hugepd_t *)pmd));
  #endif
  +   preempt_disable();
 free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
   addr, next, floor, ceiling);
  +   preempt_enable();
 } while (addr = next, addr != end);
  

Re: [PATCH -V6 18/27] mm/THP: withdraw the pgtable after pmdp related operations

2013-04-24 Thread Andrea Arcangeli
Hi,

On Wed, Apr 24, 2013 at 02:38:01PM +0530, Aneesh Kumar K.V wrote:
 From 7444a5eda33c00eea465b51c405cb830c57513b7 Mon Sep 17 00:00:00 2001
 From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 Date: Wed, 6 Mar 2013 12:50:37 +0530
 Subject: [PATCH] mm/THP: withdraw the pgtable after pmdp related operations
 
 For architectures like ppc64 we look at deposited pgtable when
 calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw
 after finishing pmdp related operations.
 
 Cc: Andrea Arcangeli aarca...@redhat.com
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 ---
  mm/huge_memory.c |8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

Reviewed-by: Andrea Arcangeli aarca...@redhat.com

 
 diff --git a/mm/huge_memory.c b/mm/huge_memory.c
 index 84f3180..21c5ebd 100644
 --- a/mm/huge_memory.c
 +++ b/mm/huge_memory.c
 @@ -1363,9 +1363,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
 vm_area_struct *vma,
   struct page *page;
   pgtable_t pgtable;
   pmd_t orig_pmd;
 - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
 + /*
 +  * For architectures like ppc64 we look at deposited pgtable
 +  * when calling pmdp_get_and_clear. So do the
 +  * pgtable_trans_huge_withdraw after finishing pmdp related
 +  * operations.
 +  */
   orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
   tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 + pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);

So I assume you're going to check the pmdp pointer address in
_withdraw, as the *pmd content is already clear. And that you're
checking the deposited pmd earlier in pmdp_get_and_clear. A bit
strange overall not seeing how exactly you're using the new parameter
and the deposited pmds, but safe.

Thanks,
Andrea


Re: [PATCH -V6 16/27] mm/THP: HPAGE_SHIFT is not a #define on some arch

2013-04-22 Thread Andrea Arcangeli
On Mon, Apr 22, 2013 at 03:30:50PM +0530, Aneesh Kumar K.V wrote:
 From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
 On archs like powerpc that support different hugepage sizes, HPAGE_SHIFT
 and other derived values like HPAGE_PMD_ORDER are not constants. So move
 that to hugepage_init
 
 Cc: Andrea Arcangeli aarca...@redhat.com
 
 Reviewed-by: David Gibson da...@gibson.dropbear.id.au
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com

Reviewed-by: Andrea Arcangeli aarca...@redhat.com 



Re: [PATCH -V6 17/27] mm/THP: Add pmd args to pgtable deposit and withdraw APIs

2013-04-22 Thread Andrea Arcangeli
On Mon, Apr 22, 2013 at 03:30:51PM +0530, Aneesh Kumar K.V wrote:
 From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
 This will be later used by powerpc THP support. In powerpc we want to use
 pgtable for storing the hash index values. So instead of adding them to
 mm_context list, we would like to store them in the second half of pmd
 
 Cc: Andrea Arcangeli aarca...@redhat.com

*snip*

  #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 -void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable)
 +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 + pgtable_t pgtable)
  {
   assert_spin_locked(&mm->page_table_lock);
  
 @@ -141,7 +142,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, 
 pgtable_t pgtable)
  #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  /* no address argument so destroys page coloring of some arch */
 -pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm)
 +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
  {
   pgtable_t pgtable;

This will add a micro overhead, with more variables put in certain
regs or on the stack. The micro overhead could be optimized away by
wrapping the call with a generic and per-arch header and by adding a
__ prefix to the above one in the generic .c file. I'm neutral, but I
pointed it out so others are free to comment on it.
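
For instance something like this could hide the extra argument on
archs that don't need it (only a sketch of the wrapping idea; the __
prefixed name is illustrative):

/* generic header, for archs that don't look at pmdp */
#ifndef pgtable_trans_huge_deposit
#define pgtable_trans_huge_deposit(mm, pmdp, pgtable) \
        __pgtable_trans_huge_deposit(mm, pgtable)
#endif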

Reviewed-by: Andrea Arcangeli aarca...@redhat.com


Re: [PATCH -V6 18/27] mm/THP: withdraw the pgtable after pmdp related operations

2013-04-22 Thread Andrea Arcangeli
Hi,

On Mon, Apr 22, 2013 at 03:30:52PM +0530, Aneesh Kumar K.V wrote:
 From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
 For architectures like ppc64 we look at deposited pgtable when
 calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw
 after finishing pmdp related operations.
 
 Cc: Andrea Arcangeli aarca...@redhat.com
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 ---
  mm/huge_memory.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/mm/huge_memory.c b/mm/huge_memory.c
 index 84f3180..2a43782 100644
 --- a/mm/huge_memory.c
 +++ b/mm/huge_memory.c
 @@ -1363,9 +1363,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
 vm_area_struct *vma,
   struct page *page;
   pgtable_t pgtable;
   pmd_t orig_pmd;
 - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
 +
   orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
   tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 + pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
   if (is_huge_zero_pmd(orig_pmd)) {
   tlb->mm->nr_ptes--;
   spin_unlock(&tlb->mm->page_table_lock);

I think here an inline comment (not only in the commit msg) is in
order. Otherwise it's hard to expect others to be aware of this arch
detail when they read the code later, so it would be prone to break
later without a comment.

Thanks,
Andrea


Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page

2012-09-25 Thread Andrea Arcangeli
Hi Kirill,

On Tue, Sep 25, 2012 at 05:27:03PM +0300, Kirill A. Shutemov wrote:
 On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote:
  Without repeatable hard numbers such code just gets into the 
  kernel and bitrots there as new CPU generations come in - a few 
  years down the line the original decisions often degrade to pure 
  noise. We've been there, we've done that, we don't want to 
  repeat it.
 
 sorry, for late answer..
 
 Hard numbers are hard.
 I've checked some workloads: Mosbench, NPB, specjvm2008. Most of time the
 patchset doesn't show any difference (within run-to-run deviation).
 On NPB it recovers THP regression, but it's probably not enough to make
 decision.
 
 It would be nice if somebody test the patchset on other system or
 workload. Especially, if the configuration shows regression with
 THP enabled.

If the only workload that gets a benefit is NPB then we have the proof
this is too hardware dependent to be a conclusive result.

It may have been slower by accident: things like the cache
associativity being off by one bit, combined with the implicit
coloring provided to the lowest 512 colors, could hurt more if the
cache associativity is low.

I'm saying this because NPB on a thinkpad (Intel CPU I assume) is the
benchmark that shows the most benefit among all benchmarks run on that
hardware.

http://www.phoronix.com/scan.php?page=article&item=linux_transparent_hugepages&num=2

I've once seen certain computations that run much slower with perfect
cache coloring, but most others run much faster with the page
coloring. That doesn't mean page coloring is bad per se. So NPB on
that specific hardware may have been the exception and not the
interesting case, especially considering the effect of cache-copying
is opposite on slightly different hw.

I think the static_key should be off by default whenever the CPU
L2 cache size is >= the size of the copy (2*HPAGE_PMD_SIZE). Now the
cache does random replacement so maybe we could also allow cache
copies for twice the size of the copy (L2size >=
4*HPAGE_PMD_SIZE). Current CPUs have caches much larger than 2*2MB...
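
In code the default could be decided at boot, something like this
(purely a sketch: cpu_l2_cache_size() is a hypothetical helper and the
static key name is assumed from the patchset):

static int __init clear_huge_page_nocache_default(void)
{
        /* cache big enough for the 2*HPAGE_PMD_SIZE copy: keep cached clears */
        if (cpu_l2_cache_size() >= 4 * HPAGE_PMD_SIZE)
                return 0;
        static_key_slow_inc(&clear_huge_page_nocache);
        return 0;
}
late_initcall(clear_huge_page_nocache_default);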

It would make a whole lot more sense for hugetlbfs giga pages than for
THP (unlike for THP, cache trashing with giga pages is guaranteed),
but even with giga pages it's not like they're allocated frequently
(maybe once per OS reboot), so that too is surely totally lost in the
noise as it only saves a few accesses after the cache copy is
finished.

It's good to have tested it though.

Thanks,
Andrea


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Andrea Arcangeli
Hi Kirill,

On Thu, Aug 16, 2012 at 06:15:53PM +0300, Kirill A. Shutemov wrote:
   for (i = 0; i < pages_per_huge_page;
        i++, p = mem_map_next(p, page, i)) {

It may be more optimal to avoid a multiplication/shiftleft before the
add, and to do:

for (i = 0, vaddr = haddr; i < pages_per_huge_page;
     i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {

   cond_resched();
 - clear_user_highpage(p, addr + i * PAGE_SIZE);
 + vaddr = haddr + i*PAGE_SIZE;

Not sure if gcc can optimize it away because of the external calls.

 + if (!ARCH_HAS_USER_NOCACHE || i == target)
 + clear_user_highpage(page + i, vaddr);
 + else
 + clear_user_highpage_nocache(page + i, vaddr);
   }


My only worry overall is whether there can be some workload where this
may actually slow down userland, if the CPU cache is very large and
userland accesses most of the faulted-in memory after the first
fault.

So I wouldn't mind adding one more check in addition to
!ARCH_HAS_USER_NOCACHE above, to check a runtime sysctl variable. It'll
waste a cacheline yes, but I doubt it's measurable compared to the time
it takes to do a >=2M hugepage copy.

Furthermore it would allow people to benchmark its effect without
having to rebuild the kernel themselves.
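
On top of the hunk quoted above that would look roughly like this
(a sketch; sysctl_clear_huge_page_nocache is a hypothetical knob):

if (!ARCH_HAS_USER_NOCACHE || !sysctl_clear_huge_page_nocache ||
    i == target)
        clear_user_highpage(page + i, vaddr);
else
        clear_user_highpage_nocache(page + i, vaddr);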

All other patches look fine to me.

Thanks!
Andrea


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Andrea Arcangeli
On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
 Hm.. I think with static_key we can avoid cache overhead here. I'll try.

Could you elaborate on the static_key? Is it some sort of self
modifying code?

 Thanks, for review. Could you take a look at huge zero page patchset? ;)

I've noticed that too, nice :). I'm checking some detail on the
wrprotect fault behavior but I'll comment there.


Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address

2012-08-16 Thread Andrea Arcangeli
On Thu, Aug 16, 2012 at 09:37:25PM +0300, Kirill A. Shutemov wrote:
 On Thu, Aug 16, 2012 at 08:29:44PM +0200, Andrea Arcangeli wrote:
  On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
   Hm.. I think with static_key we can avoid cache overhead here. I'll try.
  
  Could you elaborate on the static_key? Is it some sort of self
  modifying code?
 
 Runtime code patching. See Documentation/static-keys.txt. We can patch it
 on sysctl.

I guessed it had to be patching the code, thanks for the
pointer. It looks like a perfect fit for this one, agreed.


Re: [PATCH] powerpc: Fix assmption of end_of_DRAM() returns end address

2012-06-06 Thread Andrea Arcangeli
Hi,

On Wed, Jun 06, 2012 at 03:30:17PM +1000, Benjamin Herrenschmidt wrote:
 On Wed, 2012-06-06 at 00:46 +, Bhushan Bharat-R65777 wrote:
 
memblock_end_of_DRAM() returns end_address + 1, not end address.
While some code assumes that it returns end address.
   
Shouldn't we instead fix it the other way around ? IE, make
memblock_end_of_DRAM() does what the name implies, which is to
  return
the last byte of DRAM, and fix the -other- callers not to make bad
assumptions ?
   
   That was my impression too when I saw this patch.
  
  Initially I also intended to do so. I initiated a email on linux-mm@
  subject memblock_end_of_DRAM()  return end address + 1 and the only
  response I received from Andrea was:
  
  
  It's normal that end means first byte offset out of the range. End
  = not ok.
  end = start+size.
  This is true for vm_end too. So it's better to keep it that way.
  My suggestion is to just fix point 1 below and audit the rest :)
  
 
 Oh well, I don't care enough to fight this battle in my current state so

I hope you get well soon, Ben!

 unless Dave has more stamina than I have today, I'm ok with the patch.

Well it doesn't really matter in the end what is decided as long as
something is decided :). I was asked through a forward so I only
expressed my preference...