[PATCH 1/1] powerpc: fix off by one in max_zone_pfn initialization for ZONE_DMA
25078dc1f74be16b858e914f52cc8f4d03c2271a first introduced an off by one error in the ZONE_DMA initialization of PPC_BOOK3E_64=y and since 9739ab7eda459f0669ec9807e0d9be5020bab88c the off by one applies to PPC32=y too. This simply corrects the off by one and should resolve crashes like below: [ 65.179101] page 0x7fff outside node 0 zone DMA [ 0x0 - 0x7fff ] Unfortunately in various MM places "max" means a non inclusive end of range. free_area_init_nodes max_zone_pfn parameter is one case and MAX_ORDER is another one (unrelated) that comes by memory. Reported-by: Zorro Lang Fixes: 25078dc1f74b ("powerpc: use mm zones more sensibly") Fixes: 9739ab7eda45 ("powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac") Signed-off-by: Andrea Arcangeli --- arch/powerpc/mm/mem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 2540d3b2588c..2eda1ec36f55 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -249,7 +249,7 @@ void __init paging_init(void) #ifdef CONFIG_ZONE_DMA max_zone_pfns[ZONE_DMA] = min(max_low_pfn, - ((1UL << ARCH_ZONE_DMA_BITS) - 1) >> PAGE_SHIFT); + 1UL << (ARCH_ZONE_DMA_BITS - PAGE_SHIFT)); #endif max_zone_pfns[ZONE_NORMAL] = max_low_pfn; #ifdef CONFIG_HIGHMEM
Re: [PATCH] powerpc/powernv/npu: Remove redundant change_pte() hook
On Thu, Jan 31, 2019 at 06:30:22PM +0800, Peter Xu wrote: > The change_pte() notifier was designed to use as a quick path to > update secondary MMU PTEs on write permission changes or PFN changes. > For KVM, it could reduce the vm-exits when vcpu faults on the pages > that was touched up by KSM. It's not used to do cache invalidations, > for example, if we see the notifier will be called before the real PTE > update after all (please see set_pte_at_notify that set_pte_at was > called later). > > All the necessary cache invalidation should all be done in > invalidate_range() already. > > CC: Benjamin Herrenschmidt > CC: Paul Mackerras > CC: Michael Ellerman > CC: Alistair Popple > CC: Alexey Kardashevskiy > CC: Mark Hairgrove > CC: Balbir Singh > CC: David Gibson > CC: Andrea Arcangeli > CC: Jerome Glisse > CC: Jason Wang > CC: linuxppc-dev@lists.ozlabs.org > CC: linux-ker...@vger.kernel.org > Signed-off-by: Peter Xu > --- > arch/powerpc/platforms/powernv/npu-dma.c | 10 -- > 1 file changed, 10 deletions(-) Reviewed-by: Andrea Arcangeli It doesn't make sense to implement change_pte as an invalidate, change_pte is not compulsory to implement so if one wants to have invalidates only, change_pte method shouldn't be implemented in the first place and the common code will guarantee to invoke the range invalidates instead. Currently the whole change_pte optimization is effectively disabled as noted in past discussions with Jerome (because of the range invalidates that always surrounds it), so we need to revisit the whole change_pte logic and decide it to re-enable it or to drop it as a whole, but in the meantime it's good to cleanup spots like below that should leave change_pte alone. There are several examples of mmu_notifiers_ops in the kernel that don't implement change_pte, in fact it's the majority. Of all mmu notifier users, only nv_nmmu_notifier_ops, intel_mmuops_change and kvm_mmu_notifier_ops implements change_pte and as Peter found out by source review nv_nmmu_notifier_ops, intel_mmuops_change are wrong about it and should stop implementing it as an invalidate. In short change_pte is only implemented correctly from KVM which can really updates the spte and flushes the TLB but the spte update remains and could avoid a vmexit if we figure out how to re-enable the optimization safely (the TLB fill after change_pte in KVM EPT/shadow secondary MMU will be looked up by the CPU in hardware). If change_pte is implemented, it should update the mapping like KVM does and not do an invalidate. Thanks, Andrea > > diff --git a/arch/powerpc/platforms/powernv/npu-dma.c > b/arch/powerpc/platforms/powernv/npu-dma.c > index 3f58c7dbd581..c003b29d870e 100644 > --- a/arch/powerpc/platforms/powernv/npu-dma.c > +++ b/arch/powerpc/platforms/powernv/npu-dma.c > @@ -917,15 +917,6 @@ static void pnv_npu2_mn_release(struct mmu_notifier *mn, > mmio_invalidate(npu_context, 0, ~0UL); > } > > -static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn, > - struct mm_struct *mm, > - unsigned long address, > - pte_t pte) > -{ > - struct npu_context *npu_context = mn_to_npu_context(mn); > - mmio_invalidate(npu_context, address, PAGE_SIZE); > -} > - > static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn, > struct mm_struct *mm, > unsigned long start, unsigned long end) > @@ -936,7 +927,6 @@ static void pnv_npu2_mn_invalidate_range(struct > mmu_notifier *mn, > > static const struct mmu_notifier_ops nv_nmmu_notifier_ops = { > .release = pnv_npu2_mn_release, > - .change_pte = pnv_npu2_mn_change_pte, > .invalidate_range = pnv_npu2_mn_invalidate_range, > }; > > -- > 2.17.1 >
Re: [PATCH V6 1/4] mm/cma: Add PF flag to force non cma alloc
On Tue, Jan 08, 2019 at 10:21:07AM +0530, Aneesh Kumar K.V wrote: > This patch add PF_MEMALLOC_NOCMA which make sure any allocation in that > context > is marked non movable and hence cannot be satisfied by CMA region. > > This is useful with get_user_pages_cma_migrate where we take a page pin by > migrating pages from CMA region. Marking the section PF_MEMALLOC_NOCMA ensures > that we avoid uncessary page migration later. > > Suggested-by: Andrea Arcangeli > Signed-off-by: Aneesh Kumar K.V Reviewed-by: Andrea Arcangeli
Re: [PATCH V6 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get
Hello, On Tue, Jan 08, 2019 at 10:21:09AM +0530, Aneesh Kumar K.V wrote: > @@ -187,41 +149,25 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, > unsigned long ua, > goto unlock_exit; > } > > + ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages); In terms of gup APIs, I've been wondering if this shall become get_user_pages_longerm(FOLL_CMA_MIGRATE). So basically moving this CMA migrate logic inside get_user_pages_longerm. It depends if powerpc will ever need to bail on dax and/or if other non-powerpc vfio drivers which are already bailing on dax may also later optionally need to avoid interfering with CMA. Aside from the API detail above, this CMA page migration logic seems a good solution for the problem. Thanks, Andrea
Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count
On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote: > I think there is some memory barrier missing when the VMA is modified so > currently the modifications done in the VMA structure may not be written > down at the time the pte is locked. So doing that change will also requires > to call smp_wmb() before locking the page tables. In the current patch this > is ensured by the call to write_seqcount_end(). > Doing so will still require to have a memory barrier when touching the VMA. > Not sure we get far better performance compared to the sequence count > change. But I'll give it a try anyway ;) Luckily smp_wmb is a noop on x86. I would suggest to ignore the above issue completely if you give it a try, and then if this performs, we can just embed a smp_wmb() before spin_lock() somewhere in pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose spin_lock isn't a smp_wmb() equivalent. I would focus at flushing writes before every pagetable spin_lock for non-x86 archs, rather than after all vma modifications. That should be easier to keep under control and it's going to be more efficient too as if something there are fewer spin locks than vma modifications. For non-x86 archs we may then need a smp_wmb__before_spin_lock. That looks more self contained than surrounding all vma modifications and it's a noop on x86 anyway. I thought about the contention detection logic too yesterday: to detect contention we could have a mm->mmap_sem_contention_jiffies and if down_read_trylock_exclusive() [same as down_read_if_not_hold in prev mail] fails (and it'll fail if either read or write mmap_sem is hold, so also convering mremap/mprotect etc..) we set mm->mmap_sem_contention_jiffies = jiffies and then to know if you must not touch the mmap_sem at all, you compare jiffies against mmap_sem_contention_jiffies, if it's equal we go speculative. If that's not enough we can just keep going speculative for a few more jiffies with time_before(). The srcu lock is non concerning because the inc/dec of the fast path is in per-cpu cacheline of course, no false sharing possible there or it wouldn't be any better than a normal lock. The vma revalidation is already done by khugepaged and mm/userfaultfd, both need to drop the mmap_sem and continue working on the pagetables, so we already know it's workable and not too slow. Summarizing.. by using a runtime contention triggered speculative design that goes speculative only when contention is runtime-detected using the above logic (or equivalent), and by having to revalidate the vma by hand with find_vma without knowing instantly if the vma become stale, we will run with a substantially slower speculative page fault than with your current speculative always-on design, but the slower speculative page fault runtime will still scale 100% in SMP so it should still be faster on large SMP systems. The pros is that it won't regress the mmap/brk vma modifications. The whole complexity of tracking the vma modifications should also go away and the resulting code should be more maintainable and less risky to break in subtle ways impossible to reproduce. Thanks! Andrea
Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count
Hello Laurent, Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea...@intel.com> shows significant slowdown even for brk/malloc ops both single and multi threaded. The single threaded case I think is the most important because it has zero chance of getting back any benefit later during page faults. Could you check if: 1. it's possible change vm_write_begin to be a noop if mm->mm_count is <= 1? Hint: clone() will run single threaded so there's no way it can run in the middle of a being/end critical section (clone could set an MMF flag to possibly keep the sequence counter activated if a child thread exits and mm_count drops to 1 while the other cpu is in the middle of a critical section in the other thread). 2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU freeing happened only once a MMF flag is set? That will at least reduce the risk of temporary memory waste until the next RCU grace period. The read of the MMF will scale fine. Of course to allow point 1 and 2 then the page fault should also take the mmap_sem until the MMF flag is set. Could you also investigate a much bigger change: I wonder if it's possible to drop the sequence number entirely from the vma and stop using sequence numbers entirely (which is likely the source of the single threaded regression in point 1 that may explain the report in the above message-id), and just call the vma rbtree lookup once again and check that everything is still the same in the vma and the PT lock obtained is still a match to finish the anon page fault and fill the pte? Then of course we also need to add a method to the read-write semaphore so it tells us if there's already one user holding the read mmap_sem and we're the second one. If we're the second one (or more than second) only then we should skip taking the down_read mmap_sem. Even a multithreaded app won't ever skip taking the mmap_sem until there's sign of runtime contention, and it won't have to run the way more expensive sequence number-less revalidation during page faults, unless we get an immediate scalability payoff because we already know the mmap_sem is already contended and there are multiple nested threads in the page fault handler of the same mm. Perhaps we'd need something more advanced than a down_read_trylock_if_not_hold() (which has to guaranteed not to write to any cacheline) and we'll have to count the per-thread exponential backoff of mmap_sem frequency, but starting with down_read_trylock_if_not_hold() would be good I think. This is not how the current patch works, the current patch uses a sequence number because it pretends to go lockless always and in turn has to slow down all vma updates fast paths or the revalidation slowsdown performance for page fault too much (as it always revalidates). I think it would be much better to go speculative only when there's "detected" runtime contention on the mmap_sem with down_read_trylock_if_not_hold() and that will make the revalidation cost not an issue to worry about because normally we won't have to revalidate the vma at all during page fault. In turn by making the revalidation more expensive by starting a vma rbtree lookup from scratch, we can drop the sequence number entirely and that should simplify the patch tremendously because all vm_write_begin/end would disappear from the patch and in turn the mmap/brk slowdown measured by the message-id above, should disappear as well. Thanks, Andrea
Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless
Hello Jerome, On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote: > +Case A is obvious you do not want to take the risk for the device to write to > +a page that might now be use by some completely different task. used > +is true ven if the thread doing the page table update is preempted right > after even > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 90731e3b7e58..5706252b828a 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct > vm_fault *vmf, pmd_t orig_pmd, > goto out_free_pages; > VM_BUG_ON_PAGE(!PageHead(page), page); > > + /* > + * Leave pmd empty until pte is filled note we must notify here as > + * concurrent CPU thread might write to new page before the call to > + * mmu_notifier_invalidate_range_end() happen which can lead to a happens > + * device seeing memory write in different order than CPU. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd); > - /* leave pmd empty until pte is filled */ > Here we can change the following mmu_notifier_invalidate_range_end to skip calling ->invalidate_range. It could be called mmu_notifier_invalidate_range_only_end, or other suggestions welcome. Otherwise we'll repeat the call for nothing. We need it inside the PT lock for the ordering issue, but we don't need to run it twice. Same in do_huge_pmd_wp_page, wp_page_copy and migrate_vma_insert_page. Every time *clear_flush_notify is used mmu_notifier_invalidate_range_only_end should be called after it, instead of mmu_notifier_invalidate_range_end. I think optimizing that part too, fits in the context of this patchset (if not in the same patch), because the objective is still the same: to remove unnecessary ->invalidate_range calls. > + * No need to notify as we downgrading page are > + * table protection not changing it to point > + * to a new page. > + * > + * No need to notify as we downgrading page table to read only are > + * No need to notify as we replacing a read only page with another are > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, > struct vm_area_struct *vma, > if (pte_soft_dirty(pteval)) > swp_pte = pte_swp_mksoft_dirty(swp_pte); > set_pte_at(mm, address, pvmw.pte, swp_pte); > - } else > + } else { > + /* > + * We should not need to notify here as we reach this > + * case only from freeze_page() itself only call from > + * split_huge_page_to_list() so everything below must > + * be true: > + * - page is not anonymous > + * - page is locked > + * > + * So as it is a shared page and it is locked, it can > + * not be remove from the page cache and replace by > + * a new page before mmu_notifier_invalidate_range_end > + * so no concurrent thread might update its page table > + * to point at new page while a device still is using > + * this page. > + * > + * But we can not assume that new user of try_to_unmap > + * will have that in mind so just to be safe here call > + * mmu_notifier_invalidate_range() > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > dec_mm_counter(mm, mm_counter_file(page)); > + mmu_notifier_invalidate_range(mm, address, > + address + PAGE_SIZE); > + } > discard: > + /* > + * No need to call mmu_notifier_invalidate_range() as we are > + * either replacing a present pte with non present one (either > + * a swap or special one). We handling the clearing pte case > + * above. > + * > + * See Documentation/vm/mmu_notifier.txt > + */ > page_remove_rmap(subpage, PageHuge(page)); > put_page(page); > - mmu_notifier_invalidate_range(mm, address, > - address + PAGE_SIZE); > } > > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end); That is the path that unmaps filebacked pages (btw, not necessarily shared unlike comment says, they can be private but still filebacked). I'd like some more explanation about the inner working of "that new
Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback v2
On Thu, Aug 31, 2017 at 05:17:25PM -0400, Jerome Glisse wrote: > Jérôme Glisse (13): > dax: update to new mmu_notifier semantic > mm/rmap: update to new mmu_notifier semantic > powerpc/powernv: update to new mmu_notifier semantic > drm/amdgpu: update to new mmu_notifier semantic > IB/umem: update to new mmu_notifier semantic > IB/hfi1: update to new mmu_notifier semantic > iommu/amd: update to new mmu_notifier semantic > iommu/intel: update to new mmu_notifier semantic > misc/mic/scif: update to new mmu_notifier semantic > sgi-gru: update to new mmu_notifier semantic > xen/gntdev: update to new mmu_notifier semantic > KVM: update to new mmu_notifier semantic > mm/mmu_notifier: kill invalidate_page Reviewed-by: Andrea Arcangeli <aarca...@redhat.com>
Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory
Hi Aneesh, On Mon, Oct 27, 2014 at 11:28:41PM +0530, Aneesh Kumar K.V wrote: VM_BUG_ON(address ~HPAGE_PMD_MASK); if (pmd_trans_huge(*pmdp)) { pmd = pmdp_get_and_clear(vma-vm_mm, address, pmdp); } else { The only problematic path that needs IPI is the below one yes. /* * khugepaged calls this for normal pmd */ pmd = *pmdp; pmd_clear(pmdp); /* * Wait for all pending hash_page to finish. This is needed * in case of subpage collapse. When we collapse normal pages * to hugepage, we first clear the pmd, then invalidate all * the PTE entries. The assumption here is that any low level * page fault will see a none pmd and take the slow path that * will wait on mmap_sem. But we could very well be in a * hash_page with local ptep pointer value. Such a hash page * can result in adding new HPTE entries for normal subpages. * That means we could be modifying the page content as we * copy them to a huge page. So wait for parallel hash_page * to finish before invalidating HPTE entries. We can do this * by sending an IPI to all the cpus and executing a dummy * function there. */ kick_all_cpus_sync(); We already do an IPI for ppc64. Agreed, ppc64 is already covered. sparc/arm seem to be using the generic pmdp_clear_flush implementation instead, which just calls flush_tlb_range, so perhaps they aren't. As above, the IPIs are only needed if the *pmd is not transhuge. Thanks, Andrea ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory
Hello, On Mon, Oct 27, 2014 at 07:50:41AM +1100, Benjamin Herrenschmidt wrote: On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote: Parisc does this. As soon as one CPU issues a TLB purge, it's broadcast to all the CPUs on the inter-CPU bus. The next instruction isn't executed until they respond. But this is only for our CPU TLB. There's no other external consequence, so removal from the page tables isn't effected by this TLB flush, therefore the theory on which Dave bases the change to atomic_add() should work for us (of course, atomic_add is lock add unlock on our CPU, so it's not going to be of much benefit). I'm not sure I follow you here. Do you or do you now perform an IPI to do TLB flushes ? If you don't (for example because you have HW broadcast), then you need the speculative get_page(). If you do (and can read a PTE atomically), you can get away with atomic_add(). The reason is that if you remember how zap_pte_range works, we perform the flush before we get rid of the page. So if your using IPIs for the flush, the fact that gup_fast has interrupts disabled will delay the IPI response and thus effectively prevent the pages from being actually freed, allowing us to simply do the atomic_add() on x86. But if we don't use IPIs because we have HW broadcast of TLB invalidations, then we don't have that synchronization. atomic_add won't work, we need get_page_speculative() because the page could be concurrently being freed. I looked at how this works more closely and I agree get_page_unless_zero is always necessary if the TLB flush doesn't always wait for IPIs to all CPUs where a gup_fast may be running onto. To summarize, the pagetables are freed with RCU (arch sets HAVE_RCU_TABLE_FREE) and that allows to walk them lockless with RCU. After we can walk the pagetables lockless with RCU, we get to the page lockless, but the pages themself can still be freed at any time from under us (hence the need for get_page_unless_zero). The additional trick gup_fast RCU does is to recheck the pte after elevating the page count with get_page_unless_zero. Rechecking the pte/hugepmd to be sure it didn't change from under us is critical to be sure get_page_unless_zero didn't run after the page was freed and reallocated which would otherwise lead to a security problem too (i.e. it protects against get_page_unless_zero false positives). The last bit required is to still disable irqs like on x86 to serialize against THP splits combined with pmdp_splitting_flush always delivering IPIs (pmdp_splitting_flush must wait all gup_fast to complete before proceeding in mangling the page struct of the compound page). Preventing the irq disable while taking a gup_fast pin using compound_lock isn't as easy as it is to do for put_page. put_page (non-compound) fastest path remains THP agnostic because collapse_huge_page is inhibited by any existing gup pin, but here we're exactly taking it, so we can't depend on it to already exist to avoid the race with collapse_huge_page. It's not just split_huge_page we need to protect against. So while thinking the above summary, I noticed this patch misses a IPI in mm/huge_memory.c that must be delivered after pmdp_clear_flush below to be safe against collapse_huge_page for the same reasons it sends it within pmdp_splitting_flush. Without this IPI what can happen is that the GUP pin protection in __collapse_huge_page_isolate races against gup_fast-RCU. If gup_fast reads the pte on one CPU before pmdp_clear_flush, and on the other CPU __collapse_huge_page_isolate succeeds, then gup_fast could recheck the pte that hasn't been zapped yet by __collapse_huge_page_copy. gup_fast would succeed because the pte wasn't zapped yet, but then __collapse_huge_page_copy would run replacing the pte with a transhuge pmd, making gup_fast return the old page, while the process got the copy as part of the collapsed hugepage. /* * After this gup_fast can't run anymore. This also removes ^ - invariant broken by gup_fast-RCU * any huge TLB entry from the CPU so we won't allow * huge and small TLB entries for the same virtual address * to avoid the risk of CPU bugs in that area. */ _pmd = pmdp_clear_flush(vma, address, pmd); spin_unlock(pmd_ptl); mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); spin_lock(pte_ptl); isolated = __collapse_huge_page_isolate(vma, address, pte); spin_unlock(pte_ptl); CPU0CPU1 - - gup_fast-RCU local_irq_disable() pte = pte_offset_map(pmd, address) pmdp_clear_flush (not sending IPI - bug) __collapse_huge_page_isolate - succeeds (page_count != 1 gup-pin check of
Re: Error in frreing hugepages with preemption enabled
Hi everyone, On Fri, Nov 29, 2013 at 12:13:03PM +0100, Alexander Graf wrote: On 29.11.2013, at 05:38, Bharat Bhushan bharat.bhus...@freescale.com wrote: Hi Alex, I am running KVM guest with host kernel having CONFIG_PREEMPT enabled. With allocated pages things seems to work fine but I uses hugepages for guest I see below prints when quit from qemu. (qemu) QEMU waiting for connection on: telnet:0.0.0.0:,server qemu-system-ppc64: pci_add_option_rom: failed to find romfile efi-virtio.rom q debug_smp_processor_id: 15 callbacks suppressed BUG: using smp_processor_id() in preemptible [] code: qemu-system-ppc/2504 caller is .free_hugepd_range+0xb0/0x21c CPU: 1 PID: 2504 Comm: qemu-system-ppc Not tainted 3.12.0-rc3-07733-gabf4907 #175 Call Trace: [c000fb433400] [c0007d38] .show_stack+0x7c/0x1cc (unreliable) [c000fb4334d0] [c05e8ce0] .dump_stack+0x9c/0xf4 [c000fb433560] [c02de5ec] .debug_smp_processor_id+0x108/0x11c [c000fb4335f0] [c0025e10] .free_hugepd_range+0xb0/0x21c [c000fb433680] [c00265bc] .hugetlb_free_pgd_range+0x2c8/0x3b0 [c000fb4337a0] [c00e428c] .free_pgtables+0x14c/0x158 [c000fb433840] [c00ef320] .exit_mmap+0xec/0x194 [c000fb433960] [c004d780] .mmput+0x64/0x124 [c000fb4339e0] [c0051f40] .do_exit+0x29c/0x9c8 [c000fb433ae0] [c00527c8] .do_group_exit+0x50/0xc4 [c000fb433b70] [c00606a0] .get_signal_to_deliver+0x21c/0x5d8 [c000fb433c70] [c0009b08] .do_signal+0x54/0x278 [c000fb433db0] [c0009e50] .do_notify_resume+0x64/0x78 [c000fb433e30] [cb44] .ret_from_except_lite+0x70/0x74 This mean that free_hugepd_range() must be called with preemption enabled. with preemption disabled. I tried below change and this seems to work fine (I am not having expertise in this area so not sure this is correct way) Not sure - the scope looks odd to me. Let's ask Andrea - I'm sure he knows what to do :). :) So I had a look at the top of this function (0xb0) in the upstream kernel and no smp_processor_id() call is apparent, is this stock git or a ppc tree? The first few calls seem not to call it but I may have overlooked something. It's just quicker if somebody with vmlinux finds the location of it. static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift, unsigned long start, unsigned long end, unsigned long floor, unsigned long ceiling) { pte_t *hugepte = hugepd_page(*hpdp); int i; unsigned long pdmask = ~((1UL pdshift) - 1); unsigned int num_hugepd = 1; #ifdef CONFIG_PPC_FSL_BOOK3E /* Note: On fsl the hpdp may be the first of several */ num_hugepd = (1 (hugepd_shift(*hpdp) - pdshift)); #else unsigned int shift = hugepd_shift(*hpdp); #endif start = pdmask; if (start floor) return; if (ceiling) { ceiling = pdmask; if (! ceiling) return; } if (end - 1 ceiling - 1) return; for (i = 0; i num_hugepd; i++, hpdp++) hpdp-pd = 0; tlb-need_flush = 1; #ifdef CONFIG_PPC_FSL_BOOK3E hugepd_free(tlb, hugepte); #else pgtable_free_tlb(tlb, hugepte, pdshift - shift); #endif } Generally smp_processor_id should never be used, exactly to avoid problems like above with preempion enabled in .config. Instead it should be replaced with a get_cpu()/put_cpu() pair that is exactly meant to fix bugs like this and define proper critical sections around the per-cpu variables. #define get_cpu() ({ preempt_disable(); smp_processor_id(); }) #define put_cpu() preempt_enable() After you find where that smp_processor_id() is located, you should simply replace it with a get_cpu() and then you should insert a put_cpu immediately after the cpu info is not used anymore. That will define a proper and strict critical section around the use of the per-cpu variables. With a ppc vmlinux it should be immediate to find the location of smp_processor_id but I don't have the ppc vmlinux here. Thanks! Andrea Alex diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index d67db4b..6bf8459 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -563,8 +563,10 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud, */ next = addr + (1 hugepd_shift(*(hugepd_t *)pmd)); #endif + preempt_disable(); free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT, addr, next, floor, ceiling); + preempt_enable(); } while (addr = next, addr != end);
Re: [PATCH -V6 18/27] mm/THP: withdraw the pgtable after pmdp related operations
Hi, On Wed, Apr 24, 2013 at 02:38:01PM +0530, Aneesh Kumar K.V wrote: From 7444a5eda33c00eea465b51c405cb830c57513b7 Mon Sep 17 00:00:00 2001 From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Date: Wed, 6 Mar 2013 12:50:37 +0530 Subject: [PATCH] mm/THP: withdraw the pgtable after pmdp related operations For architectures like ppc64 we look at deposited pgtable when calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw after finishing pmdp related operations. Cc: Andrea Arcangeli aarca...@redhat.com Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- mm/huge_memory.c |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) Reviewed-by: Andrea Arcangeli aarca...@redhat.com diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 84f3180..21c5ebd 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1363,9 +1363,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, struct page *page; pgtable_t pgtable; pmd_t orig_pmd; - pgtable = pgtable_trans_huge_withdraw(tlb-mm, pmd); + /* + * For architectures like ppc64 we look at deposited pgtable + * when calling pmdp_get_and_clear. So do the + * pgtable_trans_huge_withdraw after finishing pmdp related + * operations. + */ orig_pmd = pmdp_get_and_clear(tlb-mm, addr, pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); + pgtable = pgtable_trans_huge_withdraw(tlb-mm, pmd); So I assume you're going to check the pmdp pointer address in _withdraw, as the *pmd content is already clear. And that you're checking the deposited pmd earlier in pmdp_get_and_clear. A bit strange overall not seeing how exactly you're using the new parameter and the deposited pmds, but safe. Thanks, Andrea ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH -V6 16/27] mm/THP: HPAGE_SHIFT is not a #define on some arch
On Mon, Apr 22, 2013 at 03:30:50PM +0530, Aneesh Kumar K.V wrote: From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com On archs like powerpc that support different hugepage sizes, HPAGE_SHIFT and other derived values like HPAGE_PMD_ORDER are not constants. So move that to hugepage_init Cc: Andrea Arcangeli aarca...@redhat.com Reviewed-by: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com Reviewed-by: Andrea Arcangeli aarca...@redhat.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH -V6 17/27] mm/THP: Add pmd args to pgtable deposit and withdraw APIs
On Mon, Apr 22, 2013 at 03:30:51PM +0530, Aneesh Kumar K.V wrote: From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com This will be later used by powerpc THP support. In powerpc we want to use pgtable for storing the hash index values. So instead of adding them to mm_context list, we would like to store them in the second half of pmd Cc: Andrea Arcangeli aarca...@redhat.com *snip* #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT #ifdef CONFIG_TRANSPARENT_HUGEPAGE -void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable) +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, + pgtable_t pgtable) { assert_spin_locked(mm-page_table_lock); @@ -141,7 +142,7 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pgtable_t pgtable) #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* no address argument so destroys page coloring of some arch */ -pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm) +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp) { pgtable_t pgtable; This will add micro overhead with more variables put in certain regs or stack. The micro overhead could be optimized away by wrapping the call with a generic and per-arch header and by adding a __ prefix to the above one in the generic .c file. I'm neutral but I pointed out so others are free to comment on it. Reviewed-by: Andrea Arcangeli aarca...@redhat.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH -V6 18/27] mm/THP: withdraw the pgtable after pmdp related operations
Hi, On Mon, Apr 22, 2013 at 03:30:52PM +0530, Aneesh Kumar K.V wrote: From: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com For architectures like ppc64 we look at deposited pgtable when calling pmdp_get_and_clear. So do the pgtable_trans_huge_withdraw after finishing pmdp related operations. Cc: Andrea Arcangeli aarca...@redhat.com Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com --- mm/huge_memory.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 84f3180..2a43782 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1363,9 +1363,10 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, struct page *page; pgtable_t pgtable; pmd_t orig_pmd; - pgtable = pgtable_trans_huge_withdraw(tlb-mm, pmd); + orig_pmd = pmdp_get_and_clear(tlb-mm, addr, pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); + pgtable = pgtable_trans_huge_withdraw(tlb-mm, pmd); if (is_huge_zero_pmd(orig_pmd)) { tlb-mm-nr_ptes--; spin_unlock(tlb-mm-page_table_lock); I think here a comment inline (not only in the commit msg) is in order. Otherwise it's hard to imagine others to be aware of this arch detail when they will read the code later. So it would be prone to break later without a comment. Thanks, Andrea ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v4 0/8] Avoid cache trashing on clearing huge/gigantic page
Hi Kirill, On Tue, Sep 25, 2012 at 05:27:03PM +0300, Kirill A. Shutemov wrote: On Fri, Sep 14, 2012 at 07:52:10AM +0200, Ingo Molnar wrote: Without repeatable hard numbers such code just gets into the kernel and bitrots there as new CPU generations come in - a few years down the line the original decisions often degrade to pure noise. We've been there, we've done that, we don't want to repeat it. sorry, for late answer.. Hard numbers are hard. I've checked some workloads: Mosbench, NPB, specjvm2008. Most of time the patchset doesn't show any difference (within run-to-run deviation). On NPB it recovers THP regression, but it's probably not enough to make decision. It would be nice if somebody test the patchset on other system or workload. Especially, if the configuration shows regression with THP enabled. If the only workload that gets a benefit is NPB then we've the proof this is too hardware dependend to be a conclusive result. It may have been slower by an accident, things like cache associativity off by one bit, combined with the implicit coloring provided to the lowest 512 colors could hurts more if the cache associativity is low. I'm saying this because NPB on a thinkpad (Intel CPU I assume) is the benchmark that shows the most benefit among all benchmarks run on that hardware. http://www.phoronix.com/scan.php?page=articleitem=linux_transparent_hugepagesnum=2 I've once seen certain computations that run much slower with perfect cache coloring but most others runs much faster with the page coloring. Doesn't mean page coloring is bad per se. So the NPB on that specific hardware may have been the exception and not the interesting case. Especially considering the effect of cache-copying is opposite on slightly different hw. I think the the static_key should be off by default whenever the CPU L2 cache size is = the size of the copy (2*HPAGE_PMD_SIZE). Now the cache does random replacement so maybe we could also allow cache copies for twice the size of the copy (L2size = 4*HPAGE_PMD_SIZE). Current CPUs have caches much larger than 2*2MB... It would make a whole lot more sense for hugetlbfs giga pages than for THP (unlike for THP, cache trashing with giga pages is guaranteed), but even with giga pages, it's not like they're allocated frequently (maybe once per OS reboot) so that's also sure totally lost in the noise as it only saves a few accesses after the cache copy is finished. It's good to have tested it though. Thanks, Andrea ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
Hi Kirill, On Thu, Aug 16, 2012 at 06:15:53PM +0300, Kirill A. Shutemov wrote: for (i = 0; i pages_per_huge_page; i++, p = mem_map_next(p, page, i)) { It may be more optimal to avoid a multiplication/shiftleft before the add, and to do: for (i = 0, vaddr = haddr; i pages_per_huge_page; i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) { cond_resched(); - clear_user_highpage(p, addr + i * PAGE_SIZE); + vaddr = haddr + i*PAGE_SIZE; Not sure if gcc can optimize it away because of the external calls. + if (!ARCH_HAS_USER_NOCACHE || i == target) + clear_user_highpage(page + i, vaddr); + else + clear_user_highpage_nocache(page + i, vaddr); } My only worry overall is if there can be some workload where this may actually slow down userland if the CPU cache is very large and userland would access most of the faulted in memory after the first fault. So I wouldn't mind to add one more check in addition of !ARCH_HAS_USER_NOCACHE above to check a runtime sysctl variable. It'll waste a cacheline yes but I doubt it's measurable compared to the time it takes to do a =2M hugepage copy. Furthermore it would allow people to benchmark its effect without having to rebuild the kernel themself. All other patches looks fine to me. Thanks! Andrea ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote: Hm.. I think with static_key we can avoid cache overhead here. I'll try. Could you elaborate on the static_key? Is it some sort of self modifying code? Thanks, for review. Could you take a look at huge zero page patchset? ;) I've noticed that too, nice :). I'm checking some detail on the wrprotect fault behavior but I'll comment there. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
On Thu, Aug 16, 2012 at 09:37:25PM +0300, Kirill A. Shutemov wrote: On Thu, Aug 16, 2012 at 08:29:44PM +0200, Andrea Arcangeli wrote: On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote: Hm.. I think with static_key we can avoid cache overhead here. I'll try. Could you elaborate on the static_key? Is it some sort of self modifying code? Runtime code patching. See Documentation/static-keys.txt. We can patch it on sysctl. I guessed it had to be patching the code, thanks for the pointer. It looks a perfect fit for this one agreed. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc: Fix assmption of end_of_DRAM() returns end address
Hi, On Wed, Jun 06, 2012 at 03:30:17PM +1000, Benjamin Herrenschmidt wrote: On Wed, 2012-06-06 at 00:46 +, Bhushan Bharat-R65777 wrote: memblock_end_of_DRAM() returns end_address + 1, not end address. While some code assumes that it returns end address. Shouldn't we instead fix it the other way around ? IE, make memblock_end_of_DRAM() does what the name implies, which is to return the last byte of DRAM, and fix the -other- callers not to make bad assumptions ? That was my impression too when I saw this patch. Initially I also intended to do so. I initiated a email on linux-mm@ subject memblock_end_of_DRAM() return end address + 1 and the only response I received from Andrea was: It's normal that end means first byte offset out of the range. End = not ok. end = start+size. This is true for vm_end too. So it's better to keep it that way. My suggestion is to just fix point 1 below and audit the rest :) Oh well, I don't care enough to fight this battle in my current state so I wish you to get well soon Ben! unless Dave has more stamina than I have today, I'm ok with the patch. Well it doesn't really matter in the end what is decided as long as something is decided :). I was asked through a forward so I only expressed my preference... ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev