Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
John Hubbard writes:
>> Is that what you meant?
>
> Yes.

I am still trying to understand this issue. I am also analyzing some cases where
interrupt disable is not done before the lockless pagetable walk (patch 3
discussion).

But given that I forgot to add the mm mailing list before, I think it would be
wiser to send a v3 and gather feedback while I keep trying to understand how it
works, and whether it needs an additional memory barrier here.

Thanks!

Leonardo Bras
Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
On 9/23/19 1:23 PM, Leonardo Bras wrote:
> On Mon, 2019-09-23 at 12:58 -0700, John Hubbard wrote:
>>
>>     CPU 0                           CPU 1
>>     -----                           -----
>>                                     READ(pte)         (re-ordered at run time)
>>                                     atomic_inc(val)   (no run-time memory barrier!)
>>     pmd_clear(pte)
>>     if (val)
>>         run_on_all_cpus(): IPI
>>                                     local_irq_disable()  (also not a mem barrier)
>>
>>                                     if (pte)
>>                                         walk page tables
>
> Let me see if I can understand,
> On most patches, it would be:
>
>     CPU 0                           CPU 1
>     -----                           -----
>                                     ptep = __find_linux_pte   (re-ordered at run time)
>                                     atomic_inc(val)
>     pmd_clear(pte)
>     smp_mb()
>     if (val)
>         run_on_all_cpus(): IPI
>                                     local_irq_disable()
>
>                                     if (ptep)
>                                         pte = *ptep;
>
> Is that what you meant?
>

Yes.

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
On Mon, 2019-09-23 at 12:58 -0700, John Hubbard wrote:
>
>     CPU 0                           CPU 1
>     -----                           -----
>                                     READ(pte)         (re-ordered at run time)
>                                     atomic_inc(val)   (no run-time memory barrier!)
>     pmd_clear(pte)
>     if (val)
>         run_on_all_cpus(): IPI
>                                     local_irq_disable()  (also not a mem barrier)
>
>                                     if (pte)
>                                         walk page tables

Let me see if I can understand,
On most patches, it would be:

    CPU 0                           CPU 1
    -----                           -----
                                    ptep = __find_linux_pte   (re-ordered at run time)
                                    atomic_inc(val)
    pmd_clear(pte)
    smp_mb()
    if (val)
        run_on_all_cpus(): IPI
                                    local_irq_disable()

                                    if (ptep)
                                        pte = *ptep;

Is that what you meant?
Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
On Mon, 2019-09-23 at 11:14 -0700, John Hubbard wrote:
> On 9/23/19 10:25 AM, Leonardo Bras wrote:
> [...]
> That part is all fine, but there are no run-time memory barriers in the
> atomic_inc() and atomic_dec() additions, which means that this is not
> safe, because memory operations on CPU 1 can be reordered. It's safe
> as shown *if* there are memory barriers to keep the order as shown:
>
>     CPU 0                           CPU 1
>     -----                           -----
>                                     atomic_inc(val)   (no run-time memory barrier!)
>     pmd_clear(pte)
>     if (val)
>         run_on_all_cpus(): IPI
>                                     local_irq_disable()  (also not a mem barrier)
>
>                                     READ(pte)
>                                     if (pte)
>                                         walk page tables
>
>                                     local_irq_enable()   (still not a barrier)
>                                     atomic_dec(val)
>
>     free(pte)
>
> thanks,

This is serialize:

    void serialize_against_pte_lookup(struct mm_struct *mm)
    {
            smp_mb();
            if (running_lockless_pgtbl_walk(mm))
                    smp_call_function_many(mm_cpumask(mm), do_nothing,
                                           NULL, 1);
    }

That would mean:

    CPU 0                           CPU 1
    -----                           -----
                                    atomic_inc(val)
    pmd_clear(pte)
    smp_mb()
    if (val)
        run_on_all_cpus(): IPI
                                    local_irq_disable()

                                    READ(pte)
                                    if (pte)
                                        walk page tables

                                    local_irq_enable()   (still not a barrier)
                                    atomic_dec(val)

By https://www.kernel.org/doc/Documentation/memory-barriers.txt :
'If you need all the CPUs to see a given store at the same time, use smp_mb().'

Is it not enough?
Do you suggest adding 'smp_mb()' after atomic_{inc,dec}?
Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
On Fri, 2019-09-20 at 17:48 -0700, John Hubbard wrote:
> [...]
> So it seems that full memory barriers (not just compiler barriers) are
> required.
> If the irq enable/disable somehow provides that, then your new code just goes
> along for the ride and Just Works. (You don't have any memory barriers in
> start_lockless_pgtbl_walk() / end_lockless_pgtbl_walk(), just the compiler
> barriers provided by the atomic inc/dec.)
>
> So it's really a pre-existing question about the correctness of the gup_fast()
> irq disabling approach.

I am not experienced with other archs, and I am still pretty new to Power, but
from what I could understand, this behavior is best explained by
serialize_against_pte_lookup.

What happens here is that, before doing a THP split/collapse, the function
updates the pmd and calls serialize_against_pte_lookup, in order to avoid
invalid output from a lockless pagetable walk.

Serialize basically runs a do_nothing on every cpu related to the process and
waits for it to return. Running it depends on interrupts being enabled, so
disabling them before gup_pgd_range() and re-enabling them after the end makes
the THP split/collapse wait for gup_pgd_range() completion on every cpu before
continuing. (Here is where the "lock" happens.)

(As told before, every gup_pgd_range() that occurs after it uses an updated
pmd, so no problem.)

I am sure other archs may have a similar mechanism using
local_irq_{disable,enable}.

Did it answer your questions?

Best regards,
Leonardo Bras
Re: [PATCH v2 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
On Fri, 2019-09-20 at 13:11 -0700, John Hubbard wrote:
> On 9/20/19 12:50 PM, Leonardo Bras wrote:
> > Skips slow part of serialize_against_pte_lookup if there is no running
> > lockless pagetable walk.
> >
> > Signed-off-by: Leonardo Bras
> > ---
> >  arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> > index 13239b17a22c..41ca30269fa3 100644
> > --- a/arch/powerpc/mm/book3s64/pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/pgtable.c
> > @@ -95,7 +95,8 @@ static void do_nothing(void *unused)
> >  void serialize_against_pte_lookup(struct mm_struct *mm)
> >  {
> >  	smp_mb();
> > -	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
> > +	if (running_lockless_pgtbl_walk(mm))
> > +		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>
> Hi,
>
> If you do this, then you are left without any synchronization. So it will
> have race conditions: a page table walk could begin right after the above
> check returns "false", and then code such as hash__pmdp_huge_get_and_clear()
> will continue on right away, under the false assumption that it has let
> all the current page table walks complete.
>
> The current code uses either interrupts or RCU to synchronize, and in
> either case, you end up scheduling something on each CPU. If you remove
> that entirely, I don't see anything left. ("Pure" atomic counting is not
> a synchronization technique all by itself.)
>
> thanks,

Hello John,
Thanks for the fast feedback.

See, before calling serialize_against_pte_lookup(), there is always an update
or clear of the pmd. So, if a page table walk begins right after the check
returns "false", there is no problem, since it will use the updated pmd.

Think about serialize on a process with a bunch of cpus: after you check the
last processor (the wait part), there is no guarantee that the first one is
not starting a lockless pagetable walk.
The same mechanism protects both methods.

Does it make sense?

Best regards,
Leonardo Bras