Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-27 Thread Andrea Arcangeli
Hi Aneesh,

On Mon, Oct 27, 2014 at 11:28:41PM +0530, Aneesh Kumar K.V wrote:
>   VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   if (pmd_trans_huge(*pmdp)) {
>   pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
>   } else {

The only problematic path that needs IPI is the below one yes.

>   /*
>* khugepaged calls this for normal pmd
>*/
>   pmd = *pmdp;
>   pmd_clear(pmdp);
>   /*
>* Wait for all pending hash_page to finish. This is needed
>* in case of subpage collapse. When we collapse normal pages
>* to hugepage, we first clear the pmd, then invalidate all
>* the PTE entries. The assumption here is that any low level
>* page fault will see a none pmd and take the slow path that
>* will wait on mmap_sem. But we could very well be in a
>* hash_page with local ptep pointer value. Such a hash page
>* can result in adding new HPTE entries for normal subpages.
>* That means we could be modifying the page content as we
>* copy them to a huge page. So wait for parallel hash_page
>* to finish before invalidating HPTE entries. We can do this
>* by sending an IPI to all the cpus and executing a dummy
>* function there.
>*/
>   kick_all_cpus_sync();
>
> We already do an IPI for ppc64.

Agreed, ppc64 is already covered.

sparc/arm seem to be using the generic pmdp_clear_flush implementation
instead, which just calls flush_tlb_range, so perhaps they aren't.

As above, the IPIs are only needed if the *pmd is not transhuge.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-27 Thread Aneesh Kumar K.V
Andrea Arcangeli  writes:

> Hello,
>
> On Mon, Oct 27, 2014 at 07:50:41AM +1100, Benjamin Herrenschmidt wrote:
>> On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote:
>> 
>> > Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
>> > to all the CPUs on the inter-CPU bus.  The next instruction isn't
>> > executed until they respond.
>> > 
>> > But this is only for our CPU TLB.  There's no other external
>> > consequence, so removal from the page tables isn't effected by this TLB
>> > flush, therefore the theory on which Dave bases the change to
>> > atomic_add() should work for us (of course, atomic_add is lock add
>> > unlock on our CPU, so it's not going to be of much benefit).
>> 
>> I'm not sure I follow you here.
>> 
>> Do you or do you now perform an IPI to do TLB flushes ? If you don't
>> (for example because you have HW broadcast), then you need the
>> speculative get_page(). If you do (and can read a PTE atomically), you
>> can get away with atomic_add().
>> 
>> The reason is that if you remember how zap_pte_range works, we perform
>> the flush before we get rid of the page.
>> 
>> So if your using IPIs for the flush, the fact that gup_fast has
>> interrupts disabled will delay the IPI response and thus effectively
>> prevent the pages from being actually freed, allowing us to simply do
>> the atomic_add() on x86.
>> 
>> But if we don't use IPIs because we have HW broadcast of TLB
>> invalidations, then we don't have that synchronization. atomic_add won't
>> work, we need get_page_speculative() because the page could be
>> concurrently being freed.
>
> I looked at how this works more closely and I agree
> get_page_unless_zero is always necessary if the TLB flush doesn't
> always wait for IPIs to all CPUs where a gup_fast may be running onto.
>
> To summarize, the pagetables are freed with RCU (arch sets
> HAVE_RCU_TABLE_FREE) and that allows to walk them lockless with RCU.
>
> After we can walk the pagetables lockless with RCU, we get to the page
> lockless, but the pages themself can still be freed at any time from
> under us (hence the need for get_page_unless_zero).
>
> The additional trick gup_fast RCU does is to recheck the pte after
> elevating the page count with get_page_unless_zero. Rechecking the
> pte/hugepmd to be sure it didn't change from under us is critical to
> be sure get_page_unless_zero didn't run after the page was freed and
> reallocated which would otherwise lead to a security problem too
> (i.e. it protects against get_page_unless_zero false positives).
>
> The last bit required is to still disable irqs like on x86 to
> serialize against THP splits combined with pmdp_splitting_flush always
> delivering IPIs (pmdp_splitting_flush must wait all gup_fast to
> complete before proceeding in mangling the page struct of the compound
> page).
>
> Preventing the irq disable while taking a gup_fast pin using
> compound_lock isn't as "easy" as it is to do for put_page. put_page
> (non-compound) fastest path remains THP agnostic because
> collapse_huge_page is inhibited by any existing gup pin, but here
> we're exactly taking it, so we can't depend on it to already exist to
> avoid the race with collapse_huge_page. It's not just split_huge_page
> we need to protect against.
>
> So while thinking the above summary, I noticed this patch misses a IPI
> in mm/huge_memory.c that must be delivered after pmdp_clear_flush
> below to be safe against collapse_huge_page for the same reasons it
> sends it within pmdp_splitting_flush. Without this IPI what can happen
> is that the GUP pin protection in __collapse_huge_page_isolate races
> against gup_fast-RCU.
>
> If gup_fast reads the pte on one CPU before pmdp_clear_flush, and on
> the other CPU __collapse_huge_page_isolate succeeds, then gup_fast
> could recheck the pte that hasn't been zapped yet by
> __collapse_huge_page_copy. gup_fast would succeed because the pte
> wasn't zapped yet, but then __collapse_huge_page_copy would run
> replacing the pte with a transhuge pmd, making gup_fast return the old
> page, while the process got the copy as part of the collapsed hugepage.
>
>   /*
>* After this gup_fast can't run anymore. This also removes
>  ^ -> invariant broken by 
> gup_fast-RCU
>* any huge TLB entry from the CPU so we won't allow
>* huge and small TLB entries for the same virtual address
>* to avoid the risk of CPU bugs in that area.
>*/
>   _pmd = pmdp_clear_flush(vma, address, pmd);
>   spin_unlock(pmd_ptl);
>   mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>
>   spin_lock(pte_ptl);
>   isolated = __collapse_huge_page_isolate(vma, address, pte);
>   spin_unlock(pte_ptl);


That is the transition from pmd pointing to a PTE page to a hugepage
right ? On ppc64 we do the below. Though not for the same reason
mentioned above (we did that to handle the hash insertion case) that
should take 

Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-27 Thread Aneesh Kumar K.V
Andrea Arcangeli aarca...@redhat.com writes:

 Hello,

 On Mon, Oct 27, 2014 at 07:50:41AM +1100, Benjamin Herrenschmidt wrote:
 On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote:
 
  Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
  to all the CPUs on the inter-CPU bus.  The next instruction isn't
  executed until they respond.
  
  But this is only for our CPU TLB.  There's no other external
  consequence, so removal from the page tables isn't effected by this TLB
  flush, therefore the theory on which Dave bases the change to
  atomic_add() should work for us (of course, atomic_add is lock add
  unlock on our CPU, so it's not going to be of much benefit).
 
 I'm not sure I follow you here.
 
 Do you or do you now perform an IPI to do TLB flushes ? If you don't
 (for example because you have HW broadcast), then you need the
 speculative get_page(). If you do (and can read a PTE atomically), you
 can get away with atomic_add().
 
 The reason is that if you remember how zap_pte_range works, we perform
 the flush before we get rid of the page.
 
 So if your using IPIs for the flush, the fact that gup_fast has
 interrupts disabled will delay the IPI response and thus effectively
 prevent the pages from being actually freed, allowing us to simply do
 the atomic_add() on x86.
 
 But if we don't use IPIs because we have HW broadcast of TLB
 invalidations, then we don't have that synchronization. atomic_add won't
 work, we need get_page_speculative() because the page could be
 concurrently being freed.

 I looked at how this works more closely and I agree
 get_page_unless_zero is always necessary if the TLB flush doesn't
 always wait for IPIs to all CPUs where a gup_fast may be running onto.

 To summarize, the pagetables are freed with RCU (arch sets
 HAVE_RCU_TABLE_FREE) and that allows to walk them lockless with RCU.

 After we can walk the pagetables lockless with RCU, we get to the page
 lockless, but the pages themself can still be freed at any time from
 under us (hence the need for get_page_unless_zero).

 The additional trick gup_fast RCU does is to recheck the pte after
 elevating the page count with get_page_unless_zero. Rechecking the
 pte/hugepmd to be sure it didn't change from under us is critical to
 be sure get_page_unless_zero didn't run after the page was freed and
 reallocated which would otherwise lead to a security problem too
 (i.e. it protects against get_page_unless_zero false positives).

 The last bit required is to still disable irqs like on x86 to
 serialize against THP splits combined with pmdp_splitting_flush always
 delivering IPIs (pmdp_splitting_flush must wait all gup_fast to
 complete before proceeding in mangling the page struct of the compound
 page).

 Preventing the irq disable while taking a gup_fast pin using
 compound_lock isn't as easy as it is to do for put_page. put_page
 (non-compound) fastest path remains THP agnostic because
 collapse_huge_page is inhibited by any existing gup pin, but here
 we're exactly taking it, so we can't depend on it to already exist to
 avoid the race with collapse_huge_page. It's not just split_huge_page
 we need to protect against.

 So while thinking the above summary, I noticed this patch misses a IPI
 in mm/huge_memory.c that must be delivered after pmdp_clear_flush
 below to be safe against collapse_huge_page for the same reasons it
 sends it within pmdp_splitting_flush. Without this IPI what can happen
 is that the GUP pin protection in __collapse_huge_page_isolate races
 against gup_fast-RCU.

 If gup_fast reads the pte on one CPU before pmdp_clear_flush, and on
 the other CPU __collapse_huge_page_isolate succeeds, then gup_fast
 could recheck the pte that hasn't been zapped yet by
 __collapse_huge_page_copy. gup_fast would succeed because the pte
 wasn't zapped yet, but then __collapse_huge_page_copy would run
 replacing the pte with a transhuge pmd, making gup_fast return the old
 page, while the process got the copy as part of the collapsed hugepage.

   /*
* After this gup_fast can't run anymore. This also removes
  ^ - invariant broken by 
 gup_fast-RCU
* any huge TLB entry from the CPU so we won't allow
* huge and small TLB entries for the same virtual address
* to avoid the risk of CPU bugs in that area.
*/
   _pmd = pmdp_clear_flush(vma, address, pmd);
   spin_unlock(pmd_ptl);
   mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

   spin_lock(pte_ptl);
   isolated = __collapse_huge_page_isolate(vma, address, pte);
   spin_unlock(pte_ptl);


That is the transition from pmd pointing to a PTE page to a hugepage
right ? On ppc64 we do the below. Though not for the same reason
mentioned above (we did that to handle the hash insertion case) that
should take care of the gup case too right ?


pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 

Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-27 Thread Andrea Arcangeli
Hi Aneesh,

On Mon, Oct 27, 2014 at 11:28:41PM +0530, Aneesh Kumar K.V wrote:
   VM_BUG_ON(address  ~HPAGE_PMD_MASK);
   if (pmd_trans_huge(*pmdp)) {
   pmd = pmdp_get_and_clear(vma-vm_mm, address, pmdp);
   } else {

The only problematic path that needs IPI is the below one yes.

   /*
* khugepaged calls this for normal pmd
*/
   pmd = *pmdp;
   pmd_clear(pmdp);
   /*
* Wait for all pending hash_page to finish. This is needed
* in case of subpage collapse. When we collapse normal pages
* to hugepage, we first clear the pmd, then invalidate all
* the PTE entries. The assumption here is that any low level
* page fault will see a none pmd and take the slow path that
* will wait on mmap_sem. But we could very well be in a
* hash_page with local ptep pointer value. Such a hash page
* can result in adding new HPTE entries for normal subpages.
* That means we could be modifying the page content as we
* copy them to a huge page. So wait for parallel hash_page
* to finish before invalidating HPTE entries. We can do this
* by sending an IPI to all the cpus and executing a dummy
* function there.
*/
   kick_all_cpus_sync();

 We already do an IPI for ppc64.

Agreed, ppc64 is already covered.

sparc/arm seem to be using the generic pmdp_clear_flush implementation
instead, which just calls flush_tlb_range, so perhaps they aren't.

As above, the IPIs are only needed if the *pmd is not transhuge.

Thanks,
Andrea
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-26 Thread Andrea Arcangeli
Hello,

On Mon, Oct 27, 2014 at 07:50:41AM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote:
> 
> > Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
> > to all the CPUs on the inter-CPU bus.  The next instruction isn't
> > executed until they respond.
> > 
> > But this is only for our CPU TLB.  There's no other external
> > consequence, so removal from the page tables isn't effected by this TLB
> > flush, therefore the theory on which Dave bases the change to
> > atomic_add() should work for us (of course, atomic_add is lock add
> > unlock on our CPU, so it's not going to be of much benefit).
> 
> I'm not sure I follow you here.
> 
> Do you or do you now perform an IPI to do TLB flushes ? If you don't
> (for example because you have HW broadcast), then you need the
> speculative get_page(). If you do (and can read a PTE atomically), you
> can get away with atomic_add().
> 
> The reason is that if you remember how zap_pte_range works, we perform
> the flush before we get rid of the page.
> 
> So if your using IPIs for the flush, the fact that gup_fast has
> interrupts disabled will delay the IPI response and thus effectively
> prevent the pages from being actually freed, allowing us to simply do
> the atomic_add() on x86.
> 
> But if we don't use IPIs because we have HW broadcast of TLB
> invalidations, then we don't have that synchronization. atomic_add won't
> work, we need get_page_speculative() because the page could be
> concurrently being freed.

I looked at how this works more closely and I agree
get_page_unless_zero is always necessary if the TLB flush doesn't
always wait for IPIs to all CPUs where a gup_fast may be running onto.

To summarize, the pagetables are freed with RCU (arch sets
HAVE_RCU_TABLE_FREE) and that allows to walk them lockless with RCU.

After we can walk the pagetables lockless with RCU, we get to the page
lockless, but the pages themself can still be freed at any time from
under us (hence the need for get_page_unless_zero).

The additional trick gup_fast RCU does is to recheck the pte after
elevating the page count with get_page_unless_zero. Rechecking the
pte/hugepmd to be sure it didn't change from under us is critical to
be sure get_page_unless_zero didn't run after the page was freed and
reallocated which would otherwise lead to a security problem too
(i.e. it protects against get_page_unless_zero false positives).

The last bit required is to still disable irqs like on x86 to
serialize against THP splits combined with pmdp_splitting_flush always
delivering IPIs (pmdp_splitting_flush must wait all gup_fast to
complete before proceeding in mangling the page struct of the compound
page).

Preventing the irq disable while taking a gup_fast pin using
compound_lock isn't as "easy" as it is to do for put_page. put_page
(non-compound) fastest path remains THP agnostic because
collapse_huge_page is inhibited by any existing gup pin, but here
we're exactly taking it, so we can't depend on it to already exist to
avoid the race with collapse_huge_page. It's not just split_huge_page
we need to protect against.

So while thinking the above summary, I noticed this patch misses a IPI
in mm/huge_memory.c that must be delivered after pmdp_clear_flush
below to be safe against collapse_huge_page for the same reasons it
sends it within pmdp_splitting_flush. Without this IPI what can happen
is that the GUP pin protection in __collapse_huge_page_isolate races
against gup_fast-RCU.

If gup_fast reads the pte on one CPU before pmdp_clear_flush, and on
the other CPU __collapse_huge_page_isolate succeeds, then gup_fast
could recheck the pte that hasn't been zapped yet by
__collapse_huge_page_copy. gup_fast would succeed because the pte
wasn't zapped yet, but then __collapse_huge_page_copy would run
replacing the pte with a transhuge pmd, making gup_fast return the old
page, while the process got the copy as part of the collapsed hugepage.

/*
 * After this gup_fast can't run anymore. This also removes
   ^ -> invariant broken by 
gup_fast-RCU
 * any huge TLB entry from the CPU so we won't allow
 * huge and small TLB entries for the same virtual address
 * to avoid the risk of CPU bugs in that area.
 */
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
spin_unlock(pte_ptl);

CPU0CPU1
-   -
gup_fast-RCU
local_irq_disable()
pte = pte_offset_map(pmd, address)

pmdp_clear_flush (not sending IPI -> 
bug)

__collapse_huge_page_isolate -> succeeds


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-26 Thread Benjamin Herrenschmidt
On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote:

> Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
> to all the CPUs on the inter-CPU bus.  The next instruction isn't
> executed until they respond.
> 
> But this is only for our CPU TLB.  There's no other external
> consequence, so removal from the page tables isn't effected by this TLB
> flush, therefore the theory on which Dave bases the change to
> atomic_add() should work for us (of course, atomic_add is lock add
> unlock on our CPU, so it's not going to be of much benefit).

I'm not sure I follow you here.

Do you or do you now perform an IPI to do TLB flushes ? If you don't
(for example because you have HW broadcast), then you need the
speculative get_page(). If you do (and can read a PTE atomically), you
can get away with atomic_add().

The reason is that if you remember how zap_pte_range works, we perform
the flush before we get rid of the page.

So if your using IPIs for the flush, the fact that gup_fast has
interrupts disabled will delay the IPI response and thus effectively
prevent the pages from being actually freed, allowing us to simply do
the atomic_add() on x86.

But if we don't use IPIs because we have HW broadcast of TLB
invalidations, then we don't have that synchronization. atomic_add won't
work, we need get_page_speculative() because the page could be
concurrently being freed.

Cheers,
Ben.

> James
> 
> > Another option would be to make the generic code use something defined
> > by the arch to decide whether to use speculative get or
> > not. I like the idea of keeping the bulk of that code generic...
> > 
> > Cheers,
> > Ben.
> > 
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majord...@kvack.org.  For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: mailto:"d...@kvack.org;> em...@kvack.org 
> > 
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: mailto:"d...@kvack.org;> em...@kvack.org 
> > 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-26 Thread Benjamin Herrenschmidt
On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote:

 Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
 to all the CPUs on the inter-CPU bus.  The next instruction isn't
 executed until they respond.
 
 But this is only for our CPU TLB.  There's no other external
 consequence, so removal from the page tables isn't effected by this TLB
 flush, therefore the theory on which Dave bases the change to
 atomic_add() should work for us (of course, atomic_add is lock add
 unlock on our CPU, so it's not going to be of much benefit).

I'm not sure I follow you here.

Do you or do you now perform an IPI to do TLB flushes ? If you don't
(for example because you have HW broadcast), then you need the
speculative get_page(). If you do (and can read a PTE atomically), you
can get away with atomic_add().

The reason is that if you remember how zap_pte_range works, we perform
the flush before we get rid of the page.

So if your using IPIs for the flush, the fact that gup_fast has
interrupts disabled will delay the IPI response and thus effectively
prevent the pages from being actually freed, allowing us to simply do
the atomic_add() on x86.

But if we don't use IPIs because we have HW broadcast of TLB
invalidations, then we don't have that synchronization. atomic_add won't
work, we need get_page_speculative() because the page could be
concurrently being freed.

Cheers,
Ben.

 James
 
  Another option would be to make the generic code use something defined
  by the arch to decide whether to use speculative get or
  not. I like the idea of keeping the bulk of that code generic...
  
  Cheers,
  Ben.
  
   --
   To unsubscribe, send a message with 'unsubscribe linux-mm' in
   the body to majord...@kvack.org.  For more info on Linux MM,
   see: http://www.linux-mm.org/ .
   Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a
  
  
  --
  To unsubscribe, send a message with 'unsubscribe linux-mm' in
  the body to majord...@kvack.org.  For more info on Linux MM,
  see: http://www.linux-mm.org/ .
  Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a
  
 
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-26 Thread Andrea Arcangeli
Hello,

On Mon, Oct 27, 2014 at 07:50:41AM +1100, Benjamin Herrenschmidt wrote:
 On Fri, 2014-10-24 at 09:22 -0700, James Bottomley wrote:
 
  Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
  to all the CPUs on the inter-CPU bus.  The next instruction isn't
  executed until they respond.
  
  But this is only for our CPU TLB.  There's no other external
  consequence, so removal from the page tables isn't effected by this TLB
  flush, therefore the theory on which Dave bases the change to
  atomic_add() should work for us (of course, atomic_add is lock add
  unlock on our CPU, so it's not going to be of much benefit).
 
 I'm not sure I follow you here.
 
 Do you or do you now perform an IPI to do TLB flushes ? If you don't
 (for example because you have HW broadcast), then you need the
 speculative get_page(). If you do (and can read a PTE atomically), you
 can get away with atomic_add().
 
 The reason is that if you remember how zap_pte_range works, we perform
 the flush before we get rid of the page.
 
 So if your using IPIs for the flush, the fact that gup_fast has
 interrupts disabled will delay the IPI response and thus effectively
 prevent the pages from being actually freed, allowing us to simply do
 the atomic_add() on x86.
 
 But if we don't use IPIs because we have HW broadcast of TLB
 invalidations, then we don't have that synchronization. atomic_add won't
 work, we need get_page_speculative() because the page could be
 concurrently being freed.

I looked at how this works more closely and I agree
get_page_unless_zero is always necessary if the TLB flush doesn't
always wait for IPIs to all CPUs where a gup_fast may be running onto.

To summarize, the pagetables are freed with RCU (arch sets
HAVE_RCU_TABLE_FREE) and that allows to walk them lockless with RCU.

After we can walk the pagetables lockless with RCU, we get to the page
lockless, but the pages themself can still be freed at any time from
under us (hence the need for get_page_unless_zero).

The additional trick gup_fast RCU does is to recheck the pte after
elevating the page count with get_page_unless_zero. Rechecking the
pte/hugepmd to be sure it didn't change from under us is critical to
be sure get_page_unless_zero didn't run after the page was freed and
reallocated which would otherwise lead to a security problem too
(i.e. it protects against get_page_unless_zero false positives).

The last bit required is to still disable irqs like on x86 to
serialize against THP splits combined with pmdp_splitting_flush always
delivering IPIs (pmdp_splitting_flush must wait all gup_fast to
complete before proceeding in mangling the page struct of the compound
page).

Preventing the irq disable while taking a gup_fast pin using
compound_lock isn't as easy as it is to do for put_page. put_page
(non-compound) fastest path remains THP agnostic because
collapse_huge_page is inhibited by any existing gup pin, but here
we're exactly taking it, so we can't depend on it to already exist to
avoid the race with collapse_huge_page. It's not just split_huge_page
we need to protect against.

So while thinking the above summary, I noticed this patch misses a IPI
in mm/huge_memory.c that must be delivered after pmdp_clear_flush
below to be safe against collapse_huge_page for the same reasons it
sends it within pmdp_splitting_flush. Without this IPI what can happen
is that the GUP pin protection in __collapse_huge_page_isolate races
against gup_fast-RCU.

If gup_fast reads the pte on one CPU before pmdp_clear_flush, and on
the other CPU __collapse_huge_page_isolate succeeds, then gup_fast
could recheck the pte that hasn't been zapped yet by
__collapse_huge_page_copy. gup_fast would succeed because the pte
wasn't zapped yet, but then __collapse_huge_page_copy would run
replacing the pte with a transhuge pmd, making gup_fast return the old
page, while the process got the copy as part of the collapsed hugepage.

/*
 * After this gup_fast can't run anymore. This also removes
   ^ - invariant broken by 
gup_fast-RCU
 * any huge TLB entry from the CPU so we won't allow
 * huge and small TLB entries for the same virtual address
 * to avoid the risk of CPU bugs in that area.
 */
_pmd = pmdp_clear_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
spin_unlock(pte_ptl);

CPU0CPU1
-   -
gup_fast-RCU
local_irq_disable()
pte = pte_offset_map(pmd, address)

pmdp_clear_flush (not sending IPI - 
bug)

__collapse_huge_page_isolate - succeeds

(page_count != 1 gup-pin check of
   

Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-25 Thread Aneesh Kumar K.V
David Miller  writes:

> Hey guys, was looking over the generic GUP while working on a sparc64
> issue and I noticed that you guys do speculative page gets, and after
> talking with Johannes Weiner (CC:'d) about this we don't see how it
> could be necessary.
>
> If interrupts are disabled during the page table scan (which they
> are), no IPI tlb flushes can arrive.  Therefore any removal from the
> page tables is guarded by interrupts being re-enabled.  And as a
> result, page counts of pages we see in the page tables must always
> have a count > 0.
>
> x86 does direct atomic_add() on >_count because of this
> invariant and I would rather see the generic version do this too.


But that won't work with RCU GUP. For example on powerpc the tlb flush
doesn't involve an IPI and we can essentially find page count 0.

-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-25 Thread Aneesh Kumar K.V
David Miller da...@davemloft.net writes:

 Hey guys, was looking over the generic GUP while working on a sparc64
 issue and I noticed that you guys do speculative page gets, and after
 talking with Johannes Weiner (CC:'d) about this we don't see how it
 could be necessary.

 If interrupts are disabled during the page table scan (which they
 are), no IPI tlb flushes can arrive.  Therefore any removal from the
 page tables is guarded by interrupts being re-enabled.  And as a
 result, page counts of pages we see in the page tables must always
 have a count  0.

 x86 does direct atomic_add() on page-_count because of this
 invariant and I would rather see the generic version do this too.


But that won't work with RCU GUP. For example on powerpc the tlb flush
doesn't involve an IPI and we can essentially find page count 0.

-aneesh

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-24 Thread James Bottomley
On Fri, 2014-10-24 at 10:40 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2014-10-23 at 18:40 -0400, David Miller wrote:
> > Hey guys, was looking over the generic GUP while working on a sparc64
> > issue and I noticed that you guys do speculative page gets, and after
> > talking with Johannes Weiner (CC:'d) about this we don't see how it
> > could be necessary.
> > 
> > If interrupts are disabled during the page table scan (which they
> > are), no IPI tlb flushes can arrive.  Therefore any removal from the
> > page tables is guarded by interrupts being re-enabled.  And as a
> > result, page counts of pages we see in the page tables must always
> > have a count > 0.
> > 
> > x86 does direct atomic_add() on >_count because of this
> > invariant and I would rather see the generic version do this too.
> 
> This is of course only true of archs who use IPIs for TLB flushes, so if
> we are going down the path of not being speculative, powerpc would have
> to go back to doing its own since our broadcast TLB flush means we
> aren't protected (we are only protected vs. the page tables themselves
> being freed since we do that via sched RCU).
> 
> AFAIK, ARM also broadcasts TLB flushes...

Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
to all the CPUs on the inter-CPU bus.  The next instruction isn't
executed until they respond.

But this is only for our CPU TLB.  There's no other external
consequence, so removal from the page tables isn't effected by this TLB
flush, therefore the theory on which Dave bases the change to
atomic_add() should work for us (of course, atomic_add is lock add
unlock on our CPU, so it's not going to be of much benefit).

James

> Another option would be to make the generic code use something defined
> by the arch to decide whether to use speculative get or
> not. I like the idea of keeping the bulk of that code generic...
> 
> Cheers,
> Ben.
> 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: mailto:"d...@kvack.org;> em...@kvack.org 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org;> em...@kvack.org 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-24 Thread Steve Capper
On 24 October 2014 00:40, Benjamin Herrenschmidt
 wrote:
> On Thu, 2014-10-23 at 18:40 -0400, David Miller wrote:
>> Hey guys, was looking over the generic GUP while working on a sparc64
>> issue and I noticed that you guys do speculative page gets, and after
>> talking with Johannes Weiner (CC:'d) about this we don't see how it
>> could be necessary.
>>
>> If interrupts are disabled during the page table scan (which they
>> are), no IPI tlb flushes can arrive.  Therefore any removal from the
>> page tables is guarded by interrupts being re-enabled.  And as a
>> result, page counts of pages we see in the page tables must always
>> have a count > 0.
>>
>> x86 does direct atomic_add() on >_count because of this
>> invariant and I would rather see the generic version do this too.
>
> This is of course only true of archs who use IPIs for TLB flushes, so if
> we are going down the path of not being speculative, powerpc would have
> to go back to doing its own since our broadcast TLB flush means we
> aren't protected (we are only protected vs. the page tables themselves
> being freed since we do that via sched RCU).
>
> AFAIK, ARM also broadcasts TLB flushes...

Indeed, for most ARM cores we have hardware TLB broadcasts, thus we
need the speculative path.

>
> Another option would be to make the generic code use something defined
> by the arch to decide whether to use speculative get or
> not. I like the idea of keeping the bulk of that code generic...

It would be nice to have the code generalised further.
In addition to the speculative/atomic helpers the implementation would
need to be renamed from GENERIC_RCU_GUP to GENERIC_GUP.
The other noteworthy assumption made in the RCU GUP is that pte's can
be read atomically. For x86 this isn't true when running with 64-bit
pte's, thus a helper would be needed.

Cheers,
-- 
Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-24 Thread Steve Capper
On 24 October 2014 00:40, Benjamin Herrenschmidt
b...@kernel.crashing.org wrote:
 On Thu, 2014-10-23 at 18:40 -0400, David Miller wrote:
 Hey guys, was looking over the generic GUP while working on a sparc64
 issue and I noticed that you guys do speculative page gets, and after
 talking with Johannes Weiner (CC:'d) about this we don't see how it
 could be necessary.

 If interrupts are disabled during the page table scan (which they
 are), no IPI tlb flushes can arrive.  Therefore any removal from the
 page tables is guarded by interrupts being re-enabled.  And as a
 result, page counts of pages we see in the page tables must always
 have a count  0.

 x86 does direct atomic_add() on page-_count because of this
 invariant and I would rather see the generic version do this too.

 This is of course only true of archs who use IPIs for TLB flushes, so if
 we are going down the path of not being speculative, powerpc would have
 to go back to doing its own since our broadcast TLB flush means we
 aren't protected (we are only protected vs. the page tables themselves
 being freed since we do that via sched RCU).

 AFAIK, ARM also broadcasts TLB flushes...

Indeed, for most ARM cores we have hardware TLB broadcasts, thus we
need the speculative path.


 Another option would be to make the generic code use something defined
 by the arch to decide whether to use speculative get or
 not. I like the idea of keeping the bulk of that code generic...

It would be nice to have the code generalised further.
In addition to the speculative/atomic helpers the implementation would
need to be renamed from GENERIC_RCU_GUP to GENERIC_GUP.
The other noteworthy assumption made in the RCU GUP is that pte's can
be read atomically. For x86 this isn't true when running with 64-bit
pte's, thus a helper would be needed.

Cheers,
-- 
Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-24 Thread James Bottomley
On Fri, 2014-10-24 at 10:40 +1100, Benjamin Herrenschmidt wrote:
 On Thu, 2014-10-23 at 18:40 -0400, David Miller wrote:
  Hey guys, was looking over the generic GUP while working on a sparc64
  issue and I noticed that you guys do speculative page gets, and after
  talking with Johannes Weiner (CC:'d) about this we don't see how it
  could be necessary.
  
  If interrupts are disabled during the page table scan (which they
  are), no IPI tlb flushes can arrive.  Therefore any removal from the
  page tables is guarded by interrupts being re-enabled.  And as a
  result, page counts of pages we see in the page tables must always
  have a count  0.
  
  x86 does direct atomic_add() on page-_count because of this
  invariant and I would rather see the generic version do this too.
 
 This is of course only true of archs who use IPIs for TLB flushes, so if
 we are going down the path of not being speculative, powerpc would have
 to go back to doing its own since our broadcast TLB flush means we
 aren't protected (we are only protected vs. the page tables themselves
 being freed since we do that via sched RCU).
 
 AFAIK, ARM also broadcasts TLB flushes...

Parisc does this.  As soon as one CPU issues a TLB purge, it's broadcast
to all the CPUs on the inter-CPU bus.  The next instruction isn't
executed until they respond.

But this is only for our CPU TLB.  There's no other external
consequence, so removal from the page tables isn't effected by this TLB
flush, therefore the theory on which Dave bases the change to
atomic_add() should work for us (of course, atomic_add is lock add
unlock on our CPU, so it's not going to be of much benefit).

James

 Another option would be to make the generic code use something defined
 by the arch to decide whether to use speculative get or
 not. I like the idea of keeping the bulk of that code generic...
 
 Cheers,
 Ben.
 
  --
  To unsubscribe, send a message with 'unsubscribe linux-mm' in
  the body to majord...@kvack.org.  For more info on Linux MM,
  see: http://www.linux-mm.org/ .
  Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a
 
 
 --
 To unsubscribe, send a message with 'unsubscribe linux-mm' in
 the body to majord...@kvack.org.  For more info on Linux MM,
 see: http://www.linux-mm.org/ .
 Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Fri, 24 Oct 2014 10:40:35 +1100

> Another option would be to make the generic code use something defined
> by the arch to decide whether to use speculative get or
> not. I like the idea of keeping the bulk of that code generic...

Me too.  We could have inlines that do either speculative or
non-speculative gets on the pages in some header file and hide
the ifdefs in there.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread Benjamin Herrenschmidt
On Thu, 2014-10-23 at 18:40 -0400, David Miller wrote:
> Hey guys, was looking over the generic GUP while working on a sparc64
> issue and I noticed that you guys do speculative page gets, and after
> talking with Johannes Weiner (CC:'d) about this we don't see how it
> could be necessary.
> 
> If interrupts are disabled during the page table scan (which they
> are), no IPI tlb flushes can arrive.  Therefore any removal from the
> page tables is guarded by interrupts being re-enabled.  And as a
> result, page counts of pages we see in the page tables must always
> have a count > 0.
> 
> x86 does direct atomic_add() on >_count because of this
> invariant and I would rather see the generic version do this too.

This is of course only true of archs who use IPIs for TLB flushes, so if
we are going down the path of not being speculative, powerpc would have
to go back to doing its own since our broadcast TLB flush means we
aren't protected (we are only protected vs. the page tables themselves
being freed since we do that via sched RCU).

AFAIK, ARM also broadcasts TLB flushes...

Another option would be to make the generic code use something defined
by the arch to decide whether to use speculative get or
not. I like the idea of keeping the bulk of that code generic...

Cheers,
Ben.

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org;> em...@kvack.org 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread David Miller

Hey guys, was looking over the generic GUP while working on a sparc64
issue and I noticed that you guys do speculative page gets, and after
talking with Johannes Weiner (CC:'d) about this we don't see how it
could be necessary.

If interrupts are disabled during the page table scan (which they
are), no IPI tlb flushes can arrive.  Therefore any removal from the
page tables is guarded by interrupts being re-enabled.  And as a
result, page counts of pages we see in the page tables must always
have a count > 0.

x86 does direct atomic_add() on >_count because of this
invariant and I would rather see the generic version do this too.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread Aneesh Kumar K.V
Andrew Morton  writes:

> On Fri, 17 Oct 2014 10:08:06 +0530 "Aneesh Kumar K.V" 
>  wrote:
>
>> Update generic gup implementation with powerpc specific details.
>> On powerpc at pmd level we can have hugepte, normal pmd pointer
>> or a pointer to the hugepage directory.
>> 
>> ...
>>
>> --- a/arch/arm/include/asm/pgtable.h
>> +++ b/arch/arm/include/asm/pgtable.h
>> @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>>  /* to find an entry in a kernel page-table-directory */
>>  #define pgd_offset_k(addr)  pgd_offset(_mm, addr)
>>  
>> +#define pgd_huge(pgd)   (0)
>> +
>>  #define pmd_none(pmd)   (!pmd_val(pmd))
>>  #define pmd_present(pmd)(pmd_val(pmd))
>>  
>> diff --git a/arch/arm64/include/asm/pgtable.h 
>> b/arch/arm64/include/asm/pgtable.h
>> index cefd3e825612..ed8f42497ac4 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
>> newprot)
>>  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>>  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
>>  
>> +#define pgd_huge(pgd)   (0)
>> +
>
> So only arm, arm64 and powerpc implement CONFIG_HAVE_GENERIC_RCU_GUP
> and only powerpc impements pgd_huge().
>
> Could we get a bit of documentation in place for pgd_huge() so that
> people who aren't familiar with powerpc can understand what's going
> on?


I ended up moving that to include/linux/hugetlb.h with the below
comments added. Let me know if the below is ok 

/*
 * hugepages at page global directory. If arch support
 * hugepages at pgd level, they need to define this.
 */
#ifndef pgd_huge
#define pgd_huge(x) 0
#endif

#ifndef is_hugepd
/*
 * Some architectures requires a hugepage directory format that is
 * required to support multiple hugepage sizes. For example
 * a4fe3ce7699bfe1bd88f816b55d42d8fe1dac655 introduced the same
 * on powerpc. This allows for a more flexible hugepage pagetable
 * layout.
 */
typedef struct { unsigned long pd; } hugepd_t;
#define is_hugepd(hugepd) (0)
#define __hugepd(x) ((hugepd_t) { (x) })
static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
  unsigned pdshift, unsigned long end,
  int write, struct page **pages, int *nr)
{
return 0;
}
#else
extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
   unsigned pdshift, unsigned long end,
   int write, struct page **pages, int *nr);
#endif


-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread Aneesh Kumar K.V
Andrew Morton a...@linux-foundation.org writes:

 On Fri, 17 Oct 2014 10:08:06 +0530 Aneesh Kumar K.V 
 aneesh.ku...@linux.vnet.ibm.com wrote:

 Update generic gup implementation with powerpc specific details.
 On powerpc at pmd level we can have hugepte, normal pmd pointer
 or a pointer to the hugepage directory.
 
 ...

 --- a/arch/arm/include/asm/pgtable.h
 +++ b/arch/arm/include/asm/pgtable.h
 @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  /* to find an entry in a kernel page-table-directory */
  #define pgd_offset_k(addr)  pgd_offset(init_mm, addr)
  
 +#define pgd_huge(pgd)   (0)
 +
  #define pmd_none(pmd)   (!pmd_val(pmd))
  #define pmd_present(pmd)(pmd_val(pmd))
  
 diff --git a/arch/arm64/include/asm/pgtable.h 
 b/arch/arm64/include/asm/pgtable.h
 index cefd3e825612..ed8f42497ac4 100644
 --- a/arch/arm64/include/asm/pgtable.h
 +++ b/arch/arm64/include/asm/pgtable.h
 @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
 newprot)
  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
  
 +#define pgd_huge(pgd)   (0)
 +

 So only arm, arm64 and powerpc implement CONFIG_HAVE_GENERIC_RCU_GUP
 and only powerpc impements pgd_huge().

 Could we get a bit of documentation in place for pgd_huge() so that
 people who aren't familiar with powerpc can understand what's going
 on?


I ended up moving that to include/linux/hugetlb.h with the below
comments added. Let me know if the below is ok 

/*
 * hugepages at page global directory. If arch support
 * hugepages at pgd level, they need to define this.
 */
#ifndef pgd_huge
#define pgd_huge(x) 0
#endif

#ifndef is_hugepd
/*
 * Some architectures requires a hugepage directory format that is
 * required to support multiple hugepage sizes. For example
 * a4fe3ce7699bfe1bd88f816b55d42d8fe1dac655 introduced the same
 * on powerpc. This allows for a more flexible hugepage pagetable
 * layout.
 */
typedef struct { unsigned long pd; } hugepd_t;
#define is_hugepd(hugepd) (0)
#define __hugepd(x) ((hugepd_t) { (x) })
static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
  unsigned pdshift, unsigned long end,
  int write, struct page **pages, int *nr)
{
return 0;
}
#else
extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
   unsigned pdshift, unsigned long end,
   int write, struct page **pages, int *nr);
#endif


-aneesh

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread David Miller

Hey guys, was looking over the generic GUP while working on a sparc64
issue and I noticed that you guys do speculative page gets, and after
talking with Johannes Weiner (CC:'d) about this we don't see how it
could be necessary.

If interrupts are disabled during the page table scan (which they
are), no IPI tlb flushes can arrive.  Therefore any removal from the
page tables is guarded by interrupts being re-enabled.  And as a
result, page counts of pages we see in the page tables must always
have a count  0.

x86 does direct atomic_add() on page-_count because of this
invariant and I would rather see the generic version do this too.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread Benjamin Herrenschmidt
On Thu, 2014-10-23 at 18:40 -0400, David Miller wrote:
 Hey guys, was looking over the generic GUP while working on a sparc64
 issue and I noticed that you guys do speculative page gets, and after
 talking with Johannes Weiner (CC:'d) about this we don't see how it
 could be necessary.
 
 If interrupts are disabled during the page table scan (which they
 are), no IPI tlb flushes can arrive.  Therefore any removal from the
 page tables is guarded by interrupts being re-enabled.  And as a
 result, page counts of pages we see in the page tables must always
 have a count  0.
 
 x86 does direct atomic_add() on page-_count because of this
 invariant and I would rather see the generic version do this too.

This is of course only true of archs who use IPIs for TLB flushes, so if
we are going down the path of not being speculative, powerpc would have
to go back to doing its own since our broadcast TLB flush means we
aren't protected (we are only protected vs. the page tables themselves
being freed since we do that via sched RCU).

AFAIK, ARM also broadcasts TLB flushes...

Another option would be to make the generic code use something defined
by the arch to decide whether to use speculative get or
not. I like the idea of keeping the bulk of that code generic...

Cheers,
Ben.

 --
 To unsubscribe, send a message with 'unsubscribe linux-mm' in
 the body to majord...@kvack.org.  For more info on Linux MM,
 see: http://www.linux-mm.org/ .
 Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-23 Thread David Miller
From: Benjamin Herrenschmidt b...@kernel.crashing.org
Date: Fri, 24 Oct 2014 10:40:35 +1100

 Another option would be to make the generic code use something defined
 by the arch to decide whether to use speculative get or
 not. I like the idea of keeping the bulk of that code generic...

Me too.  We could have inlines that do either speculative or
non-speculative gets on the pages in some header file and hide
the ifdefs in there.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-22 Thread Aneesh Kumar K.V
Andrew Morton  writes:

> On Fri, 17 Oct 2014 10:08:06 +0530 "Aneesh Kumar K.V" 
>  wrote:
>
>> Update generic gup implementation with powerpc specific details.
>> On powerpc at pmd level we can have hugepte, normal pmd pointer
>> or a pointer to the hugepage directory.
>> 
>> ...
>>
>> --- a/arch/arm/include/asm/pgtable.h
>> +++ b/arch/arm/include/asm/pgtable.h
>> @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>>  /* to find an entry in a kernel page-table-directory */
>>  #define pgd_offset_k(addr)  pgd_offset(_mm, addr)
>>  
>> +#define pgd_huge(pgd)   (0)
>> +
>>  #define pmd_none(pmd)   (!pmd_val(pmd))
>>  #define pmd_present(pmd)(pmd_val(pmd))
>>  
>> diff --git a/arch/arm64/include/asm/pgtable.h 
>> b/arch/arm64/include/asm/pgtable.h
>> index cefd3e825612..ed8f42497ac4 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
>> newprot)
>>  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>>  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
>>  
>> +#define pgd_huge(pgd)   (0)
>> +
>
> So only arm, arm64 and powerpc implement CONFIG_HAVE_GENERIC_RCU_GUP
> and only powerpc impements pgd_huge().
>
> Could we get a bit of documentation in place for pgd_huge() so that
> people who aren't familiar with powerpc can understand what's going
> on?

Will update

>
>>  /*
>>   * Encode and decode a swap entry:
>>   *  bits 0-1:   present (must be zero)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 02d11ee7f19d..f97732412cb4 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1219,6 +1219,32 @@ long get_user_pages(struct task_struct *tsk, struct 
>> mm_struct *mm,
>>  struct vm_area_struct **vmas);
>>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>  struct page **pages);
>> +
>> +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
>> +#ifndef is_hugepd
>
> And is_hugepd is a bit of a mystery.  Let's get some description in
> place for this as well?  Why it exists, what its role is.  Also,
> specifically which arch header file is responsible for defining it.
>
> It takes a hugepd_t argument, but hugepd_t is defined later in this
> header file.  This is weird because any preceding implementation of
> is_hugepd() can't actually be implemented because it hasn't seen the
> hugepd_t definition yet!  So any is_hugepd() implementation is forced
> to be a simple macro which punts to a C function which *has* seen the
> hugepd_t definition.  What a twisty maze.

arch can definitely do

#defne is_hugepd is_hugepd
typedef struct { unsigned long pd; } hugepd_t;
static inline int is_hugepd(hugepd_t hpd)
{

}

I wanted to make sure arch can have their own definition of hugepd_t . 

>
> It all seems messy, confusing and poorly documented.  Can we clean this
> up?
>
>> +/*
>> + * Some architectures support hugepage directory format that is
>> + * required to support different hugetlbfs sizes.
>> + */
>> +typedef struct { unsigned long pd; } hugepd_t;
>> +#define is_hugepd(hugepd) (0)
>> +#define __hugepd(x) ((hugepd_t) { (x) })
>
> What's this.

macro that is used to convert value to type hugepd_t. We use that style
already for __pte() etc.

>
>> +static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
>> + unsigned pdshift, unsigned long end,
>> + int write, struct page **pages, int *nr)
>> +{
>> +return 0;
>> +}
>> +#else
>> +extern int gup_hugepd(hugepd_t hugepd, unsigned long addr,
>> +  unsigned pdshift, unsigned long end,
>> +  int write, struct page **pages, int *nr);
>> +#endif
>> +extern int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
>> +unsigned long sz, unsigned long end, int write,
>> +struct page **pages, int *nr);
>> +#endif

-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-22 Thread Andrew Morton
On Fri, 17 Oct 2014 10:08:06 +0530 "Aneesh Kumar K.V" 
 wrote:

> Update generic gup implementation with powerpc specific details.
> On powerpc at pmd level we can have hugepte, normal pmd pointer
> or a pointer to the hugepage directory.
> 
> ...
>
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>  /* to find an entry in a kernel page-table-directory */
>  #define pgd_offset_k(addr)   pgd_offset(_mm, addr)
>  
> +#define pgd_huge(pgd)(0)
> +
>  #define pmd_none(pmd)(!pmd_val(pmd))
>  #define pmd_present(pmd) (pmd_val(pmd))
>  
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index cefd3e825612..ed8f42497ac4 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
> newprot)
>  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
>  
> +#define pgd_huge(pgd)(0)
> +

So only arm, arm64 and powerpc implement CONFIG_HAVE_GENERIC_RCU_GUP
and only powerpc impements pgd_huge().

Could we get a bit of documentation in place for pgd_huge() so that
people who aren't familiar with powerpc can understand what's going on?

>  /*
>   * Encode and decode a swap entry:
>   *   bits 0-1:   present (must be zero)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 02d11ee7f19d..f97732412cb4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1219,6 +1219,32 @@ long get_user_pages(struct task_struct *tsk, struct 
> mm_struct *mm,
>   struct vm_area_struct **vmas);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   struct page **pages);
> +
> +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
> +#ifndef is_hugepd

And is_hugepd is a bit of a mystery.  Let's get some description in
place for this as well?  Why it exists, what its role is.  Also,
specifically which arch header file is responsible for defining it.

It takes a hugepd_t argument, but hugepd_t is defined later in this
header file.  This is weird because any preceding implementation of
is_hugepd() can't actually be implemented because it hasn't seen the
hugepd_t definition yet!  So any is_hugepd() implementation is forced
to be a simple macro which punts to a C function which *has* seen the
hugepd_t definition.  What a twisty maze.

It all seems messy, confusing and poorly documented.  Can we clean this
up?

> +/*
> + * Some architectures support hugepage directory format that is
> + * required to support different hugetlbfs sizes.
> + */
> +typedef struct { unsigned long pd; } hugepd_t;
> +#define is_hugepd(hugepd) (0)
> +#define __hugepd(x) ((hugepd_t) { (x) })

What's this.

> +static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
> +  unsigned pdshift, unsigned long end,
> +  int write, struct page **pages, int *nr)
> +{
> + return 0;
> +}
> +#else
> +extern int gup_hugepd(hugepd_t hugepd, unsigned long addr,
> +   unsigned pdshift, unsigned long end,
> +   int write, struct page **pages, int *nr);
> +#endif
> +extern int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
> + unsigned long sz, unsigned long end, int write,
> + struct page **pages, int *nr);
> +#endif
> +
>
> ...
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-22 Thread Andrew Morton
On Fri, 17 Oct 2014 10:08:06 +0530 Aneesh Kumar K.V 
aneesh.ku...@linux.vnet.ibm.com wrote:

 Update generic gup implementation with powerpc specific details.
 On powerpc at pmd level we can have hugepte, normal pmd pointer
 or a pointer to the hugepage directory.
 
 ...

 --- a/arch/arm/include/asm/pgtable.h
 +++ b/arch/arm/include/asm/pgtable.h
 @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  /* to find an entry in a kernel page-table-directory */
  #define pgd_offset_k(addr)   pgd_offset(init_mm, addr)
  
 +#define pgd_huge(pgd)(0)
 +
  #define pmd_none(pmd)(!pmd_val(pmd))
  #define pmd_present(pmd) (pmd_val(pmd))
  
 diff --git a/arch/arm64/include/asm/pgtable.h 
 b/arch/arm64/include/asm/pgtable.h
 index cefd3e825612..ed8f42497ac4 100644
 --- a/arch/arm64/include/asm/pgtable.h
 +++ b/arch/arm64/include/asm/pgtable.h
 @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
 newprot)
  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
  
 +#define pgd_huge(pgd)(0)
 +

So only arm, arm64 and powerpc implement CONFIG_HAVE_GENERIC_RCU_GUP
and only powerpc impements pgd_huge().

Could we get a bit of documentation in place for pgd_huge() so that
people who aren't familiar with powerpc can understand what's going on?

  /*
   * Encode and decode a swap entry:
   *   bits 0-1:   present (must be zero)
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index 02d11ee7f19d..f97732412cb4 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1219,6 +1219,32 @@ long get_user_pages(struct task_struct *tsk, struct 
 mm_struct *mm,
   struct vm_area_struct **vmas);
  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
   struct page **pages);
 +
 +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
 +#ifndef is_hugepd

And is_hugepd is a bit of a mystery.  Let's get some description in
place for this as well?  Why it exists, what its role is.  Also,
specifically which arch header file is responsible for defining it.

It takes a hugepd_t argument, but hugepd_t is defined later in this
header file.  This is weird because any preceding implementation of
is_hugepd() can't actually be implemented because it hasn't seen the
hugepd_t definition yet!  So any is_hugepd() implementation is forced
to be a simple macro which punts to a C function which *has* seen the
hugepd_t definition.  What a twisty maze.

It all seems messy, confusing and poorly documented.  Can we clean this
up?

 +/*
 + * Some architectures support hugepage directory format that is
 + * required to support different hugetlbfs sizes.
 + */
 +typedef struct { unsigned long pd; } hugepd_t;
 +#define is_hugepd(hugepd) (0)
 +#define __hugepd(x) ((hugepd_t) { (x) })

What's this.

 +static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
 +  unsigned pdshift, unsigned long end,
 +  int write, struct page **pages, int *nr)
 +{
 + return 0;
 +}
 +#else
 +extern int gup_hugepd(hugepd_t hugepd, unsigned long addr,
 +   unsigned pdshift, unsigned long end,
 +   int write, struct page **pages, int *nr);
 +#endif
 +extern int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
 + unsigned long sz, unsigned long end, int write,
 + struct page **pages, int *nr);
 +#endif
 +

 ...

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-22 Thread Aneesh Kumar K.V
Andrew Morton a...@linux-foundation.org writes:

 On Fri, 17 Oct 2014 10:08:06 +0530 Aneesh Kumar K.V 
 aneesh.ku...@linux.vnet.ibm.com wrote:

 Update generic gup implementation with powerpc specific details.
 On powerpc at pmd level we can have hugepte, normal pmd pointer
 or a pointer to the hugepage directory.
 
 ...

 --- a/arch/arm/include/asm/pgtable.h
 +++ b/arch/arm/include/asm/pgtable.h
 @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  /* to find an entry in a kernel page-table-directory */
  #define pgd_offset_k(addr)  pgd_offset(init_mm, addr)
  
 +#define pgd_huge(pgd)   (0)
 +
  #define pmd_none(pmd)   (!pmd_val(pmd))
  #define pmd_present(pmd)(pmd_val(pmd))
  
 diff --git a/arch/arm64/include/asm/pgtable.h 
 b/arch/arm64/include/asm/pgtable.h
 index cefd3e825612..ed8f42497ac4 100644
 --- a/arch/arm64/include/asm/pgtable.h
 +++ b/arch/arm64/include/asm/pgtable.h
 @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
 newprot)
  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
  
 +#define pgd_huge(pgd)   (0)
 +

 So only arm, arm64 and powerpc implement CONFIG_HAVE_GENERIC_RCU_GUP
 and only powerpc impements pgd_huge().

 Could we get a bit of documentation in place for pgd_huge() so that
 people who aren't familiar with powerpc can understand what's going
 on?

Will update


  /*
   * Encode and decode a swap entry:
   *  bits 0-1:   present (must be zero)
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index 02d11ee7f19d..f97732412cb4 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1219,6 +1219,32 @@ long get_user_pages(struct task_struct *tsk, struct 
 mm_struct *mm,
  struct vm_area_struct **vmas);
  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages);
 +
 +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
 +#ifndef is_hugepd

 And is_hugepd is a bit of a mystery.  Let's get some description in
 place for this as well?  Why it exists, what its role is.  Also,
 specifically which arch header file is responsible for defining it.

 It takes a hugepd_t argument, but hugepd_t is defined later in this
 header file.  This is weird because any preceding implementation of
 is_hugepd() can't actually be implemented because it hasn't seen the
 hugepd_t definition yet!  So any is_hugepd() implementation is forced
 to be a simple macro which punts to a C function which *has* seen the
 hugepd_t definition.  What a twisty maze.

arch can definitely do

#defne is_hugepd is_hugepd
typedef struct { unsigned long pd; } hugepd_t;
static inline int is_hugepd(hugepd_t hpd)
{

}

I wanted to make sure arch can have their own definition of hugepd_t . 


 It all seems messy, confusing and poorly documented.  Can we clean this
 up?

 +/*
 + * Some architectures support hugepage directory format that is
 + * required to support different hugetlbfs sizes.
 + */
 +typedef struct { unsigned long pd; } hugepd_t;
 +#define is_hugepd(hugepd) (0)
 +#define __hugepd(x) ((hugepd_t) { (x) })

 What's this.

macro that is used to convert value to type hugepd_t. We use that style
already for __pte() etc.


 +static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
 + unsigned pdshift, unsigned long end,
 + int write, struct page **pages, int *nr)
 +{
 +return 0;
 +}
 +#else
 +extern int gup_hugepd(hugepd_t hugepd, unsigned long addr,
 +  unsigned pdshift, unsigned long end,
 +  int write, struct page **pages, int *nr);
 +#endif
 +extern int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
 +unsigned long sz, unsigned long end, int write,
 +struct page **pages, int *nr);
 +#endif

-aneesh

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-17 Thread Steve Capper
On Fri, Oct 17, 2014 at 10:08:06AM +0530, Aneesh Kumar K.V wrote:
> Update generic gup implementation with powerpc specific details.
> On powerpc at pmd level we can have hugepte, normal pmd pointer
> or a pointer to the hugepage directory.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
> Changes from V1: 
> * Folded arm/arm64 related changes into the patch
> * Dropped pgd_huge from generic header
> 
>  arch/arm/include/asm/pgtable.h   |   2 +
>  arch/arm64/include/asm/pgtable.h |   2 +
>  include/linux/mm.h   |  26 +
>  mm/gup.c | 113 
> +++
>  4 files changed, 84 insertions(+), 59 deletions(-)
> 

Hi Aneesh,
Thanks for coding this up. I've tested this for arm (Arndale board) and
arm64 (Juno); it builds without any issues and passes my futex on THP
tail test.

Please add my:
Tested-by: Steve Capper 

As this patch progresses through -mm, the arm maintainer:
Russell King 

and arm64 maintainers:
Catalin Marinas 
Will Deacon 

should also be on CC.

Cheers,
-- 
Steve

> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index 90aa4583b308..46f81fbaa4a5 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>  /* to find an entry in a kernel page-table-directory */
>  #define pgd_offset_k(addr)   pgd_offset(_mm, addr)
>  
> +#define pgd_huge(pgd)(0)
> +
>  #define pmd_none(pmd)(!pmd_val(pmd))
>  #define pmd_present(pmd) (pmd_val(pmd))
>  
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index cefd3e825612..ed8f42497ac4 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
> newprot)
>  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
>  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
>  
> +#define pgd_huge(pgd)(0)
> +
>  /*
>   * Encode and decode a swap entry:
>   *   bits 0-1:   present (must be zero)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 02d11ee7f19d..f97732412cb4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1219,6 +1219,32 @@ long get_user_pages(struct task_struct *tsk, struct 
> mm_struct *mm,
>   struct vm_area_struct **vmas);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   struct page **pages);
> +
> +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
> +#ifndef is_hugepd
> +/*
> + * Some architectures support hugepage directory format that is
> + * required to support different hugetlbfs sizes.
> + */
> +typedef struct { unsigned long pd; } hugepd_t;
> +#define is_hugepd(hugepd) (0)
> +#define __hugepd(x) ((hugepd_t) { (x) })
> +static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
> +  unsigned pdshift, unsigned long end,
> +  int write, struct page **pages, int *nr)
> +{
> + return 0;
> +}
> +#else
> +extern int gup_hugepd(hugepd_t hugepd, unsigned long addr,
> +   unsigned pdshift, unsigned long end,
> +   int write, struct page **pages, int *nr);
> +#endif
> +extern int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
> + unsigned long sz, unsigned long end, int write,
> + struct page **pages, int *nr);
> +#endif
> +
>  struct kvec;
>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>   struct page **pages);
> diff --git a/mm/gup.c b/mm/gup.c
> index cd62c8c90d4a..13c560ef9ddf 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -786,65 +786,31 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
>  }
>  #endif /* __HAVE_ARCH_PTE_SPECIAL */
>  
> -static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, int write, struct page **pages, int *nr)
> +int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
> +  unsigned long sz, unsigned long end, int write,
> +  struct page **pages, int *nr)
>  {
> - struct page *head, *page, *tail;
>   int refs;
> + unsigned long pte_end;
> + struct page *head, *page, *tail;
>  
> - if (write && !pmd_write(orig))
> - return 0;
> -
> - refs = 0;
> - head = pmd_page(orig);
> - page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - tail = page;
> - do {
> - VM_BUG_ON_PAGE(compound_head(page) != head, page);
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
>  
> - if (!page_cache_add_speculative(head, refs)) {
> - *nr -= refs;
> + if (write && !pte_write(orig))
>   return 0;
> - }
>  
> - if 

Re: [PATCH V2 1/2] mm: Update generic gup implementation to handle hugepage directory

2014-10-17 Thread Steve Capper
On Fri, Oct 17, 2014 at 10:08:06AM +0530, Aneesh Kumar K.V wrote:
 Update generic gup implementation with powerpc specific details.
 On powerpc at pmd level we can have hugepte, normal pmd pointer
 or a pointer to the hugepage directory.
 
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 ---
 Changes from V1: 
 * Folded arm/arm64 related changes into the patch
 * Dropped pgd_huge from generic header
 
  arch/arm/include/asm/pgtable.h   |   2 +
  arch/arm64/include/asm/pgtable.h |   2 +
  include/linux/mm.h   |  26 +
  mm/gup.c | 113 
 +++
  4 files changed, 84 insertions(+), 59 deletions(-)
 

Hi Aneesh,
Thanks for coding this up. I've tested this for arm (Arndale board) and
arm64 (Juno); it builds without any issues and passes my futex on THP
tail test.

Please add my:
Tested-by: Steve Capper steve.cap...@linaro.org

As this patch progresses through -mm, the arm maintainer:
Russell King li...@arm.linux.org.uk

and arm64 maintainers:
Catalin Marinas catalin.mari...@arm.com
Will Deacon will.dea...@arm.com

should also be on CC.

Cheers,
-- 
Steve

 diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
 index 90aa4583b308..46f81fbaa4a5 100644
 --- a/arch/arm/include/asm/pgtable.h
 +++ b/arch/arm/include/asm/pgtable.h
 @@ -181,6 +181,8 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  /* to find an entry in a kernel page-table-directory */
  #define pgd_offset_k(addr)   pgd_offset(init_mm, addr)
  
 +#define pgd_huge(pgd)(0)
 +
  #define pmd_none(pmd)(!pmd_val(pmd))
  #define pmd_present(pmd) (pmd_val(pmd))
  
 diff --git a/arch/arm64/include/asm/pgtable.h 
 b/arch/arm64/include/asm/pgtable.h
 index cefd3e825612..ed8f42497ac4 100644
 --- a/arch/arm64/include/asm/pgtable.h
 +++ b/arch/arm64/include/asm/pgtable.h
 @@ -464,6 +464,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t 
 newprot)
  extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
  extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
  
 +#define pgd_huge(pgd)(0)
 +
  /*
   * Encode and decode a swap entry:
   *   bits 0-1:   present (must be zero)
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index 02d11ee7f19d..f97732412cb4 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1219,6 +1219,32 @@ long get_user_pages(struct task_struct *tsk, struct 
 mm_struct *mm,
   struct vm_area_struct **vmas);
  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
   struct page **pages);
 +
 +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
 +#ifndef is_hugepd
 +/*
 + * Some architectures support hugepage directory format that is
 + * required to support different hugetlbfs sizes.
 + */
 +typedef struct { unsigned long pd; } hugepd_t;
 +#define is_hugepd(hugepd) (0)
 +#define __hugepd(x) ((hugepd_t) { (x) })
 +static inline int gup_hugepd(hugepd_t hugepd, unsigned long addr,
 +  unsigned pdshift, unsigned long end,
 +  int write, struct page **pages, int *nr)
 +{
 + return 0;
 +}
 +#else
 +extern int gup_hugepd(hugepd_t hugepd, unsigned long addr,
 +   unsigned pdshift, unsigned long end,
 +   int write, struct page **pages, int *nr);
 +#endif
 +extern int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
 + unsigned long sz, unsigned long end, int write,
 + struct page **pages, int *nr);
 +#endif
 +
  struct kvec;
  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
   struct page **pages);
 diff --git a/mm/gup.c b/mm/gup.c
 index cd62c8c90d4a..13c560ef9ddf 100644
 --- a/mm/gup.c
 +++ b/mm/gup.c
 @@ -786,65 +786,31 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
 unsigned long end,
  }
  #endif /* __HAVE_ARCH_PTE_SPECIAL */
  
 -static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 - unsigned long end, int write, struct page **pages, int *nr)
 +int gup_huge_pte(pte_t orig, pte_t *ptep, unsigned long addr,
 +  unsigned long sz, unsigned long end, int write,
 +  struct page **pages, int *nr)
  {
 - struct page *head, *page, *tail;
   int refs;
 + unsigned long pte_end;
 + struct page *head, *page, *tail;
  
 - if (write  !pmd_write(orig))
 - return 0;
 -
 - refs = 0;
 - head = pmd_page(orig);
 - page = head + ((addr  ~PMD_MASK)  PAGE_SHIFT);
 - tail = page;
 - do {
 - VM_BUG_ON_PAGE(compound_head(page) != head, page);
 - pages[*nr] = page;
 - (*nr)++;
 - page++;
 - refs++;
 - } while (addr += PAGE_SIZE, addr != end);
  
 - if (!page_cache_add_speculative(head, refs)) {
 - *nr -= refs;
 + if (write  !pte_write(orig))
   return 0;
 - }
  
 - if