Re: [PATCH 04/13] powerpc: Update hugetlb huge_pte_alloc and tablewalk code for FSL BOOKE

2011-11-29 Thread Becky Bruce

On Nov 28, 2011, at 11:25 PM, Benjamin Herrenschmidt wrote:

 On Mon, 2011-10-10 at 15:50 -0500, Becky Bruce wrote:
 From: Becky Bruce bec...@kernel.crashing.org
 
 This updates the hugetlb page table code to handle 64-bit FSL_BOOKE.
 The previous 32-bit work counted on the inner levels of the page table
 collapsing.
 
 Seriously, my brain hurts !!!

Now you know how I felt when I got the original code from David :)

 
 So I've tried to understand that code and so far, what I came up with is
 this, please reply and let me know if I'm full of crack or what ...
 
 - David's code has entire levels branching off into hugepd's which
 are hugetlb specific page dirs. That requires address space constraints.

Yes.

 
 - The hack you do with HUGEPD_PGD_SHIFT defined to PGDIR_SHIFT and
 HUGEPD_PUD_SHIFT to PUD_SHIFT consists of collapsing that back into the
 normal levels.

That exists so Jimi's code for A2 and mine can coexist peacefully without 
ifdeffing huge blocks because we peel off the page table at different points 
for a hugepd.  In my case, if I have a 4M page, that is handled by having 2 
entries at the 2M layer, each of which is a pointer to the same pte stored in a 
kmem_cache created for that purpose.  In the server/A2 case, they peel off at 
the layer above 4M and have a sub-table kmem_cache that has a bunch of 
same-size huge page ptes (this is all because of the slice constraint).
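
To put numbers on that (a throwaway userspace sketch, not kernel code; hugepd_entries is a made-up name): a huge page occupies 1 << (pshift - pdshift) entries at a level whose entries each map 1 << pdshift bytes.

#include <stdio.h>

/* Illustration only: how many directory entries one huge page consumes
 * at a level whose entries each map 1 << pdshift bytes. */
static unsigned int hugepd_entries(unsigned int pshift, unsigned int pdshift)
{
	return 1u << (pshift - pdshift);
}

int main(void)
{
	printf("4M page over 2M PMDs:  %u entries\n", hugepd_entries(22, 21));
	printf("16M page over 2M PMDs: %u entries\n", hugepd_entries(24, 21));
	return 0;
}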

 
 - I really don't understand what you are doing in __hugepte_alloc(). It
 seems to me that you are trying to make things still point to some kind
 of separate hugepd dir with the hugeptes in it and have the page tables
 point to that but I don't quite get it.

In your example below, the alloc code is:
1) allocating a small kmem_cache for the pte

2) filling in 8 entries at the 2M level with the pointers to that pte, with the 
upper bit munged to indicate huge and bits in the lower region that store the 
huge page size because it can't be encoded in the book3e pte format
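
As a standalone sketch of that fill (fill_hugepd is a made-up name, and the PD_HUGE value and the alignment assumption are illustrative, not the kernel's exact definitions):

#include <stdint.h>

#define PD_HUGE 0x80000000UL	/* top bit; set in a normal kernel pointer */

/* Every directory slot covered by the huge page gets the same pointer
 * to the single hugepte, with the top bit cleared to mark it huge and
 * the page-size shift stored in the low bits (free bits, assuming the
 * hugepte allocation is sufficiently aligned). */
static void fill_hugepd(unsigned long *pdp, void *hugepte,
			unsigned int pshift, unsigned int pdshift)
{
	unsigned long entry = ((unsigned long)(uintptr_t)hugepte & ~PD_HUGE) | pshift;
	unsigned int i, n = 1u << (pshift - pdshift);	/* 8 for 16M over 2M */

	for (i = 0; i < n; i++)
		pdp[i] = entry;
}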

 
 - Couldn't we just instead ditch the whole hugepd concept altogether
 and simply have the huge ptes in the page table at the right level,
 using possibly multiple consecutive entries for a single page when
 needed ?
 
 Example: 4K base page size. PMD maps 2M. A 16M page could be
 represented by having 8 consecutive hugeptes pointing to the same huge
 page in the PMD directory.

We currently have 8 consecutive PMD entries that are pointers to the same 
kmem_cache allocation that holds the actual PTE.  I did this for a few reasons:

1) I was trying to stay as close to what David had done as possible

2) symmetry - in every other case entries at higher levels of the normal page 
table are pointers to something, and it's easy to identify that something is a 
pointer to hugepte using David's upper-bit-flipping trick.  If we have an 
actual entry mixed in with the pointers it might be hard to tell that it's an 
actual PTE and not a pointer without getting information from somewhere else 
(i.e. the vma)

3) I was trying to avoid having multiple copies of the actual pte - this way 
it's easy to do stuff like change the perms on the PTE, since I only have to 
modify one copy

4) I needed the information laid out for quick TLB miss fault-handling of 
hugepte tlb misses (see below).

 
 I believe we always know when accessing a PTE whether it's going to be
 huge or not and if it's huge, the page size. IE. All the functions we
 care about either get the vma (which gets you everything you need) or
 the size (huge_pte_alloc).

An exception is the 32-bit fault handler asm code.  It does a walk of the page 
table to reload the tlb.  We need to be able to easily identify that we're 
walking into a hugepage area so we know to load the tlbcam.  Having the pointer 
there with the munged upper bit that David devised is very convenient for that.
Also, the Book3e page table format does not allow us to represent page sizes
> 32M.  So that's encoded in the hugepd instead (and not in the pte).
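
As a rough C rendering of what the fast path gets to do (the real handler is asm, and the constant values here are illustrative):

#include <stdbool.h>

#define PD_HUGE		  0x80000000UL	/* top bit, set in real kernel pointers */
#define HUGEPD_SHIFT_MASK 0x3fUL	/* low bits holding the size shift */

/* A populated entry with the top bit clear is a hugepd pointer, since
 * an ordinary kernel pointer always has that bit set. */
static bool entry_is_hugepd(unsigned long pd)
{
	return pd != 0 && !(pd & PD_HUGE);
}

/* The tlbcam size shift comes straight out of the entry itself; no vma
 * lookup is needed on the fast path. */
static unsigned int hugepd_size_shift(unsigned long pd)
{
	return (unsigned int)(pd & HUGEPD_SHIFT_MASK);
}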

I'm not sure how to get around this without slowing things down.  I originally 
had a slower handler and it turned out to impact performance of several 
important workloads and my perf guys griped at me. I was actually eventually 
planning to rewrite the 64b fsl book3e handler to deal with this in asm as 
well.   Large workloads on our systems do a lot of tlbcam entry replacement due 
to 1) the small size of the tlbcam and 2) the lack of any hardware replacement 
policy on that array.

There are other places where we'd have to modify the code to have the vma 
available (not that it's hard to do, but it's not floating around everywhere).  
And there may be other places where this is an issue - I'd have to go dig 
around a bit to answer that.

For the record, I hate the idea of not being able to walk the page table 
without going elsewhere for information.  IMHO I should be able to tell 
everything I need to load a TLB entry from there without digging up a vma.

 
 This should be much simpler than what we have today.

Re: [PATCH 04/13] powerpc: Update hugetlb huge_pte_alloc and tablewalk code for FSL BOOKE

2011-11-29 Thread David Gibson
On Tue, Nov 29, 2011 at 10:36:42AM -0600, Becky Bruce wrote:
 
 On Nov 28, 2011, at 11:25 PM, Benjamin Herrenschmidt wrote:
 
  On Mon, 2011-10-10 at 15:50 -0500, Becky Bruce wrote:
  From: Becky Bruce bec...@kernel.crashing.org
  
  This updates the hugetlb page table code to handle 64-bit FSL_BOOKE.
  The previous 32-bit work counted on the inner levels of the page table
  collapsing.
  
  Seriously, my brain hurts !!!
 
 Now you know how I felt when I got the original code from David :)

Heh, I can't blame you.  Between the constraints of the hardware and
fitting in with x86, that thing's accreted into a horrible pile.

  So I've tried to understand that code and so far, what I came up with is
  this, please reply and let me know if I'm full of crack or what ...
  
  - David's code has entire levels branching off into hugepd's which
  are hugetlb specific page dirs. That requires address space constraints.
 
 Yes.
 
  - The hack you do with HUGEPD_PGD_SHIFT defined to PGDIR_SHIFT and
  HUGEPD_PUD_SHIFT to PUD_SHIFT consists of collapsing that back into the
  normal levels.
 
 That exists so Jimi's code for A2 and mine can coexist peacefully
 without ifdeffing huge blocks because we peel off the page table at
 different points for a hugepd.  In my case, if I have a 4M page,
 that is handled by having 2 entries at the 2M layer, each of which
 is a pointer to the same pte stored in a kmem_cache created for that
 purpose.  In the server/A2 case, they peel off at the layer above 4M
 and have a sub-table kmem_cache that has a bunch of same-size huge
 page ptes (this is all because of the slice constraint).
 
  - I really don't understand what you are doing in __hugepte_alloc(). It
  seems to me that you are trying to make things still point to some kind
  of separate hugepd dir with the hugeptes in it and have the page tables
  point to that but I don't quite get it.
 
 In your example below, the alloc code is:
 1) allocating a small kmem_cache for the pte
 
 2) filling in 8 entries at the 2M level with the pointers to that pte, with 
 the upper bit munged to indicate huge and bits in the lower region that store 
 the huge page size because it can't be encoded in the book3e pte format
 
  - Couldn't we just instead ditch the whole hugepd concept altogether
  and simply have the huge ptes in the page table at the right level,
  using possibly multiple consecutive entries for a single page when
  needed ?
  
  Example: 4K base page size. PMD maps 2M. A 16M page could be
  represented by having 8 consecutive hugeptes pointing to the same huge
  page in the PMD directory.
 
 We currently have 8 consecutive PMD entries that are pointers to the
 same kmem_cache allocation that holds the actual PTE.  I did this for a few
 reasons:
 
 1) I was trying to stay as close to what David had done as possible
 
 2) symmetry - in every other case entries at higher levels of the
 normal page table are pointers to something, and it's easy to
 identify that something is a pointer to hugepte using David's
 upper-bit-flipping trick.  If we have an actual entry mixed in with
 the pointers it might be hard to tell that it's an actual PTE and
 not a pointer without getting information from somewhere else
 (i.e. the vma)
 
 3) I was trying to avoid having multiple copies of the actual pte -
 this way it's easy to do stuff like change the perms on the PTE,
 since I only have to modify one copy
 
 4) I needed the information laid out for quick TLB miss
 fault-handling of hugepte tlb misses (see below).
 
  I believe we always know when accessing a PTE whether it's going to be
  huge or not and if it's huge, the page size. IE. All the functions we
  care about either get the vma (which gets you everything you need) or
  the size (huge_pte_alloc).
 
 An exception is the 32-bit fault handler asm code.  It does a walk of
 the page table to reload the tlb.  We need to be able to easily
 identify that we're walking into a hugepage area so we know to load
 the tlbcam.  Having the pointer there with the munged upper bit that
 David devised is very convenient for that.  Also, the Book3e page
 table format does not allow us to represent page sizes > 32M.  So
 that's encoded in the hugepd instead (and not in the pte).
 
 I'm not sure how to get around this without slowing things down.  I
 originally had a slower handler and it turned out to impact
 performance of several important workloads and my perf guys griped
 at me. I was actually eventually planning to rewrite the 64b fsl
 book3e handler to deal with this in asm as well.  Large workloads on
 our systems do a lot of tlbcam entry replacement due to 1) the small
 size of the tlbcam and 2) the lack of any hardware replacement
 policy on that array.
 
 There are other places where we'd have to modify the code to have
 the vma available (not that it's hard to do, but it's not floating
 around everywhere).  And there may be other places where this is an
 issue - I'd have to go dig around a bit to answer that.

Re: [PATCH 04/13] powerpc: Update hugetlb huge_pte_alloc and tablewalk code for FSL BOOKE

2011-11-28 Thread Benjamin Herrenschmidt
On Mon, 2011-10-10 at 15:50 -0500, Becky Bruce wrote:
 From: Becky Bruce bec...@kernel.crashing.org
 
 This updates the hugetlb page table code to handle 64-bit FSL_BOOKE.
 The previous 32-bit work counted on the inner levels of the page table
 collapsing.

Seriously, my brain hurts !!!

So I've tried to understand that code and so far, what I came up with is
this, please reply and let me know if I'm full of crack or what ...

 - David's code has entire levels branching off into hugepd's which
are hugetlb specific page dirs. That requires address space constraints.

 - The hack you do with HUGEPD_PGD_SHIFT defined to PGDIR_SHIFT and
HUGEPD_PUD_SHIFT to PUD_SHIFT consists of collapsing that back into the
normal levels.

 - I really don't understand what you are doing in __hugepte_alloc(). It
seems to me that you are trying to make things still point to some kind
of separate hugepd dir with the hugeptes in it and have the page tables
point to that but I don't quite get it.

 - Couldn't we just instead ditch the whole hugepd concept altogether
and simply have the huge ptes in the page table at the right level,
using possibly multiple consecutive entries for a single page when
needed ?

Example: 4K base page size. PMD maps 2M. A 16M page could be
represented by having 8 consecutive hugeptes pointing to the same huge
page in the PMD directory.
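
Sketching the idea (made-up names, userspace C, just to show the shape): the huge pte value itself gets replicated into every slot its page covers, with no hugepd indirection at all.

#include <stdint.h>

/* pte is an opaque value here; pmd_shift would be 21 for 2M entries. */
static void set_huge_pte_replicated(uint64_t *pmdp, uint64_t pte,
				    unsigned int page_shift,
				    unsigned int pmd_shift)
{
	unsigned int i, n = 1u << (page_shift - pmd_shift);	/* 8 for 16M */

	for (i = 0; i < n; i++)
		pmdp[i] = pte;
}

(Updating such a pte would of course mean updating all n copies.)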

I believe we always know when accessing a PTE whether it's going to be
huge or not and if it's huge, the page size. IE. All the functions we
care about either get the vma (which gets you everything you need) or
the size (huge_pte_alloc).

This should be much simpler than what we have today.

That way, we have a completely generic across-the-board way to store
huge pte's in our page tables, which is totally disconnected from the
slices. The latter remains a purely server-only thing which only affects
the SLB code and get_unmapped_area().

That means that we'll have to find another option for PowerEN giant
indirects, but that's a non-issue at the moment. I think we can keep the
complexity local to the PowerEN code by doing shadows there if we need
to.

What do you think ?

Ben.




[PATCH 04/13] powerpc: Update hugetlb huge_pte_alloc and tablewalk code for FSL BOOKE

2011-10-10 Thread Becky Bruce
From: Becky Bruce bec...@kernel.crashing.org

This updates the hugetlb page table code to handle 64-bit FSL_BOOKE.
The previous 32-bit work counted on the inner levels of the page table
collapsing.

Signed-off-by: Becky Bruce bec...@kernel.crashing.org
---
 arch/powerpc/mm/hugetlbpage.c |   48 ++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 71c6533..b4a4884 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -155,11 +155,28 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			hpdp->pd = 0;
 		kmem_cache_free(cachep, new);
 	}
+#else
+	if (!hugepd_none(*hpdp))
+		kmem_cache_free(cachep, new);
+	else
+		hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
 #endif
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
 
+/*
+ * These macros define how to determine which level of the page table holds
+ * the hpdp.
+ */
+#ifdef CONFIG_PPC_FSL_BOOK3E
+#define HUGEPD_PGD_SHIFT PGDIR_SHIFT
+#define HUGEPD_PUD_SHIFT PUD_SHIFT
+#else
+#define HUGEPD_PGD_SHIFT PUD_SHIFT
+#define HUGEPD_PUD_SHIFT PMD_SHIFT
+#endif
+
 pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
 	pgd_t *pg;
@@ -172,12 +189,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	addr &= ~(sz-1);
 
 	pg = pgd_offset(mm, addr);
-	if (pshift >= PUD_SHIFT) {
+
+	if (pshift >= HUGEPD_PGD_SHIFT) {
 		hpdp = (hugepd_t *)pg;
 	} else {
 		pdshift = PUD_SHIFT;
 		pu = pud_alloc(mm, pg, addr);
-		if (pshift >= PMD_SHIFT) {
+		if (pshift >= HUGEPD_PUD_SHIFT) {
 			hpdp = (hugepd_t *)pu;
 		} else {
 			pdshift = PMD_SHIFT;
@@ -453,14 +471,23 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	unsigned long start;
 
 	start = addr;
-	pmd = pmd_offset(pud, addr);
 	do {
+		pmd = pmd_offset(pud, addr);
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd))
 			continue;
+#ifdef CONFIG_PPC_FSL_BOOK3E
+		/*
+		 * Increment next by the size of the huge mapping since
+		 * there may be more than one entry at this level for a
+		 * single hugepage, but all of them point to
+		 * the same kmem cache that holds the hugepte.
+		 */
+		next = addr + (1 << hugepd_shift(*(hugepd_t *)pmd));
+#endif
 		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
 				  addr, next, floor, ceiling);
-	} while (pmd++, addr = next, addr != end);
+	} while (addr = next, addr != end);
 
 	start &= PUD_MASK;
 	if (start < floor)
@@ -487,8 +514,8 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 	unsigned long start;
 
 	start = addr;
-	pud = pud_offset(pgd, addr);
 	do {
+		pud = pud_offset(pgd, addr);
 		next = pud_addr_end(addr, end);
 		if (!is_hugepd(pud)) {
 			if (pud_none_or_clear_bad(pud))
@@ -496,10 +523,19 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
 					       ceiling);
 		} else {
+#ifdef CONFIG_PPC_FSL_BOOK3E
+			/*
+			 * Increment next by the size of the huge mapping since
+			 * there may be more than one entry at this level for a
+			 * single hugepage, but all of them point to
+			 * the same kmem cache that holds the hugepte.
+			 */
+			next = addr + (1 << hugepd_shift(*(hugepd_t *)pud));
+#endif
 			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
 					  addr, next, floor, ceiling);
 		}
-	} while (pud++, addr = next, addr != end);
+	} while (addr = next, addr != end);
 
 	start &= PGDIR_MASK;
 	if (start < floor)
-- 
1.5.6.5
