Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-18 Thread Kirill A. Shutemov
On Wed, Oct 17, 2012 at 10:32:13AM +0800, Ni zhan Chen wrote:
> On 10/03/2012 08:04 AM, Kirill A. Shutemov wrote:
> >On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> >>On Tue,  2 Oct 2012 18:19:22 +0300
> >>"Kirill A. Shutemov"  wrote:
> >>
> >>>During testing I noticed big (up to 2.5 times) memory consumption overhead
> >>>on some workloads (e.g. ft.A from NPB) if THP is enabled.
> >>>
> >>>The main reason for that big difference is the lack of a zero page in the
> >>>THP case.  We have to allocate a real page on a read page fault.
> >>>
> >>>A program to demonstrate the issue:
> >>>#include <assert.h>
> >>>#include <stdlib.h>
> >>>#include <unistd.h>
> >>>
> >>>#define MB 1024*1024
> >>>
> >>>int main(int argc, char **argv)
> >>>{
> >>> char *p;
> >>> int i;
> >>>
> >>> posix_memalign((void **)&p, 2 * MB, 200 * MB);
> >>> for (i = 0; i < 200 * MB; i+= 4096)
> >>> assert(p[i] == 0);
> >>> pause();
> >>> return 0;
> >>>}
> >>>
> >>>With thp-never RSS is about 400k, but with thp-always it's 200M.
> >>>After the patchset thp-always RSS is 400k too.
> >>I'd like to see a full description of the design, please.
> >Okay. Design overview.
> >
> >The huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
> >with zeros.  The way we allocate it changes over the course of the patchset:
> >
> >- [01/10] simplest way: hzp is allocated at boot time in hugepage_init();
> >- [09/10] lazy allocation on first use;
> >- [10/10] lockless refcounting + shrinker-reclaimable hzp.
> >
> >We set it up in do_huge_pmd_anonymous_page() if the area around the fault
> >address is suitable for THP and we've got a read page fault.
> >If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as
> >we normally do in THP.
> >
> >On a wp fault to the hzp we allocate real memory for the huge page and clear
> >it.  On ENOMEM we fall back gracefully: we create a new pmd table and set the
> >pte around the fault address to a newly allocated normal (4k) page. All other
> >ptes in the pmd are set to the normal zero page.
> >
> >We cannot split the hzp (and it's a bug if we try), but we can split the pmd
> >which points to it. On splitting the pmd we create a table with all ptes
> >set to the normal zero page.
> >
> >The patchset is organized in a bisect-friendly way:
> >  Patches 01-07: prepare all code paths for hzp
> >  Patch 08: all code paths are covered: safe to setup hzp
> >  Patch 09: lazy allocation
> >  Patch 10: lockless refcounting for hzp
> >
> >--
> >
> >At hpa's request I've tried an alternative approach to the hzp implementation
> >(see the "Virtual huge zero page" patchset): a pmd table with all entries set
> >to the zero page. That way should be more cache friendly, but it increases
> >TLB pressure.
> >
> >The problem with the virtual huge zero page: it requires per-arch enabling.
> >We need a way to mark that a pmd table has all its ptes set to the zero page.
> >
> >Some numbers to compare two implementations (on 4s Westmere-EX):
> >
> >Microbenchmark1
> >===============
> >
> >test:
> > posix_memalign((void **)&p, 2 * MB, 8 * GB);
> > for (i = 0; i < 100; i++) {
> > assert(memcmp(p, p + 4*GB, 4*GB) == 0);
> > asm volatile ("": : :"memory");
> > }
> >
> >hzp:
> >  Performance counter stats for './test_memcmp' (5 runs):
> >
> >      32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
> >                40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
> >                 0 CPU-migrations            #    0.000 K/sec
> >             4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
> >    76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
> >    36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
> >     1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
> >   134,355,715,816 instructions              #    1.75  insns per cycle
> >                                             #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
> >    13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
> >         1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
> >
> >      32.413866442 seconds time elapsed                                          ( +-  0.13% )
> >
> >vhzp:
> >  Performance counter stats for './test_memcmp' (5 runs):
> >
> >      30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
> >                38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
> >                 0 CPU-migrations            #    0.000 K/sec
> >             4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )

Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/03/2012 08:04 AM, Kirill A. Shutemov wrote:

On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:

On Tue,  2 Oct 2012 18:19:22 +0300
"Kirill A. Shutemov"  wrote:


During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is the lack of a zero page in the
THP case.  We have to allocate a real page on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
 char *p;
 int i;

 posix_memalign((void **)&p, 2 * MB, 200 * MB);
 for (i = 0; i < 200 * MB; i+= 4096)
 assert(p[i] == 0);
 pause();
 return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

Okay. Design overview.

The huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
with zeros.  The way we allocate it changes over the course of the patchset:

- [01/10] simplest way: hzp is allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as
we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear
it.  On ENOMEM we fall back gracefully: we create a new pmd table and set the
pte around the fault address to a newly allocated normal (4k) page. All other
ptes in the pmd are set to the normal zero page.

We cannot split the hzp (and it's a bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to the normal zero page.

The patchset is organized in a bisect-friendly way:
  Patches 01-07: prepare all code paths for hzp
  Patch 08: all code paths are covered: safe to setup hzp
  Patch 09: lazy allocation
  Patch 10: lockless refcounting for hzp

--

At hpa's request I've tried an alternative approach to the hzp implementation
(see the "Virtual huge zero page" patchset): a pmd table with all entries set
to the zero page. That way should be more cache friendly, but it increases
TLB pressure.

The problem with the virtual huge zero page: it requires per-arch enabling.
We need a way to mark that a pmd table has all its ptes set to the zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile ("": : :"memory");
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

       32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                 40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                  0 CPU-migrations            #    0.000 K/sec
              4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
     76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
     36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
      1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
    134,355,715,816 instructions              #    1.75  insns per cycle
                                              #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
     13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
          1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

       32.413866442 seconds time elapsed                                          ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

       30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                 38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                  0 CPU-migrations            #    0.000 K/sec
              4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
     71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
     31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
        773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
    134,982,215,437 instructions              #    1.88  insns per cycle
                                              #    0.23  stalled cycles per insn

Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-02 Thread Andrew Morton
On Wed, 3 Oct 2012 03:04:02 +0300
"Kirill A. Shutemov"  wrote:

> Is the overview complete enough? Have I answered all your questions here?

Yes, thanks!

The design overview is short enough to be put in as code comments in
suitable places.


Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-02 Thread Kirill A. Shutemov
On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> On Tue,  2 Oct 2012 18:19:22 +0300
> "Kirill A. Shutemov"  wrote:
> 
> > During testing I noticed big (up to 2.5 times) memory consumption overhead
> > on some workloads (e.g. ft.A from NPB) if THP is enabled.
> > 
> > The main reason for that big difference is the lack of a zero page in the
> > THP case.  We have to allocate a real page on a read page fault.
> > 
> > A program to demonstrate the issue:
> > #include <assert.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > 
> > #define MB 1024*1024
> > 
> > int main(int argc, char **argv)
> > {
> > char *p;
> > int i;
> > 
> > posix_memalign((void **)&p, 2 * MB, 200 * MB);
> > for (i = 0; i < 200 * MB; i+= 4096)
> > assert(p[i] == 0);
> > pause();
> > return 0;
> > }
> > 
> > With thp-never RSS is about 400k, but with thp-always it's 200M.
> > After the patchset thp-always RSS is 400k too.
> 
> I'd like to see a full description of the design, please.

Okay. Design overview.

The huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
with zeros.  The way we allocate it changes over the course of the patchset:

- [01/10] simplest way: hzp is allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as
we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear
it.  On ENOMEM we fall back gracefully: we create a new pmd table and set the
pte around the fault address to a newly allocated normal (4k) page. All other
ptes in the pmd are set to the normal zero page.

We cannot split the hzp (and it's a bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to the normal zero page.

The patchset is organized in a bisect-friendly way:
 Patches 01-07: prepare all code paths for hzp
 Patch 08: all code paths are covered: safe to setup hzp
 Patch 09: lazy allocation
 Patch 10: lockless refcounting for hzp
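
For illustration only, the decisions described above can be sketched as a tiny
C model.  The names (handle_anon_fault(), the outcome constants) are
hypothetical and do not correspond to the real code in mm/huge_memory.c; the
point is just the shape of the fault-path logic:

#include <stdio.h>
#include <stdbool.h>

enum fault_kind { READ_FAULT, WRITE_FAULT };

enum outcome {
	MAP_HUGE_ZERO_PAGE,	/* read fault: install a read-only hzp pmd */
	FALLBACK_PTE_FAULT,	/* hzp unavailable (ENOMEM): handle_pte_fault() path */
	ALLOC_REAL_HUGE_PAGE,	/* write/wp fault: allocate and clear a real huge page */
	FALLBACK_4K_PLUS_ZERO,	/* huge allocation failed: 4k page at the fault address,
				   all other ptes point to the normal zero page */
};

/* Toy model of an anonymous fault in an area suitable for THP. */
static enum outcome handle_anon_fault(enum fault_kind kind,
				      bool hzp_available, bool huge_alloc_ok)
{
	if (kind == READ_FAULT)
		return hzp_available ? MAP_HUGE_ZERO_PAGE : FALLBACK_PTE_FAULT;
	/* Write fault, including a wp fault on the hzp. */
	return huge_alloc_ok ? ALLOC_REAL_HUGE_PAGE : FALLBACK_4K_PLUS_ZERO;
}

int main(void)
{
	printf("read fault,  hzp ok      -> %d\n", handle_anon_fault(READ_FAULT, true, true));
	printf("read fault,  hzp ENOMEM  -> %d\n", handle_anon_fault(READ_FAULT, false, true));
	printf("write fault, alloc ok    -> %d\n", handle_anon_fault(WRITE_FAULT, true, true));
	printf("write fault, alloc fails -> %d\n", handle_anon_fault(WRITE_FAULT, true, false));
	return 0;
}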

--

At hpa's request I've tried an alternative approach to the hzp implementation
(see the "Virtual huge zero page" patchset): a pmd table with all entries set
to the zero page. That way should be more cache friendly, but it increases
TLB pressure.

The problem with the virtual huge zero page: it requires per-arch enabling.
We need a way to mark that a pmd table has all its ptes set to the zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
posix_memalign((void **)&p, 2 * MB, 8 * GB);
for (i = 0; i < 100; i++) {
assert(memcmp(p, p + 4*GB, 4*GB) == 0);
asm volatile ("": : :"memory");
}
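
For reference, a self-contained version of this microbenchmark might look
roughly like this (the GB macro, the error handling and the build details are
assumptions; the access pattern is the same as in the fragment above):

#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024UL * 1024)
#define GB (1024UL * MB)

int main(void)
{
	char *p;
	int i;

	/* 8G of 2M-aligned, never-written (hence zero) anonymous memory. */
	if (posix_memalign((void **)&p, 2 * MB, 8 * GB))
		abort();

	for (i = 0; i < 100; i++) {
		/* Read-only comparison of the two 4G halves: read faults only. */
		assert(memcmp(p, p + 4 * GB, 4 * GB) == 0);
		asm volatile ("" : : : "memory");	/* compiler barrier: keep the loop */
	}
	return 0;
}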

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

       32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                 40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                  0 CPU-migrations            #    0.000 K/sec
              4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
     76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
     36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
      1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
    134,355,715,816 instructions              #    1.75  insns per cycle
                                              #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
     13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
          1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

       32.413866442 seconds time elapsed                                          ( +-  0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

       30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                 38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                  0 CPU-migrations            #    0.000 K/sec
              4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
     71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
     31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
        773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
    134,982,215,437 instructions              #    1.88  insns per cycle

Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-02 Thread Andrea Arcangeli
Hi Andrew,

On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> From reading the code, it appears that we initially allocate a huge
> page and point the pmd at that.  If/when there is a write fault against
> that page we then populate the mm with ptes which point at the normal
> 4k zero page and populate the pte at the fault address with a newly
> allocated page?   Correct and complete?  If not, please fix ;)

During the cow, we never use 4k ptes, unless the 2m page allocation
fails.

> Also, IIRC, the early versions of the patch did not allocate the
> initial huge page at all - it immediately filled the mm with ptes which
> point at the normal 4k zero page.  Is that a correct recollection?
> If so, why the change?

That was a different design yes. The design in this patchset will not
do that.

> Also IIRC, Andrea had a little test app which demonstrated the TLB
> costs of the initial approach, and they were high?

Yes, we ran the benchmarks yesterday; this version is the one that will
decrease the TLB cost, and that seems the safest tradeoff.

> Please, let's capture all this knowledge in a single place, right here
> in the changelog.  And in code comments, where appropriate.  Otherwise
> people won't know why we made these decisions unless they go off and
> find lengthy, years-old and quite possibly obsolete email threads.

Agreed ;).

> Also, you've presented some data on the memory savings, but no
> quantitative testing results on the performance cost.  Both you and
> Andrea have run these tests and those results are important.  Let's
> capture them here.  And when designing such tests we should not just
> try to demonstrate the benefits of a code change - we should think of
> test cases which might be adversely affected and run those as well.

Right.

> It's not an appropriate time to be merging new features - please plan
> on preparing this patchset against 3.7-rc1.

Ok, I assume Kirill will take care of it.

Thanks,
Andrea


Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-02 Thread Andrew Morton
On Tue,  2 Oct 2012 18:19:22 +0300
"Kirill A. Shutemov"  wrote:

> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
> 
> The main reason for that big difference is the lack of a zero page in the
> THP case.  We have to allocate a real page on a read page fault.
> 
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> #define MB 1024*1024
> 
> int main(int argc, char **argv)
> {
> char *p;
> int i;
> 
> posix_memalign((void **)&p, 2 * MB, 200 * MB);
> for (i = 0; i < 200 * MB; i+= 4096)
> assert(p[i] == 0);
> pause();
> return 0;
> }
> 
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patchset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

From reading the code, it appears that we initially allocate a huge
page and point the pmd at that.  If/when there is a write fault against
that page we then populate the mm with ptes which point at the normal
4k zero page and populate the pte at the fault address with a newly
allocated page?   Correct and complete?  If not, please fix ;)

Also, IIRC, the early versions of the patch did not allocate the
initial huge page at all - it immediately filled the mm with ptes which
point at the normal 4k zero page.  Is that a correct recollection?
If so, why the change?

Also IIRC, Andrea had a little test app which demonstrated the TLB
costs of the initial approach, and they were high?

Please, let's capture all this knowledge in a single place, right here
in the changelog.  And in code comments, where appropriate.  Otherwise
people won't know why we made these decisions unless they go off and
find lengthy, years-old and quite possibly obsolete email threads.


Also, you've presented some data on the memory savings, but no
quantitative testing results on the performance cost.  Both you and
Andrea have run these tests and those results are important.  Let's
capture them here.  And when designing such tests we should not just
try to demonstrate the benefits of a code change - we should think of
test cases which might be adversely affected and run those as well.


It's not an appropriate time to be merging new features - please plan
on preparing this patchset against 3.7-rc1.



Re: [PATCH v3 00/10] Introduce huge zero page

2012-10-02 Thread Andrea Arcangeli
On Tue, Oct 02, 2012 at 06:19:22PM +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" 
> 
> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
> 
> The main reason for that big difference is the lack of a zero page in the
> THP case.  We have to allocate a real page on a read page fault.
> 
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> #define MB 1024*1024
> 
> int main(int argc, char **argv)
> {
> char *p;
> int i;
> 
> posix_memalign((void **)&p, 2 * MB, 200 * MB);
> for (i = 0; i < 200 * MB; i+= 4096)
> assert(p[i] == 0);
> pause();
> return 0;
> }
> 
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patchset thp-always RSS is 400k too.
> 
> v3:
>  - fix potential deadlock in refcounting code on preemptive kernel.
>  - do not mark huge zero page as movable.
>  - fix typo in comment.
>  - Reviewed-by tag from Andrea Arcangeli.
> v2:
>  - Avoid find_vma() if we've already had vma on stack.
>Suggested by Andrea Arcangeli.
>  - Implement refcounting for huge zero page.
> 
> Kirill A. Shutemov (10):
>   thp: huge zero page: basic preparation
>   thp: zap_huge_pmd(): zap huge zero pmd
>   thp: copy_huge_pmd(): copy huge zero page
>   thp: do_huge_pmd_wp_page(): handle huge zero page
>   thp: change_huge_pmd(): keep huge zero page write-protected
>   thp: change split_huge_page_pmd() interface
>   thp: implement splitting pmd for huge zero page
>   thp: setup huge zero page on non-write page fault
>   thp: lazy huge zero page allocation
>   thp: implement refcounting for huge zero page

Reviewed-by: Andrea Arcangeli <aarca...@redhat.com>


[PATCH v3 00/10] Introduce huge zero page

2012-10-02 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is the lack of a zero page in the
THP case.  We have to allocate a real page on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
char *p;
int i;

posix_memalign((void **)&p, 2 * MB, 200 * MB);
for (i = 0; i < 200 * MB; i+= 4096)
assert(p[i] == 0);
pause();
return 0;
}
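
To observe the RSS numbers quoted just below directly, one could extend the
test to print its own VmRSS; a sketch (the /proc/self/status parsing assumes
the usual Linux format and is not part of this patchset):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024 * 1024)

/* Print the VmRSS: line from /proc/self/status. */
static void print_rss(const char *when)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (strncmp(line, "VmRSS:", 6) == 0)
			printf("%-13s %s", when, line);
	fclose(f);
}

int main(void)
{
	char *p;
	long i;

	posix_memalign((void **)&p, 2 * MB, 200 * MB);
	print_rss("after alloc:");

	for (i = 0; i < 200 * MB; i += 4096)	/* read faults only */
		assert(p[i] == 0);
	print_rss("after reads:");		/* stays small with the huge zero page */

	memset(p, 1, 200 * MB);			/* write faults allocate real pages */
	print_rss("after writes:");

	return 0;
}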

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.

v3:
 - fix potential deadlock in refcounting code on preemptive kernel.
 - do not mark huge zero page as movable.
 - fix typo in comment.
 - Reviewed-by tag from Andrea Arcangeli.
v2:
 - Avoid find_vma() if we've already had vma on stack.
   Suggested by Andrea Arcangeli.
 - Implement refcounting for huge zero page.

Kirill A. Shutemov (10):
  thp: huge zero page: basic preparation
  thp: zap_huge_pmd(): zap huge zero pmd
  thp: copy_huge_pmd(): copy huge zero page
  thp: do_huge_pmd_wp_page(): handle huge zero page
  thp: change_huge_pmd(): keep huge zero page write-protected
  thp: change split_huge_page_pmd() interface
  thp: implement splitting pmd for huge zero page
  thp: setup huge zero page on non-write page fault
  thp: lazy huge zero page allocation
  thp: implement refcounting for huge zero page

 Documentation/vm/transhuge.txt |4 +-
 arch/x86/kernel/vm86_32.c  |2 +-
 fs/proc/task_mmu.c |2 +-
 include/linux/huge_mm.h|   14 ++-
 include/linux/mm.h |8 +
 mm/huge_memory.c   |  307 
 mm/memory.c|   11 +--
 mm/mempolicy.c |2 +-
 mm/mprotect.c  |2 +-
 mm/mremap.c|2 +-
 mm/pagewalk.c  |2 +-
 11 files changed, 305 insertions(+), 51 deletions(-)

-- 
1.7.7.6


