Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/16/2012 07:28 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote:

On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa's request I've tried an alternative approach to the hzp implementation
(see the Virtual huge zero page patchset): a pmd table with all entries set to
the zero page. This way should be more cache friendly, but it increases TLB
pressure.

Thanks for your excellent work. But could you explain why the
current implementation is not cache friendly while hpa's proposal
is? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) of cache
space to get the zero page fully cached, where N is the cache associativity.
If the zero page is 2M, the cache pressure is significant.

On the other hand, a table of 4k zero pages (hpa's proposal) increases
pressure on the TLB, since we have more pages covering the same memory area,
so we have to do more page translations in this case.

On my test machine, with a simple memcmp() the virtual huge zero page is
faster. But it highly depends on TLB size, cache size, memory access and
page translation costs.

It looks like cache size in modern processors grows faster than TLB size.

Oh, I see, thanks for your quick response. Another question below,


The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Microbenchmark1
==

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile ("": : :"memory");
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
    76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
   134,355,715,816 instructions              #    1.75  insns per cycle
                                             #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
    13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
         1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                          ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
    71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
   134,982,215,437 instructions              #    1.88  insns per cycle
                                             #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
    13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
         1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]

      30.381324695 seconds time elapsed                                          ( +-  0.13% )

Could you tell me which data I should look at in these performance
counter stats? And what's the benefit of your current implementation
compared to hpa's proposal?

Sorry for being slow. Could you tell me which data I should look at
in these performance counter stats? The same question for the second
benchmark's counter stats, thanks in advance. :-)

I missed the relevant counters in this run; you can see them in the
second benchmark.

Relevant counters:
L1-dcache-*, LLC-*: shows cache related stats (hits/misses);
dTLB-*: shows data TLB hits and misses.

Indirect relevant counters:
stalled-cycles-*: how long CPU pipeline has to wait for data.


Oh, I see, thanks for your patience. :-)




Microbenchmark2
==

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 1000; i++) {
 char *_p = p;
 while (_p < p + 4*GB) {
 assert(*_p == *(_p + 4*GB));
 _p += 4096;
 asm volatile ("": : :"memory");
 }
 }

hzp:
  Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

       3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
                 9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
             4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

Hi,

Andrew, here's huge zero page patchset rebased to v3.7-rc1.

Andrea, I've dropped your Reviewed-by due to non-trivial conflicts during the
rebase. Could you look through it again? Patches 2, 3, 4, 7, 10 had conflicts,
mostly due to the new MMU notifiers interface.

=

During testing I noticed a big (up to 2.5x) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is the lack of a zero page in the
THP case: we have to allocate a real page on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
 char *p;
 int i;

 posix_memalign((void **)&p, 2 * MB, 200 * MB);
 for (i = 0; i < 200 * MB; i += 4096)
 assert(p[i] == 0);
 pause();
 return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset, thp-always RSS is 400k too.

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.  The way we allocate it changes over the patchset:

- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as
we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear
it. On ENOMEM, graceful fallback: we create a new pmd table and set the pte
around the fault address to a newly allocated normal (4k) page. All other
ptes in the pmd are set to the normal zero page.

We cannot split the hzp (and it's a bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to the normal zero page.
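The fault handling just described can be summarized in pseudocode (helper names are illustrative, not the real kernel symbols, apart from the functions named above):

```
read fault on an unmapped PMD-sized area:
        if (vma suitable for THP && hzp can be set up)
                map hzp into the pmd, read-only
        else
                fall back to handle_pte_fault()

write fault on a pmd mapping hzp:
        page = allocate huge page
        if (page)
                clear page, replace the hzp mapping with it
        else            /* ENOMEM: graceful fallback */
                create a pte table;
                set the pte at the fault address to a new 4k page;
                set all other ptes to the normal 4k zero page

split of a pmd mapping hzp:
        create a pte table with every pte set to the normal 4k zero page
```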

The patchset is organized in a bisect-friendly way:
  Patches 01-07: prepare all code paths for hzp
  Patch 08: all code paths are covered: safe to setup hzp
  Patch 09: lazy allocation
  Patch 10: lockless refcounting for hzp

v4:
  - Rebase to v3.7-rc1;
  - Update commit message;
v3:
  - fix potential deadlock in refcounting code on preemptive kernel.
  - do not mark huge zero page as movable.
  - fix typo in comment.
  - Reviewed-by tag from Andrea Arcangeli.
v2:
  - Avoid find_vma() if we already have the vma on the stack.
    Suggested by Andrea Arcangeli.
  - Implement refcounting for huge zero page.

--

By hpa's request I've tried an alternative approach to the hzp implementation
(see the Virtual huge zero page patchset): a pmd table with all entries set to
the zero page. This way should be more cache friendly, but it increases TLB
pressure.


Thanks for your excellent work. But could you explain why the current
implementation is not cache friendly while hpa's proposal is?
Thanks in advance.




The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote:

From: Kirill A. Shutemov kirill.shute...@linux.intel.com

Hi,

Andrew, here's huge zero page patchset rebased to v3.7-rc1.

Andrea, I've dropped your Reviewed-by due not-so-trivial conflicts in during
rebase. Could you look through it again. Patches 2, 3, 4, 7, 10 had conflicts.
Mostly due new MMU notifiers interface.

=

During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is lacking zero page in THP case.
We have to allocate a real page on read page fault.

A program to demonstrate the issue:
#include assert.h
#include stdlib.h
#include unistd.h

#define MB 1024*1024

int main(int argc, char **argv)
{
 char *p;
 int i;

 posix_memalign((void **)p, 2 * MB, 200 * MB);
 for (i = 0; i  200 * MB; i+= 4096)
 assert(p[i] == 0);
 pause();
 return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patcheset thp-always RSS is 400k too.

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.  The way how we allocate it changes in the patchset:

- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;

We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault.
If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we
normally do in THP.

On wp fault to hzp we allocate real memory for the huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set pte around
fault address to newly allocated normal (4k) page. All other ptes in the
pmd set to normal zero page.

We cannot split hzp (and it's bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to normal zero page.

Patchset organized in bisect-friendly way:
  Patches 01-07: prepare all code paths for hzp
  Patch 08: all code paths are covered: safe to setup hzp
  Patch 09: lazy allocation
  Patch 10: lockless refcounting for hzp

v4:
  - Rebase to v3.7-rc1;
  - Update commit message;
v3:
  - fix potential deadlock in refcounting code on preemptive kernel.
  - do not mark huge zero page as movable.
  - fix typo in comment.
  - Reviewed-by tag from Andrea Arcangeli.
v2:
  - Avoid find_vma() if we've already had vma on stack.
Suggested by Andrea Arcangeli.
  - Implement refcounting for huge zero page.

--

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.


Thanks for your excellent works. But could you explain me why current 
implementation not cache friendly and hpa's request cache friendly? 
Thanks in advance.




The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Mirobenchmark1
==

test:
 posix_memalign((void **)p, 2 * MB, 8 * GB);
 for (i = 0; i  100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile (: : :memory);
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

   32356.272845 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 40 context-switches  #0.001 K/sec  
  ( +-  0.94% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.130 K/sec  
  ( +-  0.00% )
 76,712,481,765 cycles#2.371 GHz
  ( +-  0.13% ) [83.31%]
 36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle   
  ( +-  0.28% ) [83.35%]
  1,684,049,110 stalled-cycles-backend#2.20% backend  cycles idle   
  ( +-  2.96% ) [66.67%]
134,355,715,816 instructions  #1.75  insns per cycle
  #0.27  stalled cycles per 
insn  ( +-  0.10% ) [83.35%]
 13,526,169,702 branches  #  418.039 M/sec  
  ( +-  0.10% ) [83.31%]
  1,058,230 branch-misses #0.01% of all branches
  ( +-  0.91% ) [83.36%]

   32.413866442 seconds time elapsed
  ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

   30327.183829 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Kirill A. Shutemov
On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:
 By hpa request I've tried alternative approach for hzp implementation (see
 Virtual huge zero page patchset): pmd table with all entries set to zero
 page. This way should be more cache friendly, but it increases TLB
 pressure.
 
 Thanks for your excellent works. But could you explain me why
 current implementation not cache friendly and hpa's request cache
 friendly? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) cache
space to get zero page fully cached, where N is cache associativity.
If zero page is 2M, cache pressure is significant.

On other hand with table of 4k zero pages (hpa's proposal) will increase
pressure on TLB, since we have more pages for the same memory area. So we
have to do more page translation in this case.

On my test machine with simple memcmp() virtual huge zero page is faster.
But it highly depends on TLB size, cache size, memory access and page
translation costs.

It looks like cache size in modern processors grows faster than TLB size.

 The problem with virtual huge zero page: it requires per-arch enabling.
 We need a way to mark that pmd table has all ptes set to zero page.
 
 Some numbers to compare two implementations (on 4s Westmere-EX):
 
 Mirobenchmark1
 ==
 
 test:
  posix_memalign((void **)p, 2 * MB, 8 * GB);
  for (i = 0; i  100; i++) {
  assert(memcmp(p, p + 4*GB, 4*GB) == 0);
  asm volatile (: : :memory);
  }
 
 hzp:
   Performance counter stats for './test_memcmp' (5 runs):
 
32356.272845 task-clock#0.998 CPUs utilized   
   ( +-  0.13% )
  40 context-switches  #0.001 K/sec   
   ( +-  0.94% )
   0 CPU-migrations#0.000 K/sec
   4,218 page-faults   #0.130 K/sec   
   ( +-  0.00% )
  76,712,481,765 cycles#2.371 GHz 
   ( +-  0.13% ) [83.31%]
  36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles 
  idle ( +-  0.28% ) [83.35%]
   1,684,049,110 stalled-cycles-backend#2.20% backend  cycles 
  idle ( +-  2.96% ) [66.67%]
 134,355,715,816 instructions  #1.75  insns per cycle
   #0.27  stalled cycles per 
  insn  ( +-  0.10% ) [83.35%]
  13,526,169,702 branches  #  418.039 M/sec   
   ( +-  0.10% ) [83.31%]
   1,058,230 branch-misses #0.01% of all branches 
   ( +-  0.91% ) [83.36%]
 
32.413866442 seconds time elapsed 
   ( +-  0.13% )
 
 vhzp:
   Performance counter stats for './test_memcmp' (5 runs):
 
30327.183829 task-clock#0.998 CPUs utilized   
   ( +-  0.13% )
  38 context-switches  #0.001 K/sec   
   ( +-  1.53% )
   0 CPU-migrations#0.000 K/sec
   4,218 page-faults   #0.139 K/sec   
   ( +-  0.01% )
  71,964,773,660 cycles#2.373 GHz 
   ( +-  0.13% ) [83.35%]
  31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles 
  idle ( +-  0.40% ) [83.32%]
 773,484,474 stalled-cycles-backend#1.07% backend  cycles 
  idle ( +-  6.61% ) [66.67%]
 134,982,215,437 instructions  #1.88  insns per cycle
   #0.23  stalled cycles per 
  insn  ( +-  0.11% ) [83.32%]
  13,509,150,683 branches  #  445.447 M/sec   
   ( +-  0.11% ) [83.34%]
   1,017,667 branch-misses #0.01% of all branches 
   ( +-  1.07% ) [83.32%]
 
30.381324695 seconds time elapsed 
   ( +-  0.13% )
 
 Could you tell me which data I should care in this performance
 counter. And what's the benefit of your current implementation
 compare to hpa's request?
 
 
 Mirobenchmark2
 ==
 
 test:
  posix_memalign((void **)p, 2 * MB, 8 * GB);
  for (i = 0; i  1000; i++) {
  char *_p = p;
  while (_p  p+4*GB) {
  assert(*_p == *(_p+4*GB));
  _p += 4096;
  asm volatile (: : :memory);
  }
  }
 
 hzp:
   Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
 
 3505.727639 task-clock#0.998 CPUs utilized   
   ( +-  0.26% )
   9 context-switches  #0.003 K/sec   
   ( +-  4.97% )
   4,384 page-faults   #0.001 M/sec   
   ( +-  0.00% )
   

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.

Thanks for your excellent works. But could you explain me why
current implementation not cache friendly and hpa's request cache
friendly? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) cache
space to get zero page fully cached, where N is cache associativity.
If zero page is 2M, cache pressure is significant.

On other hand with table of 4k zero pages (hpa's proposal) will increase
pressure on TLB, since we have more pages for the same memory area. So we
have to do more page translation in this case.

On my test machine with simple memcmp() virtual huge zero page is faster.
But it highly depends on TLB size, cache size, memory access and page
translation costs.

It looks like cache size in modern processors grows faster than TLB size.


Oh, I see, thanks for your quick response. Another one question below,




The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Mirobenchmark1
==

test:
 posix_memalign((void **)p, 2 * MB, 8 * GB);
 for (i = 0; i  100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile (: : :memory);
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

   32356.272845 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 40 context-switches  #0.001 K/sec  
  ( +-  0.94% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.130 K/sec  
  ( +-  0.00% )
 76,712,481,765 cycles#2.371 GHz
  ( +-  0.13% ) [83.31%]
 36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle   
  ( +-  0.28% ) [83.35%]
  1,684,049,110 stalled-cycles-backend#2.20% backend  cycles idle   
  ( +-  2.96% ) [66.67%]
134,355,715,816 instructions  #1.75  insns per cycle
  #0.27  stalled cycles per 
insn  ( +-  0.10% ) [83.35%]
 13,526,169,702 branches  #  418.039 M/sec  
  ( +-  0.10% ) [83.31%]
  1,058,230 branch-misses #0.01% of all branches
  ( +-  0.91% ) [83.36%]

   32.413866442 seconds time elapsed
  ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

    30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
              38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
               0 CPU-migrations            #    0.000 K/sec
           4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
  71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
  31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
     773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
 134,982,215,437 instructions              #    1.88  insns per cycle
                                           #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
  13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
       1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]

    30.381324695 seconds time elapsed                                          ( +-  0.13% )

Could you tell me which data I should look at in these performance
counter stats? And what's the benefit of your current implementation
compared to hpa's proposal?


Sorry for my ignorance. Could you tell me which numbers matter in these
performance counter stats? The same question applies to the second
benchmark's counter stats, thanks in advance. :-)

Microbenchmark2
===============

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 1000; i++) {
 char *_p = p;
 while (_p < p + 4*GB) {
 assert(*_p == *(_p+4*GB));
 _p += 4096;
 asm volatile ("": : :"memory");
 }
 }

hzp:
  Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

    3505.727639 task-clock                 #    0.998 CPUs utilized            ( +-  0.26% )

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Kirill A. Shutemov
On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote:
 On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:
 On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:
 By hpa request I've tried alternative approach for hzp implementation (see
 Virtual huge zero page patchset): pmd table with all entries set to zero
 page. This way should be more cache friendly, but it increases TLB
 pressure.
 Thanks for your excellent work. But could you explain why the
 current implementation is not cache friendly while hpa's proposal
 is? Thanks in advance.
 In workloads like microbenchmark1 you need N * size(zero page) cache
 space to get zero page fully cached, where N is cache associativity.
 If zero page is 2M, cache pressure is significant.
 
 On the other hand, a table of 4k zero pages (hpa's proposal) increases
 pressure on the TLB, since we have more pages for the same memory area. So we
 have to do more page translation in this case.
 
 On my test machine with simple memcmp() virtual huge zero page is faster.
 But it highly depends on TLB size, cache size, memory access and page
 translation costs.
 
 It looks like cache size in modern processors grows faster than TLB size.
 
 Oh, I see, thanks for your quick response. Another question below,
 
 
 The problem with virtual huge zero page: it requires per-arch enabling.
 We need a way to mark that pmd table has all ptes set to zero page.
 
 Some numbers to compare two implementations (on 4s Westmere-EX):
 
 Microbenchmark1
 ===============
 
 test:
  posix_memalign((void **)&p, 2 * MB, 8 * GB);
  for (i = 0; i < 100; i++) {
  assert(memcmp(p, p + 4*GB, 4*GB) == 0);
  asm volatile ("": : :"memory");
  }
 
 hzp:
   Performance counter stats for './test_memcmp' (5 runs):
 
32356.272845 task-clock#0.998 CPUs utilized 
 ( +-  0.13% )
  40 context-switches  #0.001 K/sec 
 ( +-  0.94% )
   0 CPU-migrations#0.000 K/sec
   4,218 page-faults   #0.130 K/sec 
 ( +-  0.00% )
  76,712,481,765 cycles#2.371 GHz   
 ( +-  0.13% ) [83.31%]
  36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles 
  idle ( +-  0.28% ) [83.35%]
   1,684,049,110 stalled-cycles-backend#2.20% backend  cycles 
  idle ( +-  2.96% ) [66.67%]
 134,355,715,816 instructions  #1.75  insns per cycle
   #0.27  stalled cycles 
  per insn  ( +-  0.10% ) [83.35%]
  13,526,169,702 branches  #  418.039 M/sec 
 ( +-  0.10% ) [83.31%]
   1,058,230 branch-misses #0.01% of all branches   
 ( +-  0.91% ) [83.36%]
 
32.413866442 seconds time elapsed   
 ( +-  0.13% )
 
 vhzp:
   Performance counter stats for './test_memcmp' (5 runs):
 
30327.183829 task-clock#0.998 CPUs utilized 
 ( +-  0.13% )
  38 context-switches  #0.001 K/sec 
 ( +-  1.53% )
   0 CPU-migrations#0.000 K/sec
   4,218 page-faults   #0.139 K/sec 
 ( +-  0.01% )
  71,964,773,660 cycles#2.373 GHz   
 ( +-  0.13% ) [83.35%]
  31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles 
  idle ( +-  0.40% ) [83.32%]
 773,484,474 stalled-cycles-backend#1.07% backend  cycles 
  idle ( +-  6.61% ) [66.67%]
 134,982,215,437 instructions  #1.88  insns per cycle
   #0.23  stalled cycles 
  per insn  ( +-  0.11% ) [83.32%]
  13,509,150,683 branches  #  445.447 M/sec 
 ( +-  0.11% ) [83.34%]
   1,017,667 branch-misses #0.01% of all branches   
 ( +-  1.07% ) [83.32%]
 
30.381324695 seconds time elapsed   
 ( +-  0.13% )
 Could you tell me which data I should care in this performance
 counter. And what's the benefit of your current implementation
 compare to hpa's request?
 
 Sorry for my ignorance. Could you tell me which numbers matter in
 these performance counter stats? The same question applies to the
 second benchmark's counter stats, thanks in advance. :-)

I've missed relevant counters in this run, you can see them in the second
benchmark.

Relevant counters:
L1-dcache-*, LLC-*: shows cache related stats (hits/misses);
dTLB-*: shows data TLB hits and misses.

Indirect relevant counters:
stalled-cycles-*: how long CPU pipeline has to wait for data.

 Microbenchmark2
 ===============
 
 test:
 posix_memalign((void **)&p, 2 * MB, 

Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-16 Thread Ni zhan Chen

On 10/16/2012 07:28 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote:

On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:

On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.

Thanks for your excellent work. But could you explain why the
current implementation is not cache friendly while hpa's proposal
is? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) cache
space to get zero page fully cached, where N is cache associativity.
If zero page is 2M, cache pressure is significant.

On the other hand, a table of 4k zero pages (hpa's proposal) increases
pressure on the TLB, since we have more pages for the same memory area. So we
have to do more page translation in this case.

On my test machine with simple memcmp() virtual huge zero page is faster.
But it highly depends on TLB size, cache size, memory access and page
translation costs.

It looks like cache size in modern processors grows faster than TLB size.

Oh, I see, thanks for your quick response. Another question below,


The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for (i = 0; i < 100; i++) {
 assert(memcmp(p, p + 4*GB, 4*GB) == 0);
 asm volatile ("": : :"memory");
 }

hzp:
  Performance counter stats for './test_memcmp' (5 runs):

   32356.272845 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 40 context-switches  #0.001 K/sec  
  ( +-  0.94% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.130 K/sec  
  ( +-  0.00% )
 76,712,481,765 cycles#2.371 GHz
  ( +-  0.13% ) [83.31%]
 36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle   
  ( +-  0.28% ) [83.35%]
  1,684,049,110 stalled-cycles-backend#2.20% backend  cycles idle   
  ( +-  2.96% ) [66.67%]
134,355,715,816 instructions  #1.75  insns per cycle
  #0.27  stalled cycles per 
insn  ( +-  0.10% ) [83.35%]
 13,526,169,702 branches  #  418.039 M/sec  
  ( +-  0.10% ) [83.31%]
  1,058,230 branch-misses #0.01% of all branches
  ( +-  0.91% ) [83.36%]

   32.413866442 seconds time elapsed
  ( +-  0.13% )

vhzp:
  Performance counter stats for './test_memcmp' (5 runs):

   30327.183829 task-clock#0.998 CPUs utilized  
  ( +-  0.13% )
 38 context-switches  #0.001 K/sec  
  ( +-  1.53% )
  0 CPU-migrations#0.000 K/sec
  4,218 page-faults   #0.139 K/sec  
  ( +-  0.01% )
 71,964,773,660 cycles#2.373 GHz
  ( +-  0.13% ) [83.35%]
 31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle   
  ( +-  0.40% ) [83.32%]
773,484,474 stalled-cycles-backend#1.07% backend  cycles idle   
  ( +-  6.61% ) [66.67%]
134,982,215,437 instructions  #1.88  insns per cycle
  #0.23  stalled cycles per 
insn  ( +-  0.11% ) [83.32%]
 13,509,150,683 branches  #  445.447 M/sec  
  ( +-  0.11% ) [83.34%]
  1,017,667 branch-misses #0.01% of all branches
  ( +-  1.07% ) [83.32%]

   30.381324695 seconds time elapsed
  ( +-  0.13% )

Could you tell me which data I should care in this performance
counter. And what's the benefit of your current implementation
compare to hpa's request?

Sorry for my ignorance. Could you tell me which numbers matter in
these performance counter stats? The same question applies to the
second benchmark's counter stats, thanks in advance. :-)

I've missed relevant counters in this run, you can see them in the second
benchmark.

Relevant counters:
L1-dcache-*, LLC-*: shows cache related stats (hits/misses);
dTLB-*: shows data TLB hits and misses.

Indirect relevant counters:
stalled-cycles-*: how long CPU pipeline has to wait for data.


Oh, I see, thanks for your patience. :-)




Microbenchmark2
===============

test:
 posix_memalign((void **)&p, 2 * MB, 8 * GB);
 for 

[PATCH v4 00/10, REBASED] Introduce huge zero page

2012-10-15 Thread Kirill A. Shutemov
From: "Kirill A. Shutemov" 

Hi,

Andrew, here's huge zero page patchset rebased to v3.7-rc1.

Andrea, I've dropped your Reviewed-by due to not-so-trivial conflicts during
rebase. Could you look through it again? Patches 2, 3, 4, 7, 10 had conflicts,
mostly due to the new MMU notifiers interface.

=

During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is the lack of a zero page in the THP
case. We have to allocate a real page on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
char *p;
int i;

posix_memalign((void **)&p, 2 * MB, 200 * MB);
for (i = 0; i < 200 * MB; i+= 4096)
assert(p[i] == 0);
pause();
return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.  The way we allocate it changes through the patchset:

- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as
we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear
it. If ENOMEM, graceful fallback: we create a new pmd table and set the pte
around the fault address to a newly allocated normal (4k) page. All other
ptes in the pmd are set to the normal zero page.

We cannot split the hzp (and it's a bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to the normal zero page.

Patchset organized in bisect-friendly way:
 Patches 01-07: prepare all code paths for hzp
 Patch 08: all code paths are covered: safe to setup hzp
 Patch 09: lazy allocation
 Patch 10: lockless refcounting for hzp

v4:
 - Rebase to v3.7-rc1;
 - Update commit message;
v3:
 - fix potential deadlock in refcounting code on preemptive kernel.
 - do not mark huge zero page as movable.
 - fix typo in comment.
 - Reviewed-by tag from Andrea Arcangeli.
v2:
 - Avoid find_vma() if we've already had vma on stack.
   Suggested by Andrea Arcangeli.
 - Implement refcounting for huge zero page.

--

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.

The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
posix_memalign((void **)&p, 2 * MB, 8 * GB);
for (i = 0; i < 100; i++) {
assert(memcmp(p, p + 4*GB, 4*GB) == 0);
asm volatile ("": : :"memory");
}

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

    32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
              40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
               0 CPU-migrations            #    0.000 K/sec
           4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
  76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
  36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
   1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
 134,355,715,816 instructions              #    1.75  insns per cycle
                                           #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
  13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
       1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

    32.413866442 seconds time elapsed                                          ( +-  0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

    30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
              38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
               0 CPU-migrations            #    0.000 K/sec
           4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
  71,964,773,660 cycles                    #    2.373 
