Re: [PATCH v3 00/10] Introduce huge zero page
On Wed, Oct 17, 2012 at 10:32:13AM +0800, Ni zhan Chen wrote:
> On 10/03/2012 08:04 AM, Kirill A. Shutemov wrote:
> > On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> > > On Tue, 2 Oct 2012 18:19:22 +0300 "Kirill A. Shutemov" wrote:
> > > >
> > > > During testing I noticed big (up to 2.5 times) memory consumption
> > > > overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.
> > > >
> > > > The main reason for that big difference is the lack of a zero page
> > > > in the THP case. We have to allocate a real page on read page fault.
> > > >
> > > > A program to demonstrate the issue:
> > > > #include <assert.h>
> > > > #include <stdlib.h>
> > > > #include <unistd.h>
> > > >
> > > > #define MB 1024*1024
> > > >
> > > > int main(int argc, char **argv)
> > > > {
> > > > 	char *p;
> > > > 	int i;
> > > >
> > > > 	posix_memalign((void **)&p, 2 * MB, 200 * MB);
> > > > 	for (i = 0; i < 200 * MB; i += 4096)
> > > > 		assert(p[i] == 0);
> > > > 	pause();
> > > > 	return 0;
> > > > }
> > > >
> > > > With thp-never RSS is about 400k, but with thp-always it's 200M.
> > > > After the patchset thp-always RSS is 400k too.
> > >
> > > I'd like to see a full description of the design, please.
> >
> > Okay. Design overview.
> >
> > Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
> > with zeros. The way we allocate it changes over the patchset:
> >
> > - [01/10] simplest way: hzp allocated at boot time in hugepage_init();
> > - [09/10] lazy allocation on first use;
> > - [10/10] lockless refcounting + shrinker-reclaimable hzp.
> >
> > We set it up in do_huge_pmd_anonymous_page() if the area around the
> > fault address is suitable for THP and we've got a read page fault.
> > If we fail to set up hzp (ENOMEM) we fall back to handle_pte_fault()
> > as we normally do in THP.
> >
> > On a wp fault to hzp we allocate real memory for the huge page and
> > clear it. If ENOMEM, graceful fallback: we create a new pmd table and
> > set the pte around the fault address to a newly allocated normal (4k)
> > page. All other ptes in the pmd are set to the normal zero page.
> >
> > We cannot split hzp (and it's a bug if we try), but we can split the
> > pmd which points to it. On splitting the pmd we create a table with
> > all ptes set to the normal zero page.
> >
> > The patchset is organized in a bisect-friendly way:
> >  Patches 01-07: prepare all code paths for hzp
> >  Patch 08: all code paths are covered: safe to set up hzp
> >  Patch 09: lazy allocation
> >  Patch 10: lockless refcounting for hzp
> >
> > --
> >
> > At hpa's request I've tried an alternative approach for the hzp
> > implementation (see the "Virtual huge zero page" patchset): a pmd
> > table with all entries set to the zero page. This way should be more
> > cache friendly, but it increases TLB pressure.
> >
> > The problem with the virtual huge zero page: it requires per-arch
> > enabling. We need a way to mark that a pmd table has all ptes set to
> > the zero page.
> >
> > Some numbers comparing the two implementations (on 4s Westmere-EX):
> >
> > Microbenchmark1
> > ===============
> >
> > test:
> > 	posix_memalign((void **)&p, 2 * MB, 8 * GB);
> > 	for (i = 0; i < 100; i++) {
> > 		assert(memcmp(p, p + 4*GB, 4*GB) == 0);
> > 		asm volatile ("": : :"memory");
> > 	}
> >
> > hzp:
> >  Performance counter stats for './test_memcmp' (5 runs):
> >
> >       32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
> >                 40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
> >                  0 CPU-migrations            #    0.000 K/sec
> >              4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
> >     76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
> >     36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
> >      1,684,049,110 stalled-cycles-backend    #    2.20% backend cycles idle      ( +-  2.96% ) [66.67%]
> >    134,355,715,816 instructions              #    1.75 insns per cycle
> >                                              #    0.27 stalled cycles per insn   ( +-  0.10% ) [83.35%]
> >     13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
> >          1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
> >
> >       32.413866442 seconds time elapsed                                          ( +-  0.13% )
> >
> > vhzp:
> >  Performance counter stats for './test_memcmp' (5 runs):
> >
> >       30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
> >                 38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
> >                  0 CPU-migrations            #    0.000 K/sec
> >              4,218 page-faults               #    0.139 K/sec
Re: [PATCH v3 00/10] Introduce huge zero page
On Wed, 3 Oct 2012 03:04:02 +0300 "Kirill A. Shutemov" wrote:

> Is the overview complete enough? Have I answered all your questions here?

Yes, thanks!  The design overview is short enough to be put in as code
comments in suitable places.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 00/10] Introduce huge zero page
On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> On Tue, 2 Oct 2012 18:19:22 +0300 "Kirill A. Shutemov" wrote:
>
> > During testing I noticed big (up to 2.5 times) memory consumption
> > overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.
> >
> > The main reason for that big difference is the lack of a zero page in
> > the THP case. We have to allocate a real page on read page fault.
> >
> > A program to demonstrate the issue:
> > #include <assert.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> >
> > #define MB 1024*1024
> >
> > int main(int argc, char **argv)
> > {
> > 	char *p;
> > 	int i;
> >
> > 	posix_memalign((void **)&p, 2 * MB, 200 * MB);
> > 	for (i = 0; i < 200 * MB; i += 4096)
> > 		assert(p[i] == 0);
> > 	pause();
> > 	return 0;
> > }
> >
> > With thp-never RSS is about 400k, but with thp-always it's 200M.
> > After the patchset thp-always RSS is 400k too.
>
> I'd like to see a full description of the design, please.

Okay. Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
with zeros. The way we allocate it changes over the patchset:

- [01/10] simplest way: hzp allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the
fault address is suitable for THP and we've got a read page fault.
If we fail to set up hzp (ENOMEM) we fall back to handle_pte_fault()
as we normally do in THP.

On a wp fault to hzp we allocate real memory for the huge page and
clear it. If ENOMEM, graceful fallback: we create a new pmd table and
set the pte around the fault address to a newly allocated normal (4k)
page. All other ptes in the pmd are set to the normal zero page.

We cannot split hzp (and it's a bug if we try), but we can split the
pmd which points to it. On splitting the pmd we create a table with all
ptes set to the normal zero page.

The patchset is organized in a bisect-friendly way:
 Patches 01-07: prepare all code paths for hzp
 Patch 08: all code paths are covered: safe to set up hzp
 Patch 09: lazy allocation
 Patch 10: lockless refcounting for hzp

--

At hpa's request I've tried an alternative approach for the hzp
implementation (see the "Virtual huge zero page" patchset): a pmd table
with all entries set to the zero page. This way should be more cache
friendly, but it increases TLB pressure.

The problem with the virtual huge zero page: it requires per-arch
enabling. We need a way to mark that a pmd table has all ptes set to
the zero page.

Some numbers comparing the two implementations (on 4s Westmere-EX):

Microbenchmark1
===============

test:
	posix_memalign((void **)&p, 2 * MB, 8 * GB);
	for (i = 0; i < 100; i++) {
		assert(memcmp(p, p + 4*GB, 4*GB) == 0);
		asm volatile ("": : :"memory");
	}

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
    76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend    #    2.20% backend cycles idle      ( +-  2.96% ) [66.67%]
   134,355,715,816 instructions              #    1.75 insns per cycle
                                             #    0.27 stalled cycles per insn   ( +-  0.10% ) [83.35%]
    13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
         1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                          ( +-  0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
    71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend    #    1.07% backend cycles idle      ( +-  6.61% ) [66.67%]
   134,982,215,437 instructions              #    1.88 insns per cycle
Re: [PATCH v3 00/10] Introduce huge zero page
Hi Andrew,

On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> From reading the code, it appears that we initially allocate a huge
> page and point the pmd at that.  If/when there is a write fault against
> that page we then populate the mm with ptes which point at the normal
> 4k zero page and populate the pte at the fault address with a newly
> allocated page?  Correct and complete?  If not, please fix ;)

During the cow, we never use 4k ptes, unless the 2m page allocation
fails.

> Also, IIRC, the early versions of the patch did not allocate the
> initial huge page at all - it immediately filled the mm with ptes which
> point at the normal 4k zero page.  Is that a correct recollection?
> If so, why the change?

That was a different design, yes. The design in this patchset will not
do that.

> Also IIRC, Andrea had a little test app which demonstrated the TLB
> costs of the initial approach, and they were high?

Yes, we ran the benchmarks yesterday; this version is the one that
decreases the TLB cost, and that seems the safest tradeoff.

> Please, let's capture all this knowledge in a single place, right here
> in the changelog.  And in code comments, where appropriate.  Otherwise
> people won't know why we made these decisions unless they go off and
> find lengthy, years-old and quite possibly obsolete email threads.

Agreed ;).

> Also, you've presented some data on the memory savings, but no
> quantitative testing results on the performance cost.  Both you and
> Andrea have run these tests and those results are important.  Let's
> capture them here.  And when designing such tests we should not just
> try to demonstrate the benefits of a code change - we should think of
> test cases which might be adversely affected and run those as well.

Right.

> It's not an appropriate time to be merging new features - please plan
> on preparing this patchset against 3.7-rc1.

Ok, I assume Kirill will take care of it.

Thanks,
Andrea
Re: [PATCH v3 00/10] Introduce huge zero page
On Tue, 2 Oct 2012 18:19:22 +0300 "Kirill A. Shutemov" wrote:

> During testing I noticed big (up to 2.5 times) memory consumption
> overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is the lack of a zero page in
> the THP case. We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
> 	char *p;
> 	int i;
>
> 	posix_memalign((void **)&p, 2 * MB, 200 * MB);
> 	for (i = 0; i < 200 * MB; i += 4096)
> 		assert(p[i] == 0);
> 	pause();
> 	return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patchset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

From reading the code, it appears that we initially allocate a huge
page and point the pmd at that.  If/when there is a write fault against
that page we then populate the mm with ptes which point at the normal
4k zero page and populate the pte at the fault address with a newly
allocated page?  Correct and complete?  If not, please fix ;)

Also, IIRC, the early versions of the patch did not allocate the
initial huge page at all - it immediately filled the mm with ptes which
point at the normal 4k zero page.  Is that a correct recollection?
If so, why the change?

Also IIRC, Andrea had a little test app which demonstrated the TLB
costs of the initial approach, and they were high?

Please, let's capture all this knowledge in a single place, right here
in the changelog.  And in code comments, where appropriate.  Otherwise
people won't know why we made these decisions unless they go off and
find lengthy, years-old and quite possibly obsolete email threads.

Also, you've presented some data on the memory savings, but no
quantitative testing results on the performance cost.  Both you and
Andrea have run these tests and those results are important.  Let's
capture them here.  And when designing such tests we should not just
try to demonstrate the benefits of a code change - we should think of
test cases which might be adversely affected and run those as well.

It's not an appropriate time to be merging new features - please plan
on preparing this patchset against 3.7-rc1.
Re: [PATCH v3 00/10] Introduce huge zero page
On Tue, Oct 02, 2012 at 06:19:22PM +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov"
>
> During testing I noticed big (up to 2.5 times) memory consumption
> overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is the lack of a zero page in
> the THP case. We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
> 	char *p;
> 	int i;
>
> 	posix_memalign((void **)&p, 2 * MB, 200 * MB);
> 	for (i = 0; i < 200 * MB; i += 4096)
> 		assert(p[i] == 0);
> 	pause();
> 	return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patchset thp-always RSS is 400k too.
>
> v3:
>  - fix potential deadlock in refcounting code on preemptive kernel.
>  - do not mark huge zero page as movable.
>  - fix typo in comment.
>  - Reviewed-by tag from Andrea Arcangeli.
> v2:
>  - Avoid find_vma() if we've already had vma on stack.
>    Suggested by Andrea Arcangeli.
>  - Implement refcounting for huge zero page.
>
> Kirill A. Shutemov (10):
>   thp: huge zero page: basic preparation
>   thp: zap_huge_pmd(): zap huge zero pmd
>   thp: copy_huge_pmd(): copy huge zero page
>   thp: do_huge_pmd_wp_page(): handle huge zero page
>   thp: change_huge_pmd(): keep huge zero page write-protected
>   thp: change split_huge_page_pmd() interface
>   thp: implement splitting pmd for huge zero page
>   thp: setup huge zero page on non-write page fault
>   thp: lazy huge zero page allocation
>   thp: implement refcounting for huge zero page

Reviewed-by: Andrea Arcangeli
[PATCH v3 00/10] Introduce huge zero page
From: "Kirill A. Shutemov"

During testing I noticed big (up to 2.5 times) memory consumption
overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is the lack of a zero page in
the THP case. We have to allocate a real page on read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
	char *p;
	int i;

	posix_memalign((void **)&p, 2 * MB, 200 * MB);
	for (i = 0; i < 200 * MB; i += 4096)
		assert(p[i] == 0);
	pause();
	return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.

v3:
 - fix potential deadlock in refcounting code on preemptive kernel.
 - do not mark huge zero page as movable.
 - fix typo in comment.
 - Reviewed-by tag from Andrea Arcangeli.
v2:
 - Avoid find_vma() if we've already had vma on stack.
   Suggested by Andrea Arcangeli.
 - Implement refcounting for huge zero page.

Kirill A. Shutemov (10):
  thp: huge zero page: basic preparation
  thp: zap_huge_pmd(): zap huge zero pmd
  thp: copy_huge_pmd(): copy huge zero page
  thp: do_huge_pmd_wp_page(): handle huge zero page
  thp: change_huge_pmd(): keep huge zero page write-protected
  thp: change split_huge_page_pmd() interface
  thp: implement splitting pmd for huge zero page
  thp: setup huge zero page on non-write page fault
  thp: lazy huge zero page allocation
  thp: implement refcounting for huge zero page

 Documentation/vm/transhuge.txt |    4 +-
 arch/x86/kernel/vm86_32.c      |    2 +-
 fs/proc/task_mmu.c             |    2 +-
 include/linux/huge_mm.h        |   14 ++-
 include/linux/mm.h             |    8 +
 mm/huge_memory.c               |  307
 mm/memory.c                    |   11 +--
 mm/mempolicy.c                 |    2 +-
 mm/mprotect.c                  |    2 +-
 mm/mremap.c                    |    2 +-
 mm/pagewalk.c                  |    2 +-
 11 files changed, 305 insertions(+), 51 deletions(-)

--
1.7.7.6
[PATCH v3 00/10] Introduce huge zero page
From: Kirill A. Shutemov kirill.shute...@linux.intel.com During testing I noticed big (up to 2.5 times) memory consumption overhead on some workloads (e.g. ft.A from NPB) if THP is enabled. The main reason for that big difference is lacking zero page in THP case. We have to allocate a real page on read page fault. A program to demonstrate the issue: #include assert.h #include stdlib.h #include unistd.h #define MB 1024*1024 int main(int argc, char **argv) { char *p; int i; posix_memalign((void **)p, 2 * MB, 200 * MB); for (i = 0; i 200 * MB; i+= 4096) assert(p[i] == 0); pause(); return 0; } With thp-never RSS is about 400k, but with thp-always it's 200M. After the patcheset thp-always RSS is 400k too. v3: - fix potential deadlock in refcounting code on preemptive kernel. - do not mark huge zero page as movable. - fix typo in comment. - Reviewed-by tag from Andrea Arcangeli. v2: - Avoid find_vma() if we've already had vma on stack. Suggested by Andrea Arcangeli. - Implement refcounting for huge zero page. Kirill A. 
Shutemov (10): thp: huge zero page: basic preparation thp: zap_huge_pmd(): zap huge zero pmd thp: copy_huge_pmd(): copy huge zero page thp: do_huge_pmd_wp_page(): handle huge zero page thp: change_huge_pmd(): keep huge zero page write-protected thp: change split_huge_page_pmd() interface thp: implement splitting pmd for huge zero page thp: setup huge zero page on non-write page fault thp: lazy huge zero page allocation thp: implement refcounting for huge zero page Documentation/vm/transhuge.txt |4 +- arch/x86/kernel/vm86_32.c |2 +- fs/proc/task_mmu.c |2 +- include/linux/huge_mm.h| 14 ++- include/linux/mm.h |8 + mm/huge_memory.c | 307 mm/memory.c| 11 +-- mm/mempolicy.c |2 +- mm/mprotect.c |2 +- mm/mremap.c|2 +- mm/pagewalk.c |2 +- 11 files changed, 305 insertions(+), 51 deletions(-) -- 1.7.7.6 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 00/10] Introduce huge zero page
On Tue, Oct 02, 2012 at 06:19:22PM +0300, Kirill A. Shutemov wrote: From: Kirill A. Shutemov kirill.shute...@linux.intel.com During testing I noticed big (up to 2.5 times) memory consumption overhead on some workloads (e.g. ft.A from NPB) if THP is enabled. The main reason for that big difference is lacking zero page in THP case. We have to allocate a real page on read page fault. A program to demonstrate the issue: #include assert.h #include stdlib.h #include unistd.h #define MB 1024*1024 int main(int argc, char **argv) { char *p; int i; posix_memalign((void **)p, 2 * MB, 200 * MB); for (i = 0; i 200 * MB; i+= 4096) assert(p[i] == 0); pause(); return 0; } With thp-never RSS is about 400k, but with thp-always it's 200M. After the patcheset thp-always RSS is 400k too. v3: - fix potential deadlock in refcounting code on preemptive kernel. - do not mark huge zero page as movable. - fix typo in comment. - Reviewed-by tag from Andrea Arcangeli. v2: - Avoid find_vma() if we've already had vma on stack. Suggested by Andrea Arcangeli. - Implement refcounting for huge zero page. Kirill A. Shutemov (10): thp: huge zero page: basic preparation thp: zap_huge_pmd(): zap huge zero pmd thp: copy_huge_pmd(): copy huge zero page thp: do_huge_pmd_wp_page(): handle huge zero page thp: change_huge_pmd(): keep huge zero page write-protected thp: change split_huge_page_pmd() interface thp: implement splitting pmd for huge zero page thp: setup huge zero page on non-write page fault thp: lazy huge zero page allocation thp: implement refcounting for huge zero page Reviewed-by: Andrea Arcangeli aarca...@redhat.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 00/10] Introduce huge zero page
On Tue, 2 Oct 2012 18:19:22 +0300
"Kirill A. Shutemov" <kirill.shute...@linux.intel.com> wrote:

> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is lacking zero page in THP case.
> We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
>
> 	#include <assert.h>
> 	#include <stdlib.h>
> 	#include <unistd.h>
>
> 	#define MB 1024*1024
>
> 	int main(int argc, char **argv)
> 	{
> 		char *p;
> 		int i;
>
> 		posix_memalign((void **)&p, 2 * MB, 200 * MB);
> 		for (i = 0; i < 200 * MB; i += 4096)
> 			assert(p[i] == 0);
> 		pause();
> 		return 0;
> 	}
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patchset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

From reading the code, it appears that we initially allocate a huge page
and point the pmd at that. If/when there is a write fault against that
page we then populate the mm with ptes which point at the normal 4k zero
page and populate the pte at the fault address with a newly allocated
page? Correct and complete? If not, please fix ;)

Also, IIRC, the early versions of the patch did not allocate the initial
huge page at all - it immediately filled the mm with ptes which point at
the normal 4k zero page. Is that a correct recollection? If so, why the
change?

Also IIRC, Andrea had a little test app which demonstrated the TLB costs
of the initial approach, and they were high?

Please, let's capture all this knowledge in a single place, right here in
the changelog. And in code comments, where appropriate. Otherwise people
won't know why we made these decisions unless they go off and find
lengthy, years-old and quite possibly obsolete email threads.

Also, you've presented some data on the memory savings, but no
quantitative testing results on the performance cost. Both you and Andrea
have run these tests and those results are important. Let's capture them
here.
And when designing such tests we should not just try to demonstrate the
benefits of a code change - we should think of test cases which might be
adversely affected and run those as well.

It's not an appropriate time to be merging new features - please plan on
preparing this patchset against 3.7-rc1.
Re: [PATCH v3 00/10] Introduce huge zero page
Hi Andrew,

On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> From reading the code, it appears that we initially allocate a huge page
> and point the pmd at that. If/when there is a write fault against that
> page we then populate the mm with ptes which point at the normal 4k zero
> page and populate the pte at the fault address with a newly allocated
> page? Correct and complete? If not, please fix ;)

During the cow, we never use 4k ptes, unless the 2m page allocation
fails.

> Also, IIRC, the early versions of the patch did not allocate the initial
> huge page at all - it immediately filled the mm with ptes which point at
> the normal 4k zero page. Is that a correct recollection? If so, why the
> change?

That was a different design, yes. The design in this patchset will not do
that.

> Also IIRC, Andrea had a little test app which demonstrated the TLB costs
> of the initial approach, and they were high?

Yes, we ran the benchmarks yesterday; this version is the one that
decreases the TLB cost, and that seems the safest tradeoff.

> Please, let's capture all this knowledge in a single place, right here in
> the changelog. And in code comments, where appropriate. Otherwise people
> won't know why we made these decisions unless they go off and find
> lengthy, years-old and quite possibly obsolete email threads.

Agreed ;).

> Also, you've presented some data on the memory savings, but no
> quantitative testing results on the performance cost. Both you and Andrea
> have run these tests and those results are important. Let's capture them
> here.
>
> And when designing such tests we should not just try to demonstrate the
> benefits of a code change - we should think of test cases which might be
> adversely affected and run those as well.

Right.

> It's not an appropriate time to be merging new features - please plan on
> preparing this patchset against 3.7-rc1.

Ok, I assume Kirill will take care of it.
Thanks,
Andrea
Re: [PATCH v3 00/10] Introduce huge zero page
On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> On Tue, 2 Oct 2012 18:19:22 +0300
> "Kirill A. Shutemov" <kirill.shute...@linux.intel.com> wrote:
>
> > During testing I noticed big (up to 2.5 times) memory consumption
> > overhead on some workloads (e.g. ft.A from NPB) if THP is enabled.
> >
> > The main reason for that big difference is lacking zero page in THP
> > case. We have to allocate a real page on read page fault.
> >
> > A program to demonstrate the issue:
> >
> > 	#include <assert.h>
> > 	#include <stdlib.h>
> > 	#include <unistd.h>
> >
> > 	#define MB 1024*1024
> >
> > 	int main(int argc, char **argv)
> > 	{
> > 		char *p;
> > 		int i;
> >
> > 		posix_memalign((void **)&p, 2 * MB, 200 * MB);
> > 		for (i = 0; i < 200 * MB; i += 4096)
> > 			assert(p[i] == 0);
> > 		pause();
> > 		return 0;
> > 	}
> >
> > With thp-never RSS is about 400k, but with thp-always it's 200M.
> > After the patchset thp-always RSS is 400k too.
>
> I'd like to see a full description of the design, please.

Okay. Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
with zeros. The way we allocate it changes through the patchset:

- [01/10] simplest way: hzp is allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up hzp (ENOMEM) we fall back to handle_pte_fault() as
we normally do in THP.

On a wp fault to hzp we allocate real memory for the huge page and clear
it. If ENOMEM, graceful fallback: we create a new pmd table and set the
pte around the fault address to a newly allocated normal (4k) page. All
other ptes in the pmd are set to the normal zero page.

We cannot split hzp (and it's a bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to the normal zero page.
Patchset is organized in a bisect-friendly way:
 Patches 01-07: prepare all code paths for hzp
 Patch 08: all code paths are covered: safe to set up hzp
 Patch 09: lazy allocation
 Patch 10: lockless refcounting for hzp

--

By hpa's request I've tried an alternative approach for the hzp
implementation (see the "Virtual huge zero page" patchset): a pmd table
with all entries set to the zero page. This way should be more cache
friendly, but it increases TLB pressure.

The problem with the virtual huge zero page: it requires per-arch
enabling. We need a way to mark that a pmd table has all ptes set to the
zero page.

Some numbers to compare the two implementations (on 4s Westmere-EX):

Microbenchmark 1
================

test:
	posix_memalign((void **)&p, 2 * MB, 8 * GB);
	for (i = 0; i < 100; i++) {
		assert(memcmp(p, p + 4*GB, 4*GB) == 0);
		asm volatile ("": : :"memory");
	}

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
    76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend    #    2.20% backend cycles idle      ( +-  2.96% ) [66.67%]
   134,355,715,816 instructions              #    1.75  insns per cycle
                                             #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
    13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
         1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                          ( +-  0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
    71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend    #    1.07% backend cycles idle
                                             ( +-  6.61% ) [66.67%]
   134,982,215,437 instructions              #    1.88  insns per cycle
                                             #
Re: [PATCH v3 00/10] Introduce huge zero page
On Wed, 3 Oct 2012 03:04:02 +0300
"Kirill A. Shutemov" <kir...@shutemov.name> wrote:

> Is the overview complete enough? Have I answered all your questions here?

Yes, thanks! The design overview is short enough to be put in as code
comments in suitable places.