RE: [PATCH] mm/compaction: remove unused variable sysctl_compact_memory
> From: pi...@codeaurora.org
> Sent: Wednesday, March 3, 2021 6:34 AM
> To: Nitin Gupta
> Subject: Re: [PATCH] mm/compaction: remove unused variable
>  sysctl_compact_memory
>
> On 2021-03-03 01:48, Nitin Gupta wrote:
> >> From: Pintu Kumar
> >> Sent: Tuesday, March 2, 2021 9:56 AM
> >> Subject: [PATCH] mm/compaction: remove unused variable
> >>  sysctl_compact_memory
> >>
> >> The sysctl_compact_memory variable is mostly unused in
> >> mm/compaction.c; it just acts as a placeholder for the sysctl.
> >>
> >> Thus we can remove it from here and move the declaration directly
> >> into kernel/sysctl.c itself.
> >> This will also eliminate the extern declaration from the header file.
> > I prefer keeping the existing pattern of listing all compaction
> > related tunables together in compaction.h:
> >
> >     extern int sysctl_compact_memory;
> >     extern unsigned int sysctl_compaction_proactiveness;
> >     extern int sysctl_extfrag_threshold;
> >     extern int sysctl_compact_unevictable_allowed;
>
> Thanks Nitin for your review.
> You mean you just wanted to retain this extern declaration?
> Is there any real benefit to keeping this declaration if it is not
> used elsewhere?
> I see that sysctl_compaction_handler() doesn't use the sysctl value
> at all. So, we can get rid of it completely, as Vlastimil suggested.
>
> >> No functionality is broken or changed this way.
> >>
> >> Signed-off-by: Pintu Kumar
> >> Signed-off-by: Pintu Agarwal
> >> ---
> >>  include/linux/compaction.h | 1 -
> >>  kernel/sysctl.c            | 1 +
> >>  mm/compaction.c            | 3 ---
> >>  3 files changed, 1 insertion(+), 4 deletions(-)
> >>
> >> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> >> index ed4070e..4221888 100644
> >> --- a/include/linux/compaction.h
> >> +++ b/include/linux/compaction.h
> >> @@ -81,7 +81,6 @@ static inline unsigned long compact_gap(unsigned int order)
> >>  }
> >>
> >>  #ifdef CONFIG_COMPACTION
> >> -extern int sysctl_compact_memory;
> >>  extern unsigned int sysctl_compaction_proactiveness;
> >>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >>  		void *buffer, size_t *length, loff_t *ppos);
> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >> index c9fbdd8..66aff21 100644
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -198,6 +198,7 @@ static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
> >>  #ifdef CONFIG_COMPACTION
> >>  static int min_extfrag_threshold;
> >>  static int max_extfrag_threshold = 1000;
> >> +static int sysctl_compact_memory;
> >>  #endif
> >>
> >>  #endif /* CONFIG_SYSCTL */
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index 190ccda..ede2886 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -2650,9 +2650,6 @@ static void compact_nodes(void)
> >>  		compact_node(nid);
> >>  }
> >>
> >> -/* The written value is actually unused, all memory is compacted */
> >> -int sysctl_compact_memory;
> >> -
> >
> > Please retain this comment for the tunable.
>
> Sorry, I could not understand.
> You mean to say just retain this last comment and only remove the
> variable?
> Again, is there any real benefit you see in retaining this even if it
> is not used?

You are just moving the declaration of sysctl_compact_memory from
compaction.c to sysctl.c. So, I wanted the comment "... all memory is
compacted" to be retained with the sysctl variable. Since you are now
getting rid of this variable completely, this comment goes away too.

Thanks,
Nitin
RE: [PATCH] mm/compaction: remove unused variable sysctl_compact_memory
> From: Pintu Kumar
> Sent: Tuesday, March 2, 2021 9:56 AM
> Subject: [PATCH] mm/compaction: remove unused variable
>  sysctl_compact_memory
>
> The sysctl_compact_memory variable is mostly unused in mm/compaction.c;
> it just acts as a placeholder for the sysctl.
>
> Thus we can remove it from here and move the declaration directly into
> kernel/sysctl.c itself.
> This will also eliminate the extern declaration from the header file.

I prefer keeping the existing pattern of listing all compaction related
tunables together in compaction.h:

    extern int sysctl_compact_memory;
    extern unsigned int sysctl_compaction_proactiveness;
    extern int sysctl_extfrag_threshold;
    extern int sysctl_compact_unevictable_allowed;

> No functionality is broken or changed this way.
>
> Signed-off-by: Pintu Kumar
> Signed-off-by: Pintu Agarwal
> ---
>  include/linux/compaction.h | 1 -
>  kernel/sysctl.c            | 1 +
>  mm/compaction.c            | 3 ---
>  3 files changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index ed4070e..4221888 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -81,7 +81,6 @@ static inline unsigned long compact_gap(unsigned int order)
>  }
>
>  #ifdef CONFIG_COMPACTION
> -extern int sysctl_compact_memory;
>  extern unsigned int sysctl_compaction_proactiveness;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  		void *buffer, size_t *length, loff_t *ppos);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index c9fbdd8..66aff21 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -198,6 +198,7 @@ static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
>  #ifdef CONFIG_COMPACTION
>  static int min_extfrag_threshold;
>  static int max_extfrag_threshold = 1000;
> +static int sysctl_compact_memory;
>  #endif
>
>  #endif /* CONFIG_SYSCTL */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 190ccda..ede2886 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2650,9 +2650,6 @@ static void compact_nodes(void)
>  		compact_node(nid);
>  }
>
> -/* The written value is actually unused, all memory is compacted */
> -int sysctl_compact_memory;
> -

Please retain this comment for the tunable.

-Nitin
[PATCH] mm: Fix compile error due to COMPACTION_HPAGE_ORDER
Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER. The correct way to check if this constant is
defined is to check for CONFIG_HUGETLBFS.

Signed-off-by: Nitin Gupta
To: Andrew Morton
Reported-by: Nathan Chancellor
Tested-by: Nathan Chancellor
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 45fd24a0ea0b..02963ffb9e70 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -62,7 +62,7 @@ static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
  */
 #if defined CONFIG_TRANSPARENT_HUGEPAGE
 #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
-#elif defined HUGETLB_PAGE_ORDER
+#elif defined CONFIG_HUGETLBFS
 #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
 #else
 #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
--
2.27.0
Re: [PATCH v8] mm: Proactive compaction
On 6/22/20 9:57 PM, Nathan Chancellor wrote: > On Mon, Jun 22, 2020 at 09:32:12PM -0700, Nitin Gupta wrote: >> On 6/22/20 7:26 PM, Nathan Chancellor wrote: >>> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote: >>>> For some applications, we need to allocate almost all memory as >>>> hugepages. However, on a running system, higher-order allocations can >>>> fail if the memory is fragmented. Linux kernel currently does on-demand >>>> compaction as we request more hugepages, but this style of compaction >>>> incurs very high latency. Experiments with one-time full memory >>>> compaction (followed by hugepage allocations) show that kernel is able >>>> to restore a highly fragmented memory state to a fairly compacted memory >>>> state within <1 sec for a 32G system. Such data suggests that a more >>>> proactive compaction can help us allocate a large fraction of memory as >>>> hugepages keeping allocation latencies low. >>>> >>>> For a more proactive compaction, the approach taken here is to define a >>>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds >>>> for external fragmentation which kcompactd tries to maintain. >>>> >>>> The tunable takes a value in range [0, 100], with a default of 20. >>>> >>>> Note that a previous version of this patch [1] was found to introduce >>>> too many tunables (per-order extfrag{low, high}), but this one reduces >>>> them to just one sysctl. Also, the new tunable is an opaque value >>>> instead of asking for specific bounds of "external fragmentation", which >>>> would have been difficult to estimate. The internal interpretation of >>>> this opaque value allows for future fine-tuning. >>>> >>>> Currently, we use a simple translation from this tunable to [low, high] >>>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). >>>> The score for a node is defined as weighted mean of per-zone external >>>> fragmentation. A zone's present_pages determines its weight. 
>>>>
>>>> To periodically check per-node score, we reuse per-node kcompactd
>>>> threads, which are woken up every 500 milliseconds to check the same.
>>>> If a node's score exceeds its high threshold (as derived from
>>>> user-provided proactiveness value), proactive compaction is started
>>>> until its score reaches its low threshold value. By default,
>>>> proactiveness is set to 20, which implies threshold values of low=80
>>>> and high=90.
>>>>
>>>> This patch is largely based on ideas from Michal Hocko [2]. See also
>>>> the LWN article [3].
>>>>
>>>> Performance data
>>>> ================
>>>>
>>>> System: x64_64, 1T RAM, 80 CPU threads.
>>>> Kernel: 5.6.0-rc3 + this patch
>>>>
>>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>>>
>>>> Before starting the driver, the system was fragmented from a
>>>> userspace program that allocates all memory and then, for each 2M
>>>> aligned section, frees 3/4 of base pages using munmap. The workload
>>>> is mainly anonymous userspace pages, which are easy to move around.
>>>> I intentionally avoided unmovable pages in this test to see how much
>>>> latency we incur when hugepage allocations hit direct compaction.
>>>>
>>>> 1. Kernel hugepage allocation latencies
>>>>
>>>> With the system in such a fragmented state, a kernel driver then
>>>> allocates as many hugepages as possible and measures allocation
>>>> latency:
>>>>
>>>> (all latency values are in microseconds)
>>>>
>>>> - With vanilla 5.6.0-rc3
>>>>
>>>> percentile latency
>>>> –––––––––– –––––––
>>>>          5    7894
>>>>         10    9496
>>>>         25   12561
>>>>         30   15295
>>>>         40   18244
>>>>         50   21229
>>>>         60   27556
>>>>         75   30147
>>>>         80   31047
>>>>         90   32859
>>>>         95   33799
>>>>
>>>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>>>> 762G total free => 98% of free memory could be allocated as hugepages)
Re: [PATCH v8] mm: Proactive compaction
On 6/22/20 7:26 PM, Nathan Chancellor wrote: > On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote: >> For some applications, we need to allocate almost all memory as >> hugepages. However, on a running system, higher-order allocations can >> fail if the memory is fragmented. Linux kernel currently does on-demand >> compaction as we request more hugepages, but this style of compaction >> incurs very high latency. Experiments with one-time full memory >> compaction (followed by hugepage allocations) show that kernel is able >> to restore a highly fragmented memory state to a fairly compacted memory >> state within <1 sec for a 32G system. Such data suggests that a more >> proactive compaction can help us allocate a large fraction of memory as >> hugepages keeping allocation latencies low. >> >> For a more proactive compaction, the approach taken here is to define a >> new sysctl called 'vm.compaction_proactiveness' which dictates bounds >> for external fragmentation which kcompactd tries to maintain. >> >> The tunable takes a value in range [0, 100], with a default of 20. >> >> Note that a previous version of this patch [1] was found to introduce >> too many tunables (per-order extfrag{low, high}), but this one reduces >> them to just one sysctl. Also, the new tunable is an opaque value >> instead of asking for specific bounds of "external fragmentation", which >> would have been difficult to estimate. The internal interpretation of >> this opaque value allows for future fine-tuning. >> >> Currently, we use a simple translation from this tunable to [low, high] >> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). >> The score for a node is defined as weighted mean of per-zone external >> fragmentation. A zone's present_pages determines its weight. >> >> To periodically check per-node score, we reuse per-node kcompactd >> threads, which are woken up every 500 milliseconds to check the same. 
If >> a node's score exceeds its high threshold (as derived from user-provided >> proactiveness value), proactive compaction is started until its score >> reaches its low threshold value. By default, proactiveness is set to 20, >> which implies threshold values of low=80 and high=90. >> >> This patch is largely based on ideas from Michal Hocko [2]. See also the >> LWN article [3]. >> >> Performance data >> >> >> System: x64_64, 1T RAM, 80 CPU threads. >> Kernel: 5.6.0-rc3 + this patch >> >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >> >> Before starting the driver, the system was fragmented from a userspace >> program that allocates all memory and then for each 2M aligned section, >> frees 3/4 of base pages using munmap. The workload is mainly anonymous >> userspace pages, which are easy to move around. I intentionally avoided >> unmovable pages in this test to see how much latency we incur when >> hugepage allocations hit direct compaction. >> >> 1. 
Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>> percentile latency
>> –––––––––– –––––––
>>          5    7894
>>         10    9496
>>         25   12561
>>         30   15295
>>         40   18244
>>         50   21229
>>         60   27556
>>         75   30147
>>         80   31047
>>         90   32859
>>         95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>> percentile latency
>> –––––––––– –––––––
>>          5       2
>>         10       2
>>         25       3
>>         30       3
>>         40       3
>>         50       4
>>         60       4
>>         75       4
>>         80       4
>>         90       5
>>         95     429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. JAVA heap allocation
>>
>> I
Re: [PATCH] mm: Use unsigned types for fragmentation score
On 6/18/20 6:41 AM, Baoquan He wrote:
> On 06/17/20 at 06:03pm, Nitin Gupta wrote:
>> Proactive compaction uses per-node/zone "fragmentation score" which
>> is always in the range [0, 100], so use an unsigned type for these
>> scores as well as for related constants.
>>
>> Signed-off-by: Nitin Gupta
>> ---
>>  include/linux/compaction.h |  4 ++--
>>  kernel/sysctl.c            |  2 +-
>>  mm/compaction.c            | 18 +++++++++---------
>>  mm/vmstat.c                |  2 +-
>>  4 files changed, 13 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 7a242d46454e..25a521d299c1 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -85,13 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
>>
>>  #ifdef CONFIG_COMPACTION
>>  extern int sysctl_compact_memory;
>> -extern int sysctl_compaction_proactiveness;
>> +extern unsigned int sysctl_compaction_proactiveness;
>>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>>  		void *buffer, size_t *length, loff_t *ppos);
>>  extern int sysctl_extfrag_threshold;
>>  extern int sysctl_compact_unevictable_allowed;
>>
>> -extern int extfrag_for_order(struct zone *zone, unsigned int order);
>> +extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order);
>>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>>  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>>  		unsigned int order, unsigned int alloc_flags,
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 58b0a59c9769..40180cdde486 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2833,7 +2833,7 @@ static struct ctl_table vm_table[] = {
>>  	{
>>  		.procname	= "compaction_proactiveness",
>>  		.data		= &sysctl_compaction_proactiveness,
>> -		.maxlen		= sizeof(int),
>> +		.maxlen		= sizeof(sysctl_compaction_proactiveness),
>
> Patch looks good to me. Wondering why you're not using 'unsigned int'
> here, just curious.

It's just coding style preference.
I see the same style used for many other sysctls too (min_free_kbytes,
etc.).

Thanks,
Nitin
[PATCH] mm: Use unsigned types for fragmentation score
Proactive compaction uses per-node/zone "fragmentation score" which is
always in the range [0, 100], so use an unsigned type for these scores
as well as for related constants.

Signed-off-by: Nitin Gupta
---
 include/linux/compaction.h |  4 ++--
 kernel/sysctl.c            |  2 +-
 mm/compaction.c            | 18 +++++++++---------
 mm/vmstat.c                |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7a242d46454e..25a521d299c1 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -85,13 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)

 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
-extern int sysctl_compaction_proactiveness;
+extern unsigned int sysctl_compaction_proactiveness;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *length, loff_t *ppos);
 extern int sysctl_extfrag_threshold;
 extern int sysctl_compact_unevictable_allowed;

-extern int extfrag_for_order(struct zone *zone, unsigned int order);
+extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order);
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		unsigned int order, unsigned int alloc_flags,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58b0a59c9769..40180cdde486 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2833,7 +2833,7 @@ static struct ctl_table vm_table[] = {
 	{
 		.procname	= "compaction_proactiveness",
 		.data		= &sysctl_compaction_proactiveness,
-		.maxlen		= sizeof(int),
+		.maxlen		= sizeof(sysctl_compaction_proactiveness),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
diff --git a/mm/compaction.c b/mm/compaction.c
index ac2030814edb..45fd24a0ea0b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -53,7 +53,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
 /*
  * Fragmentation score check interval for proactive compaction purposes.
  */
-static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
+static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;

 /*
  * Page order with-respect-to which proactive compaction
@@ -1890,7 +1890,7 @@ static bool kswapd_is_running(pg_data_t *pgdat)
  * ZONE_DMA32. For smaller zones, the score value remains close to zero,
  * and thus never exceeds the high threshold for proactive compaction.
  */
-static int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone(struct zone *zone)
 {
 	unsigned long score;

@@ -1906,9 +1906,9 @@ static int fragmentation_score_zone(struct zone *zone)
  * the node's score falls below the low threshold, or one of the back-off
  * conditions is met.
  */
-static int fragmentation_score_node(pg_data_t *pgdat)
+static unsigned int fragmentation_score_node(pg_data_t *pgdat)
 {
-	unsigned long score = 0;
+	unsigned int score = 0;
 	int zoneid;

 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
@@ -1921,17 +1921,17 @@ static int fragmentation_score_node(pg_data_t *pgdat)
 	return score;
 }

-static int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
+static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
 {
-	int wmark_low;
+	unsigned int wmark_low;

 	/*
	 * Cap the low watermak to avoid excessive compaction
	 * activity in case a user sets the proactivess tunable
	 * close to 100 (maximum).
	 */
-	wmark_low = max(100 - sysctl_compaction_proactiveness, 5);
-	return low ? wmark_low : min(wmark_low + 10, 100);
+	wmark_low = max(100U - sysctl_compaction_proactiveness, 5U);
+	return low ? wmark_low : min(wmark_low + 10, 100U);
 }

 static bool should_proactive_compact_node(pg_data_t *pgdat)
@@ -2604,7 +2604,7 @@ int sysctl_compact_memory;
  * aggressively the kernel should compact memory in the
  * background. It takes values in the range [0, 100].
  */
-int __read_mostly sysctl_compaction_proactiveness = 20;
+unsigned int __read_mostly sysctl_compaction_proactiveness = 20;

 /*
  * This is the entry point for compacting all nodes via
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3e7ba8bce2ba..b1de695b826d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1079,7 +1079,7 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
  * It is defined as the percentage of pages found in blocks of size
  * less than 1 << order. It returns values in range [0, 100].
  */
-int extfrag_for_order(struct zone *zone, unsigned int order)
+unsigned int extfrag_for_order(struct zone *zone,
Re: [PATCH v8] mm: Proactive compaction
On 6/17/20 1:53 PM, Andrew Morton wrote:
> On Tue, 16 Jun 2020 13:45:27 -0700 Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does
>> on-demand compaction as we request more hugepages, but this style of
>> compaction incurs very high latency. Experiments with one-time full
>> memory compaction (followed by hugepage allocations) show that kernel
>> is able to restore a highly fragmented memory state to a fairly
>> compacted memory state within <1 sec for a 32G system. Such data
>> suggests that a more proactive compaction can help us allocate a large
>> fraction of memory as hugepages keeping allocation latencies low.
>> ...
>
> All looks straightforward to me and easy to disable if it goes wrong.
> All the hard-coded magic numbers are a worry, but such is life.
>
> One teeny complaint:
>
>> ...
>> @@ -2650,12 +2801,34 @@ static int kcompactd(void *p)
>>  		unsigned long pflags;
>>
>>  		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
>> -		wait_event_freezable(pgdat->kcompactd_wait,
>> -				kcompactd_work_requested(pgdat));
>> +		if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
>> +			kcompactd_work_requested(pgdat),
>> +			msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
>> +
>> +			psi_memstall_enter(&pflags);
>> +			kcompactd_do_work(pgdat);
>> +			psi_memstall_leave(&pflags);
>> +			continue;
>> +		}
>> -		psi_memstall_enter(&pflags);
>> -		kcompactd_do_work(pgdat);
>> -		psi_memstall_leave(&pflags);
>> +		/* kcompactd wait timeout */
>> +		if (should_proactive_compact_node(pgdat)) {
>> +			unsigned int prev_score, score;
>
> Everywhere else, scores have type `int'. Here they are unsigned. How
> come? Would it be better to make these unsigned throughout? I don't
> think a score can ever be negative?

The score is always in [0, 100], so yes, it should be unsigned. I will
send another patch which fixes this.

Thanks,
Nitin
[PATCH v8] mm: Proactive compaction
node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, the node's score quickly rises above the high threshold (90)
and proactive compaction starts again, which brings down the score to
the low threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during
the runtime of this Java benchmark. kcompactd threads consume 100% of
one of the CPUs while trying to bring a node's score within thresholds.

Backoff behavior
================

The above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing the first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of
  times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta
Reviewed-by: Vlastimil Babka
Reviewed-by: Khalid Aziz
Reviewed-by: Oleksandr Natalenko
Tested-by: Oleksandr Natalenko
To: Andrew Morton
CC: Vlastimil Babka
CC: Khalid Aziz
CC: Michal Hocko
CC: Mel Gorman
CC: Matthew Wilcox
CC: Mike Kravetz
CC: Joonsoo Kim
CC: David Rientjes
CC: Nitin Gupta
CC: Oleksandr Natalenko
CC: linux-kernel
CC: linux-mm
CC: Linux API
---
Changelog v8 vs v7:
- Rebase to 5.8-rc1

Changelog v7 vs v6:
- Fix compile error while THP is disabled (Oleksandr)

Changelog v6 vs v5:
- Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined,
  and some cleanups (Vlastimil)
- Cap min threshold to avoid excess compaction load in case user sets
  extreme values like 100 for `vm.compaction_proactiveness` sysctl
  (Khalid)
- Add some more explanation about the effect of the tunable on
  compaction behavior in the user guide (Khalid)

Changelog v5 vs v4:
- Change tunable from sysfs to sysctl (Vlastimil)
- Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
- Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
- Document various functions.
- Added admin-guide for the new tunable `proactiveness`.
- Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
- Make proactiveness a global tunable and not per-node. Also updated
  the patch description to reflect the same (Vlastimil Babka).
- Don't start proactive compaction if kswapd is running (Vlastimil
  Babka).
- Clarified in the description that compaction runs in parallel with
  the workload, instead of a one-time compaction followed by a stream
  of hugepage allocations.

Changelog v2 vs v1:
- Introduce per-node and per-zone "proactive compaction score". This
  score is compared against watermarks which are set according to the
  user-provided proactiveness value.
- Separate code-paths for proactive compaction from targeted
  compaction, i.e. where pgdat->kcompactd_max_order is non-zero.
- Renamed hpage_compaction_effort -> proactiveness. In future we may
  use more than extfrag wrt hugepage size to determine proactive
  compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  15 ++
 include/linux/compaction.h              |   2 +
 kernel/sysctl.c                         |   9 ++
 mm/compaction.c                         | 183 +++++++++++++++++++++++-
 mm/internal.h                           |   1 +
 mm/vmstat.c                             |  18 +++
 6 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index d46d5b7013c6..4b7c496199ca 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.

+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it
Re: [PATCH v7] mm: Proactive compaction
On 6/16/20 2:46 AM, Oleksandr Natalenko wrote: > Hello. > > Please see the notes inline. > > On Mon, Jun 15, 2020 at 07:36:14AM -0700, Nitin Gupta wrote: >> For some applications, we need to allocate almost all memory as >> hugepages. However, on a running system, higher-order allocations can >> fail if the memory is fragmented. Linux kernel currently does on-demand >> compaction as we request more hugepages, but this style of compaction >> incurs very high latency. Experiments with one-time full memory >> compaction (followed by hugepage allocations) show that kernel is able >> to restore a highly fragmented memory state to a fairly compacted memory >> state within <1 sec for a 32G system. Such data suggests that a more >> proactive compaction can help us allocate a large fraction of memory as >> hugepages keeping allocation latencies low. >> >> For a more proactive compaction, the approach taken here is to define a >> new sysctl called 'vm.compaction_proactiveness' which dictates bounds >> for external fragmentation which kcompactd tries to maintain. >> >> The tunable takes a value in range [0, 100], with a default of 20. >> >> Note that a previous version of this patch [1] was found to introduce >> too many tunables (per-order extfrag{low, high}), but this one reduces >> them to just one sysctl. Also, the new tunable is an opaque value >> instead of asking for specific bounds of "external fragmentation", which >> would have been difficult to estimate. The internal interpretation of >> this opaque value allows for future fine-tuning. >> >> Currently, we use a simple translation from this tunable to [low, high] >> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). >> The score for a node is defined as weighted mean of per-zone external >> fragmentation. A zone's present_pages determines its weight. >> >> To periodically check per-node score, we reuse per-node kcompactd >> threads, which are woken up every 500 milliseconds to check the same. 
If >> a node's score exceeds its high threshold (as derived from user-provided >> proactiveness value), proactive compaction is started until its score >> reaches its low threshold value. By default, proactiveness is set to 20, >> which implies threshold values of low=80 and high=90. >> >> This patch is largely based on ideas from Michal Hocko [2]. See also the >> LWN article [3]. >> >> Performance data >> >> >> System: x64_64, 1T RAM, 80 CPU threads. >> Kernel: 5.6.0-rc3 + this patch >> >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >> >> Before starting the driver, the system was fragmented from a userspace >> program that allocates all memory and then for each 2M aligned section, >> frees 3/4 of base pages using munmap. The workload is mainly anonymous >> userspace pages, which are easy to move around. I intentionally avoided >> unmovable pages in this test to see how much latency we incur when >> hugepage allocations hit direct compaction. >> >> 1. 
Kernel hugepage allocation latencies >> >> With the system in such a fragmented state, a kernel driver then >> allocates as many hugepages as possible and measures allocation latency: >> >> (all latency values are in microseconds) >> >> - With vanilla 5.6.0-rc3
>>
>> percentile latency
>>          5    7894
>>         10    9496
>>         25   12561
>>         30   15295
>>         40   18244
>>         50   21229
>>         60   27556
>>         75   30147
>>         80   31047
>>         90   32859
>>         95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of >> 762G total free => 98% of free memory could be allocated as hugepages) >> >> - With 5.6.0-rc3 + this patch, with proactiveness=20 >> >> sysctl -w vm.compaction_proactiveness=20
>>
>> percentile latency
>>          5       2
>>         10       2
>>         25       3
>>         30       3
>>         40       3
>>         50       4
>>         60       4
>>         75       4
>>         80       4
>>         90       5
>>         95     429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of >> 762G total free => 98% of free memory could be allocated as hugepages)
[PATCH v7] mm: Proactive compaction
e's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives more-or-less time for the initial round of compaction. As the benchmark consumes hugepages, node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings down the score to the low threshold level (80). Repeat. bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while they try to bring a node's score within thresholds. Backoff behavior Above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect: - Created a kernel driver that allocates almost all memory as hugepages followed by freeing first 3/4 of each hugepage. - Set proactiveness=40 - Note that proactive_compact_node() is deferred maximum number of times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check (=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/ [2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/ [3] https://lwn.net/Articles/817905/ Signed-off-by: Nitin Gupta Reviewed-by: Vlastimil Babka Reviewed-by: Khalid Aziz To: Andrew Morton CC: Vlastimil Babka CC: Khalid Aziz CC: Michal Hocko CC: Mel Gorman CC: Matthew Wilcox CC: Mike Kravetz CC: Joonsoo Kim CC: David Rientjes CC: Nitin Gupta CC: Oleksandr Natalenko CC: linux-kernel CC: linux-mm CC: Linux API --- Changelog v7 vs v6: - Fix compile error while THP is disabled (Oleksandr) Changelog v6 vs v5: - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and some cleanups (Vlastimil) - Cap min threshold to avoid excess compaction load in case user sets extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid) - Add some more explanation about the effect of tunable on compaction behavior in user guide (Khalid) Changelog v5 vs v4: - Change tunable from sysfs to sysctl (Vlastimil) - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil) - Minor cleanups (remove redundant initializations, ...) Changelog v4 vs v3: - Document various functions. - Added admin-guide for the new tunable `proactiveness`. - Rename proactive_compaction_score to fragmentation_score for clarity. Changelog v3 vs v2: - Make proactiveness a global tunable and not per-node. Also updated the patch description to reflect the same (Vlastimil Babka). - Don't start proactive compaction if kswapd is running (Vlastimil Babka). - Clarified in the description that compaction runs in parallel with the workload, instead of a one-time compaction followed by a stream of hugepage allocations. Changelog v2 vs v1: - Introduce per-node and per-zone "proactive compaction score". This score is compared against watermarks which are set according to user provided proactiveness value. - Separate code-paths for proactive compaction from targeted compaction i.e. where pgdat->kcompactd_max_order is non-zero.
- Renamed hpage_compaction_effort -> proactiveness. In future we may use more than extfrag wrt hugepage size to determine proactive compaction score. --- Documentation/admin-guide/sysctl/vm.rst | 15 ++ include/linux/compaction.h | 2 + kernel/sysctl.c | 9 ++ mm/compaction.c | 183 +++- mm/internal.h | 1 + mm/vmstat.c | 18 +++ 6 files changed, 223 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 0329a4d3fa9e..360914b4f346 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required. +compaction_proactiveness + + +This tunable takes a value in the range [0, 100] with a default value of +20. This tunable determines how aggressively compaction is done in the +background. Setting it to 0 disables proactive compaction. + +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications. The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. + +Be careful when setting it to extreme values like 100, as that may +cause excessive background compaction activity. compact_unevictable_allowed
Re: [PATCH v6] mm: Proactive compaction
On 6/15/20 7:25 AM, Oleksandr Natalenko wrote: > On Mon, Jun 15, 2020 at 10:29:01AM +0200, Oleksandr Natalenko wrote: >> Just to let you know, this fails to compile for me with THP disabled on >> v5.8-rc1: >> >> CC mm/compaction.o >> In file included from ./include/linux/dev_printk.h:14, >> from ./include/linux/device.h:15, >> from ./include/linux/node.h:18, >> from ./include/linux/cpu.h:17, >> from mm/compaction.c:11: >> In function ‘fragmentation_score_zone’, >> inlined from ‘__compact_finished’ at mm/compaction.c:1982:11, >> inlined from ‘compact_zone’ at mm/compaction.c:2062:8: >> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ >> declared with attribute error: BUILD_BUG failed >> 392 | _compiletime_assert(condition, msg, __compiletime_assert_, >> __COUNTER__) >> | ^ >> ./include/linux/compiler.h:373:4: note: in definition of macro >> ‘__compiletime_assert’ >> 373 |prefix ## suffix();\ >> |^~ >> ./include/linux/compiler.h:392:2: note: in expansion of macro >> ‘_compiletime_assert’ >> 392 | _compiletime_assert(condition, msg, __compiletime_assert_, >> __COUNTER__) >> | ^~~ >> ./include/linux/build_bug.h:39:37: note: in expansion of macro >> ‘compiletime_assert’ >>39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) >> | ^~ >> ./include/linux/build_bug.h:59:21: note: in expansion of macro >> ‘BUILD_BUG_ON_MSG’ >>59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") >> | ^~~~ >> ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’ >> 319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) >> |^ >> ./include/linux/huge_mm.h:115:26: note: in expansion of macro >> ‘HPAGE_PMD_SHIFT’ >> 115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) >> | ^~~ >> mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’ >>64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER >> |^~~ >> mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’ >> 1898 
|extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); >> |^~ >> In function ‘fragmentation_score_zone’, >> inlined from ‘kcompactd’ at mm/compaction.c:1918:12: >> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ >> declared with attribute error: BUILD_BUG failed >> 392 | _compiletime_assert(condition, msg, __compiletime_assert_, >> __COUNTER__) >> | ^ >> ./include/linux/compiler.h:373:4: note: in definition of macro >> ‘__compiletime_assert’ >> 373 |prefix ## suffix();\ >> |^~ >> ./include/linux/compiler.h:392:2: note: in expansion of macro >> ‘_compiletime_assert’ >> 392 | _compiletime_assert(condition, msg, __compiletime_assert_, >> __COUNTER__) >> | ^~~ >> ./include/linux/build_bug.h:39:37: note: in expansion of macro >> ‘compiletime_assert’ >>39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) >> | ^~ >> ./include/linux/build_bug.h:59:21: note: in expansion of macro >> ‘BUILD_BUG_ON_MSG’ >>59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") >> | ^~~~ >> ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’ >> 319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) >> |^ >> ./include/linux/huge_mm.h:115:26: note: in expansion of macro >> ‘HPAGE_PMD_SHIFT’ >> 115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) >> | ^~~ >> mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’ >>64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER >> |^~~ >> mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’ >> 1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); >> |^~ >> In function ‘fragmentation_score_zone’, >> inlined from ‘kcompactd’ at mm/compaction.c:1918:12: >> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ >> declared with attribute error: BUILD_BUG failed >> 392 | _compiletime_assert(condition, msg, __compiletime_assert_, >> __COUNTER__) >> | ^ >> ./include/linux/compiler.h:373:4: note: in definition of macro >> 
‘__compiletime_assert’ >> 373 |prefix ## suffix();\ >>
Re: [PATCH v6] mm: Proactive compaction
On 6/9/20 12:23 PM, Khalid Aziz wrote: > On Mon, 2020-06-01 at 12:48 -0700, Nitin Gupta wrote: >> For some applications, we need to allocate almost all memory as >> hugepages. However, on a running system, higher-order allocations can >> fail if the memory is fragmented. Linux kernel currently does on- >> demand >> compaction as we request more hugepages, but this style of compaction >> incurs very high latency. Experiments with one-time full memory >> compaction (followed by hugepage allocations) show that kernel is >> able >> to restore a highly fragmented memory state to a fairly compacted >> memory >> state within <1 sec for a 32G system. Such data suggests that a more >> proactive compaction can help us allocate a large fraction of memory >> as >> hugepages keeping allocation latencies low. >> >> For a more proactive compaction, the approach taken here is to define >> a >> new sysctl called 'vm.compaction_proactiveness' which dictates bounds >> for external fragmentation which kcompactd tries to maintain. >> >> The tunable takes a value in range [0, 100], with a default of 20. >> >> Note that a previous version of this patch [1] was found to introduce >> too many tunables (per-order extfrag{low, high}), but this one >> reduces >> them to just one sysctl. Also, the new tunable is an opaque value >> instead of asking for specific bounds of "external fragmentation", >> which >> would have been difficult to estimate. The internal interpretation of >> this opaque value allows for future fine-tuning. >> >> Currently, we use a simple translation from this tunable to [low, >> high] >> "fragmentation score" thresholds (low=100-proactiveness, >> high=low+10%). >> The score for a node is defined as weighted mean of per-zone external >> fragmentation. A zone's present_pages determines its weight. >> >> To periodically check per-node score, we reuse per-node kcompactd >> threads, which are woken up every 500 milliseconds to check the same. 
>> If >> a node's score exceeds its high threshold (as derived from user-provided >> proactiveness value), proactive compaction is started until its score >> reaches its low threshold value. By default, proactiveness is set to >> 20, >> which implies threshold values of low=80 and high=90. >> >> This patch is largely based on ideas from Michal Hocko [2]. See also >> the >> LWN article [3]. >> >> Performance data >> >> >> System: x86_64, 1T RAM, 80 CPU threads. >> Kernel: 5.6.0-rc3 + this patch >> >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >> >> Before starting the driver, the system was fragmented from a >> userspace >> program that allocates all memory and then for each 2M aligned >> section, >> frees 3/4 of base pages using munmap. The workload is mainly >> anonymous >> userspace pages, which are easy to move around. I intentionally >> avoided >> unmovable pages in this test to see how much latency we incur when >> hugepage allocations hit direct compaction. >> >> 1. Kernel hugepage allocation latencies >> >> With the system in such a fragmented state, a kernel driver then >> allocates as many hugepages as possible and measures allocation >> latency: >> >> (all latency values are in microseconds) >> >> - With vanilla 5.6.0-rc3
>>
>> percentile latency
>>          5    7894
>>         10    9496
>>         25   12561
>>         30   15295
>>         40   18244
>>         50   21229
>>         60   27556
>>         75   30147
>>         80   31047
>>         90   32859
>>         95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of >> 762G total free => 98% of free memory could be allocated as >> hugepages) >> >> - With 5.6.0-rc3 + this patch, with proactiveness=20 >> >> sysctl -w vm.compaction_proactiveness=20
>>
>> percentile latency
>>          5       2
>>         10       2
>>         25       3
>>         30       3
>>         40       3
>>         50       4
>>         60       4
>>         75       4
>>         80       4
>>         90       5
>>         95     429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
Re: [PATCH v6] mm: Proactive compaction
On Mon, Jun 1, 2020 at 12:48 PM Nitin Gupta wrote: > > For some applications, we need to allocate almost all memory as > hugepages. However, on a running system, higher-order allocations can > fail if the memory is fragmented. Linux kernel currently does on-demand > compaction as we request more hugepages, but this style of compaction > incurs very high latency. Experiments with one-time full memory > compaction (followed by hugepage allocations) show that kernel is able > to restore a highly fragmented memory state to a fairly compacted memory > state within <1 sec for a 32G system. Such data suggests that a more > proactive compaction can help us allocate a large fraction of memory as > hugepages keeping allocation latencies low. > > Signed-off-by: Nitin Gupta > Reviewed-by: Vlastimil Babka (+CC Khalid) Can this be pipelined for upstream inclusion now? Sorry, I'm a bit rusty on upstream flow these days. Thanks, Nitin
[PATCH v6] mm: Proactive compaction
e's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives more-or-less time for the initial round of compaction. As the benchmark consumes hugepages, node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings down the score to the low threshold level (80). Repeat. bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while they try to bring a node's score within thresholds. Backoff behavior Above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect: - Created a kernel driver that allocates almost all memory as hugepages followed by freeing first 3/4 of each hugepage. - Set proactiveness=40 - Note that proactive_compact_node() is deferred maximum number of times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check (=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/ [2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/ [3] https://lwn.net/Articles/817905/ Signed-off-by: Nitin Gupta Reviewed-by: Vlastimil Babka To: Mel Gorman To: Michal Hocko To: Vlastimil Babka CC: Matthew Wilcox CC: Andrew Morton CC: Mike Kravetz CC: Joonsoo Kim CC: David Rientjes CC: Nitin Gupta CC: linux-kernel CC: linux-mm CC: Linux API --- Changelog v6 vs v5: - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and some cleanups (Vlastimil) - Cap min threshold to avoid excess compaction load in case user sets extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid) - Add some more explanation about the effect of tunable on compaction behavior in user guide (Khalid) Changelog v5 vs v4: - Change tunable from sysfs to sysctl (Vlastimil) - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil) - Minor cleanups (remove redundant initializations, ...) Changelog v4 vs v3: - Document various functions. - Added admin-guide for the new tunable `proactiveness`. - Rename proactive_compaction_score to fragmentation_score for clarity. Changelog v3 vs v2: - Make proactiveness a global tunable and not per-node. Also updated the patch description to reflect the same (Vlastimil Babka). - Don't start proactive compaction if kswapd is running (Vlastimil Babka). - Clarified in the description that compaction runs in parallel with the workload, instead of a one-time compaction followed by a stream of hugepage allocations. Changelog v2 vs v1: - Introduce per-node and per-zone "proactive compaction score". This score is compared against watermarks which are set according to user provided proactiveness value. - Separate code-paths for proactive compaction from targeted compaction i.e. where pgdat->kcompactd_max_order is non-zero. - Renamed hpage_compaction_effort -> proactiveness.
In future we may use more than extfrag wrt hugepage size to determine proactive compaction score. --- Documentation/admin-guide/sysctl/vm.rst | 15 ++ include/linux/compaction.h | 2 + kernel/sysctl.c | 9 ++ mm/compaction.c | 183 +++- mm/internal.h | 1 + mm/vmstat.c | 18 +++ 6 files changed, 223 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 0329a4d3fa9e..360914b4f346 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required. +compaction_proactiveness + + +This tunable takes a value in the range [0, 100] with a default value of +20. This tunable determines how aggressively compaction is done in the +background. Setting it to 0 disables proactive compaction. + +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications. The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. + +Be careful when setting it to extreme values like 100, as that may +cause excessive background compaction activity. compact_unevictable_allowed === diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 4b898cdbdf05..ccd28978b296 100644 --- a/include/linux/
Re: [PATCH v5] mm: Proactive compaction
On Thu, May 28, 2020 at 4:32 PM Khalid Aziz wrote: > > This looks good to me. I like the idea overall of controlling > aggressiveness of compaction with a single tunable for the whole > system. I wonder how an end user could arrive at what a reasonable > value would be for this based upon their workload. More comments below. > Tunables like the one this patch introduces, and similar ones like 'swappiness' will always require some experimentations from the user. > On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote: > > For some applications, we need to allocate almost all memory as > > hugepages. However, on a running system, higher-order allocations can > > fail if the memory is fragmented. Linux kernel currently does on- > > demand > > compaction as we request more hugepages, but this style of compaction > > incurs very high latency. Experiments with one-time full memory > > compaction (followed by hugepage allocations) show that kernel is > > able > > to restore a highly fragmented memory state to a fairly compacted > > memory > > state within <1 sec for a 32G system. Such data suggests that a more > > proactive compaction can help us allocate a large fraction of memory > > as > > hugepages keeping allocation latencies low. > > > > For a more proactive compaction, the approach taken here is to define > > a new tunable called 'proactiveness' which dictates bounds for > > external > > fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to > > maintain. > > > > The tunable is exposed through sysctl: > > /proc/sys/vm/compaction_proactiveness > > > > It takes value in range [0, 100], with a default of 20. > > Looking at the code, setting this to 100 would mean system would > continuously strive to drive level of fragmentation down to 0 which can > not be reasonable and would bog the system down. A cap lower than 100 > might be a good idea to keep kcompactd from dragging system down. 
> Yes, I understand that a value of 100 would be a continuous compaction storm but I still don't want to artificially cap the tunable. The interpretation of this tunable can change in future, and a range of [0, 100] seems more intuitive than, say [0, 90]. Still, I think a word of caution should be added to its documentation (admin-guide/sysctl/vm.rst). > > > > Total 2M hugepages allocated = 383859 (749G worth of hugepages out of > > 762G total free => 98% of free memory could be allocated as > > hugepages) > > > > - With 5.6.0-rc3 + this patch, with proactiveness=20 > > > > echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness > > Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness" > oops... I forgot to update the patch description. This is from the v4 patch which used sysfs but v5 switched to using sysctl. > > > > diff --git a/Documentation/admin-guide/sysctl/vm.rst > > b/Documentation/admin-guide/sysctl/vm.rst > > index 0329a4d3fa9e..e5d88cabe980 100644 > > --- a/Documentation/admin-guide/sysctl/vm.rst > > +++ b/Documentation/admin-guide/sysctl/vm.rst > > @@ -119,6 +119,19 @@ all zones are compacted such that free memory is > > available in contiguous > > blocks where possible. This can be important for example in the > > allocation of > > huge pages although processes will also directly compact memory as > > required. > > > > +compaction_proactiveness > > + > > + > > +This tunable takes a value in the range [0, 100] with a default > > value of > > +20. This tunable determines how aggressively compaction is done in > > the > > +background. Setting it to 0 disables proactive compaction. > > + > > +Note that compaction has a non-trivial system-wide impact as pages > > +belonging to different processes are moved around, which could also > > lead > > +to latency spikes in unsuspecting applications. The kernel employs > > +various heuristics to avoid wasting CPU cycles if it detects that > > +proactive compaction is not being effective. 
> > + > > Value of 100 would cause kcompactd to try to bring fragmentation down > to 0. If hugepages are being consumed and released continuously by the > workload, it is possible that kcompactd keeps making progress (and > hence passes the test "proactive_defer = score < prev_score ?") > continuously but can not reach a fragmentation score of 0 and hence > gets stuck in compact_zone() for a long time. Page migration for > compaction is not inexpensive. Maybe either cap the value to something > less than 100 or set a floor fo
Re: [PATCH v5] mm: Proactive compaction
On Wed, May 27, 2020 at 3:18 AM Vlastimil Babka wrote: > > On 5/18/20 8:14 PM, Nitin Gupta wrote: > > For some applications, we need to allocate almost all memory as > > hugepages. However, on a running system, higher-order allocations can > > fail if the memory is fragmented. Linux kernel currently does on-demand > > compaction as we request more hugepages, but this style of compaction > > incurs very high latency. Experiments with one-time full memory > > compaction (followed by hugepage allocations) show that kernel is able > > to restore a highly fragmented memory state to a fairly compacted memory > > state within <1 sec for a 32G system. Such data suggests that a more > > proactive compaction can help us allocate a large fraction of memory as > > hugepages keeping allocation latencies low. > > > > For a more proactive compaction, the approach taken here is to define > > a new tunable called 'proactiveness' which dictates bounds for external > > fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to > > HPAGE_PMD_ORDER > Since HPAGE_PMD_ORDER is not always defined, and thus we may have to fallback to HUGETLB_PAGE_ORDER or even PMD_ORDER, I think I should remove references to the order in the patch description entirely. I also need to change the tunable name from 'proactiveness' to 'vm.compaction_proactiveness' sysctl. modified description: === For a more proactive compaction, the approach taken here is to define a new sysctl called 'vm.compaction_proactiveness' which dictates bounds for external fragmentation which kcompactd tries to ... === > > > > The tunable is exposed through sysctl: > > /proc/sys/vm/compaction_proactiveness > > > > It takes value in range [0, 100], with a default of 20. > > > > > > This patch is largely based on ideas from Michal Hocko posted here: > > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/ > > Make this link a [2] reference? I would also add: "See also the LWN article > [3]." 
where [3] is https://lwn.net/Articles/817905/ > > Sounds good. I will turn these into [2] and [3] references. > > Reviewed-by: Vlastimil Babka > > With some smaller nitpicks below. > > But as we are adding a new API, I would really appreciate others comment about > the approach at least. > > > +/* > > + * A zone's fragmentation score is the external fragmentation wrt to the > > + * HUGETLB_PAGE_ORDER scaled by the zone's size. It returns a value in the > > HPAGE_PMD_ORDER > Maybe just remove reference to the order as I mentioned above? > > +/* > > + * Tunable for proactive compaction. It determines how > > + * aggressively the kernel should compact memory in the > > + * background. It takes values in the range [0, 100]. > > + */ > > +int sysctl_compaction_proactiveness = 20; > > These are usually __read_mostly > Ok. > > + > > /* > > * This is the entry point for compacting all nodes via > > * /proc/sys/vm/compact_memory > > @@ -2637,6 +2769,7 @@ static int kcompactd(void *p) > > { > > pg_data_t *pgdat = (pg_data_t*)p; > > struct task_struct *tsk = current; > > + unsigned int proactive_defer = 0; > > > > const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); > > > > @@ -2652,12 +2785,34 @@ static int kcompactd(void *p) > > unsigned long pflags; > > > > trace_mm_compaction_kcompactd_sleep(pgdat->node_id); > > - wait_event_freezable(pgdat->kcompactd_wait, > > - kcompactd_work_requested(pgdat)); > > + if (wait_event_freezable_timeout(pgdat->kcompactd_wait, > > + kcompactd_work_requested(pgdat), > > + msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) { > > Hmm perhaps the wakeups should also backoff if there's nothing to do? Perhaps. For now, I just wanted to keep it simple and waking a thread to do a quick calculation didn't seem expensive to me, so I prefer this simplistic approach for now. > > +/* > > + * Calculates external fragmentation within a zone wrt the given order. 
> > + * It is defined as the percentage of pages found in blocks of size > > + * less than 1 << order. It returns values in range [0, 100]. > > + */ > > +int extfrag_for_order(struct zone *zone, unsigned int order) > > +{ > > + struct contig_page_info info; > > + > > + fill_contig_page_info(zone, order, &info); > > + if (info.free_pages == 0) > > + return 0; > > + > > + return (info.free_pages - (info.free_blocks_suitable << order)) * 100 > > + / info.free_pages; > > I guess this should also use div_u64() like __fragmentation_index() does. > Ok. > > +} > > + > > /* Same as __fragmentation index but allocs contig_page_info on stack */ > > int fragmentation_index(struct zone *zone, unsigned int order) > > { > > > Thanks, Nitin
Re: [PATCH v5] mm: Proactive compaction
On Thu, May 28, 2020 at 2:50 AM Vlastimil Babka wrote: > > On 5/28/20 11:15 AM, Holger Hoffstätte wrote: > > > > On 5/18/20 8:14 PM, Nitin Gupta wrote: > > [patch v5 :)] > > > > I've been successfully using this in my tree and it works great, but a > > friend > > who also uses my tree just found a bug (actually an improvement ;) due to > > the > > change from HUGETLB_PAGE_ORDER to HPAGE_PMD_ORDER in v5. > > > > When building with CONFIG_TRANSPARENT_HUGEPAGE=n (for some reason it was > > off) > > HPAGE_PMD_SHIFT expands to BUILD_BUG() and compilation fails like this: > > Oops, I forgot about this. Still I believe HPAGE_PMD_ORDER is the best choice > as > long as THP's are enabled. I guess fallback to HUGETLB_PAGE_ORDER would be > possible if THPS are not enabled, but AFAICS some architectures don't define > that. Such architectures perhaps won't benefit from proactive compaction > anyway? > I am not sure about such architectures but in such cases, we would end up calculating "fragmentation score" based on a page size which does not match the architecture's view of the "default hugepage size" which is not a terrible thing in itself as compaction can still be done in the background, after all. Since we always need a target order to calculate the fragmentation score, how about this fallback scheme: HPAGE_PMD_ORDER -> HUGETLB_PAGE_ORDER -> PMD_ORDER Thanks, Nitin
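The fallback chain Nitin proposes maps naturally onto a preprocessor ladder. This is a sketch of the proposal as stated in the thread, not the code that was eventually merged (per the v6 changelog, the patch falls back to HUGETLB_PAGE_ORDER when HPAGE_PMD_ORDER is unavailable); PMD_ORDER is spelled out as PMD_SHIFT - PAGE_SHIFT here, since not all trees define a PMD_ORDER macro:

```c
/*
 * Pick the order used for the fragmentation score, in order of
 * preference. COMPACTION_HPAGE_ORDER is the name the posted patch
 * uses; the fallback chain itself is the proposal from this email.
 */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define COMPACTION_HPAGE_ORDER	HPAGE_PMD_ORDER
#elif defined(HUGETLB_PAGE_ORDER)
#define COMPACTION_HPAGE_ORDER	HUGETLB_PAGE_ORDER
#else
#define COMPACTION_HPAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
#endif
```

Guarding on CONFIG_TRANSPARENT_HUGEPAGE rather than on HPAGE_PMD_ORDER itself avoids the BUILD_BUG() expansion Holger hit with THP disabled, since HPAGE_PMD_SHIFT is defined as `({ BUILD_BUG(); 0; })` in that configuration.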
[PATCH v5] mm: Proactive compaction
e workloads. The situation of one-time compaction, sufficient to supply hugepages for following allocation stream, can probably happen for more extreme proactiveness values, like 80 or 90. In the above Java workload, proactiveness is set to 20. The test starts with a node's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives more-or-less time for the initial round of compaction. As the benchmark consumes hugepages, node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings down the score to the low threshold level (80). Repeat. bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while they try to bring a node's score within thresholds. Backoff behavior Above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect: - Created a kernel driver that allocates almost all memory as hugepages followed by freeing first 3/4 of each hugepage. - Set proactiveness=40 - Note that proactive_compact_node() is deferred maximum number of times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check (=> ~30 seconds between retries). [1] https://patchwork.kernel.org/patch/11098289/ Signed-off-by: Nitin Gupta To: Mel Gorman To: Michal Hocko To: Vlastimil Babka CC: Matthew Wilcox CC: Andrew Morton CC: Mike Kravetz CC: Joonsoo Kim CC: David Rientjes CC: Nitin Gupta CC: linux-kernel CC: linux-mm CC: Linux API --- Changelog v5 vs v4: - Change tunable from sysfs to sysctl (Vlastimil) - HUGETLB_PAGE_ORDER -> HPAGE_PMD_ORDER (Vlastimil) - Minor cleanups (remove redundant initializations, ...) Changelog v4 vs v3: - Document various functions. - Added admin-guide for the new tunable `proactiveness`.
- Rename proactive_compaction_score to fragmentation_score for clarity. Changelog v3 vs v2: - Make proactiveness a global tunable and not per-node. Also updated the patch description to reflect the same (Vlastimil Babka). - Don't start proactive compaction if kswapd is running (Vlastimil Babka). - Clarified in the description that compaction runs in parallel with the workload, instead of a one-time compaction followed by a stream of hugepage allocations. Changelog v2 vs v1: - Introduce per-node and per-zone "proactive compaction score". This score is compared against watermarks which are set according to user provided proactiveness value. - Separate code-paths for proactive compaction from targeted compaction i.e. where pgdat->kcompactd_max_order is non-zero. - Renamed hpage_compaction_effort -> proactiveness. In future we may use more than extfrag wrt hugepage size to determine proactive compaction score. --- Documentation/admin-guide/sysctl/vm.rst | 13 ++ include/linux/compaction.h | 2 + kernel/sysctl.c | 9 ++ mm/compaction.c | 165 +++- mm/internal.h | 1 + mm/vmstat.c | 17 +++ 6 files changed, 202 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 0329a4d3fa9e..e5d88cabe980 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -119,6 +119,19 @@ all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required. +compaction_proactiveness + + +This tunable takes a value in the range [0, 100] with a default value of +20. This tunable determines how aggressively compaction is done in the +background. Setting it to 0 disables proactive compaction.
+ +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications. The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. + compact_unevictable_allowed === diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 4b898cdbdf05..ccd28978b296 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order) #ifdef CONFIG_COMPACTION extern int sysctl_compact_memory; +extern int sysctl_compaction_proactiveness; extern int sysctl_compaction_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *
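The threshold behavior described in the posting above (a proactiveness of 20 giving a low score threshold of 80 and a high threshold of 90) can be sketched as a simple mapping. This is an illustrative sketch only; the function and variable names below are assumptions, not code from the patch:

```c
/* Illustrative mapping: derive per-node fragmentation score watermarks
 * from the proactiveness sysctl, matching the behavior described above
 * where proactiveness=20 yields a low threshold of 80 and a high
 * threshold of 90.  A proactiveness of 0 gives low = high = 100, so a
 * node's score can never exceed the high mark and proactive compaction
 * is effectively disabled. */
static unsigned int sysctl_compaction_proactiveness = 20;

static unsigned int frag_score_wmark_low(void)
{
	/* Higher proactiveness => lower (more aggressive) low mark. */
	return 100U - sysctl_compaction_proactiveness;
}

static unsigned int frag_score_wmark_high(void)
{
	unsigned int low = frag_score_wmark_low();

	/* Fixed hysteresis band above the low mark, capped at 100. */
	return low + 10U > 100U ? 100U : low + 10U;
}
```

Compaction would start once a node's score exceeds the high mark and stop when it drops back to the low mark, giving the 90/80 cycling described in the Java workload above.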
[PATCH v4] mm: Proactive compaction
e workloads. The situation of one-time compaction, sufficient to supply hugepages for the following allocation stream, can probably happen for more extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts with a node's score of 80 or higher, depending on the delay between the fragmentation step and starting the benchmark, which gives more or less time for the initial round of compaction. As the benchmark consumes hugepages, the node's score quickly rises above the high threshold (90) and proactive compaction starts again, which brings the score down to the low threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during the runtime of this Java benchmark. kcompactd threads consume 100% of one of the CPUs while they try to bring a node's score within thresholds.

Backoff behavior

The above workloads produce a memory state which is easy to compact. However, if memory is filled with unmovable pages, proactive compaction should essentially back off. To test this aspect:

 - Created a kernel driver that allocates almost all memory as hugepages
   followed by freeing the first 3/4 of each hugepage.
 - Set proactiveness=40
 - Note that proactive_compact_node() is deferred the maximum number of
   times with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
   (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/

Signed-off-by: Nitin Gupta
To: Mel Gorman
To: Michal Hocko
To: Vlastimil Babka
CC: Matthew Wilcox
CC: Andrew Morton
CC: Mike Kravetz
CC: Joonsoo Kim
CC: David Rientjes
CC: Nitin Gupta
CC: linux-kernel
CC: linux-mm
CC: Linux API

---
Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with the
   workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to the
   user-provided proactiveness value.
 - Separate code paths for proactive compaction from targeted compaction,
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may use
   more than extfrag wrt hugepage size to determine the proactive
   compaction score.
---
 .../admin-guide/mm/proactive-compaction.rst |  26 ++
 MAINTAINERS                                 |   6 +
 include/linux/compaction.h                  |   1 +
 mm/compaction.c                             | 236 +-
 mm/internal.h                               |   1 +
 mm/page_alloc.c                             |   1 +
 mm/vmstat.c                                 |  17 ++
 7 files changed, 282 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/proactive-compaction.rst

diff --git a/Documentation/admin-guide/mm/proactive-compaction.rst b/Documentation/admin-guide/mm/proactive-compaction.rst
new file mode 100644
index ..510f47e38238
--- /dev/null
+++ b/Documentation/admin-guide/mm/proactive-compaction.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _proactive_compaction:
+
+====================
+Proactive Compaction
+====================
+
+Many applications benefit significantly from the use of huge pages.
+However, huge-page allocations often incur a high latency or even fail
+under fragmented memory conditions. Proactive compaction provides an
+effective solution to these problems by doing memory compaction in the
+background.
+
+The process of proactive compaction is controlled by a single tunable:
+
+    /sys/kernel/mm/compaction/proactiveness
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.

diff --git a/MAINTAINERS b/MAINTAINERS
index 26f281d9f32a..e448c0b35ecb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18737,6 +18737,12 @@ L: linux...@kvack.org
 S:	Maintained
 F:	mm/zswap.c
 
+PROACTIVE COMPACTION
+M:	Nitin Gupta
+L:	linux...@kvack.org
+S:	Maintained
+F:	Docu
Re: [RFC] mm: Proactive compaction
On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> >
> > Testing done (on x86):
> > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >   respectively.
> > - Use a test program to fragment memory: the program allocates all memory
> >   and then for each 2M aligned section, frees 3/4 of base pages using
> >   munmap.
> > - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >   compaction till extfrag < extfrag_low for order-9.
> >
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
>
> That's a lot of control knobs - how is an admin supposed to tune them to
> their needs?

Yes, it's difficult for an admin to get so many tunables right unless targeting a very specific workload. How about a simpler solution where we expose just one tunable per-node:

    /sys/.../node-x/compaction_effort

which accepts a value in [0, 100]. This parallels /proc/sys/vm/swappiness but for compaction. With this single number, we can estimate per-order [low, high] watermarks for external fragmentation like this:

 - For now, map this range to [low, medium, high] which corresponds to
   specific low, high thresholds for extfrag.
 - Apply more relaxed thresholds for higher orders than for lower orders.

With this single tunable we remove the burden of setting explicit per-order [low, high] thresholds, and it should be easier to experiment with.

-Nitin
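The single-tunable mapping proposed above could look something like the sketch below. The formula and all names (compaction_effort, effort_to_wmarks) are illustrative assumptions for discussion, not code from any posted patch:

```c
/* Hypothetical sketch: derive per-order [low, high] extfrag thresholds
 * from one per-node compaction_effort value in [0, 100].  Higher effort
 * means stricter (lower) thresholds; higher orders get progressively
 * more relaxed ones, as suggested in the reply above. */
struct extfrag_wmarks {
	int low;
	int high;
};

static struct extfrag_wmarks effort_to_wmarks(int effort, int order)
{
	struct extfrag_wmarks wm;

	wm.low = 100 - effort + order;	/* relax thresholds as order grows */
	wm.high = wm.low + 5;		/* fixed hysteresis band */
	if (wm.low > 100)
		wm.low = 100;
	if (wm.high > 100)
		wm.high = 100;
	return wm;
}
```

With effort=0 both marks sit at 100 for every order, which, like the per-order thresholds defaulting to 100 in the RFC, effectively disables background compaction.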
Re: [RFC] mm: Proactive compaction
On Thu, 2019-08-22 at 09:51 +0100, Mel Gorman wrote:
> As unappealing as it sounds, I think it is better to try improve the
> allocation latency itself instead of trying to hide the cost in a kernel
> thread. It's far harder to implement as compaction is not easy but it
> would be more obvious what the savings are by looking at a histogram of
> allocation latencies -- there are other metrics that could be considered
> but that's the obvious one.
>

Do you mean reducing allocation latency especially when it hits the direct compaction path? Do you have any ideas in mind for this? I'm open to working on them and reporting back latency numbers, while I think more on less tunable-heavy background (proactive) compaction approaches.

-Nitin
Re: [RFC] mm: Proactive compaction
On Mon, 2019-09-16 at 13:16 -0700, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
>
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the background
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresholds are set to 100 for all orders which essentially
> > disables kcompactd.
> >
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows the kcompactd thread to stay
> > inactive even if extfrag thresholds are not met.
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> >
> > Testing done (on x86):
> > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >   respectively.
> > - Use a test program to fragment memory: the program allocates all memory
> >   and then for each 2M aligned section, frees 3/4 of base pages using
> >   munmap.
> > - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >   compaction till extfrag < extfrag_low for order-9.
> >
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
>
> Is there an update to this proposal or non-RFC patch that has been posted
> for proactive compaction?
>
> We've had good success with periodically compacting memory on a regular
> cadence on systems with hugepages enabled. The cadence itself is defined
> by the admin but it causes khugepaged[*] to periodically wakeup and invoke
> compaction in an attempt to keep zones as defragmented as possible
> (perhaps more "proactive" than what is proposed here in an attempt to keep
> all memory as unfragmented as possible regardless of extfrag thresholds).
> It also avoids corner-cases where kcompactd could become more expensive
> than what is anticipated because it is unsuccessful at compacting memory
> yet the extfrag threshold is still exceeded.
>
> [*] Khugepaged instead of kcompactd only because this is only enabled
>     for systems where transparent hugepages are enabled, probably better
>     off in kcompactd to avoid duplicating work between two kthreads if
>     there is already a need for background compaction.
>

Discussion on this RFC patch revolved around the issue of exposing too many tunables (per-node, per-order, [low, high] extfrag thresholds). It was sort-of concluded that no admin will get these tunables right for a variety of workloads.

To eliminate the need for tunables, I proposed another patch:
https://patchwork.kernel.org/patch/11140067/
which does not add any tunables but extends and exports an existing function (compact_zone_order).

In summary, this new patch adds a callback function which allows any driver to implement ad-hoc compaction policies. There is also a sample driver which makes use of this interface to keep hugepage external fragmentation within a specified range (exposed through debugfs):
https://gitlab.com/nigupta/linux/snippets/1894161

-Nitin
Re: [PATCH] mm: Add callback for defining compaction completion
On Thu, 2019-09-12 at 17:11 +0530, Bharath Vedartham wrote: > Hi Nitin, > On Wed, Sep 11, 2019 at 10:33:39PM +, Nitin Gupta wrote: > > On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote: > > > On Tue 10-09-19 22:27:53, Nitin Gupta wrote: > > > [...] > > > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote: > > > > > > For some applications we need to allocate almost all memory as > > > > > > hugepages. > > > > > > However, on a running system, higher order allocations can fail if > > > > > > the > > > > > > memory is fragmented. Linux kernel currently does on-demand > > > > > > compaction > > > > > > as we request more hugepages but this style of compaction incurs > > > > > > very > > > > > > high latency. Experiments with one-time full memory compaction > > > > > > (followed by hugepage allocations) shows that kernel is able to > > > > > > restore a highly fragmented memory state to a fairly compacted > > > > > > memory > > > > > > state within <1 sec for a 32G system. Such data suggests that a > > > > > > more > > > > > > proactive compaction can help us allocate a large fraction of > > > > > > memory > > > > > > as hugepages keeping allocation latencies low. > > > > > > > > > > > > In general, compaction can introduce unexpected latencies for > > > > > > applications that don't even have strong requirements for > > > > > > contiguous > > > > > > allocations. > > > > > > Could you expand on this a bit please? Gfp flags allow to express how > > > much the allocator try and compact for a high order allocations. Hugetlb > > > allocations tend to require retrying and heavy compaction to succeed and > > > the success rate tends to be pretty high from my experience. Why that > > > is not case in your case? 
> > > > The link to the driver you send on gitlab is not working :( Sorry about that, here's the correct link: https://gitlab.com/nigupta/linux/snippets/1894161 > > Yes, I have the same observation: with `GFP_TRANSHUGE | > > __GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM > > allocated as hugepages). However, what I'm trying to point out is that > > this > > high success rate comes with high allocation latencies (90th percentile > > latency of 2206us). On the same system, the same high-order allocations > > which hit the fast path have latency <5us. > > > > > > > > It is also hard to efficiently determine if the current > > > > > > system state can be easily compacted due to mixing of unmovable > > > > > > memory. Due to these reasons, automatic background compaction by > > > > > > the > > > > > > kernel itself is hard to get right in a way which does not hurt > > > > > > unsuspecting > > > > > applications or waste CPU cycles. > > > > > > > > > > We do trigger background compaction on a high order pressure from > > > > > the > > > > > page allocator by waking up kcompactd. Why is that not sufficient? > > > > > > > > > > > > > Whenever kcompactd is woken up, it does just enough work to create > > > > one free page of the given order (compaction_control.order) or higher. > > > > > > This is an implementation detail IMHO. I am pretty sure we can do a > > > better auto tuning when there is an indication of a constant flow of > > > high order requests. This is no different from the memory reclaim in > > > principle. Just because the kswapd autotuning not fitting with your > > > particular workload you wouldn't want to export direct reclaim > > > functionality and call it from a random module. That is just doomed to > > > fail because different subsystems in control just leads to decisions > > > going against each other. > > > > > > > I don't want to go the route of adding any auto-tuning/perdiction code to > > control compaction in the kernel. 
I'm more inclined towards extending > > existing interfaces to allow compaction behavior to be controlled either > > from userspace or a kernel driver. Letting a random module control > > compaction or a root process pumping new tunables from sysfs is the same > > in > > principle.
Re: [PATCH] mm: Add callback for defining compaction completion
On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote: > On Tue 10-09-19 22:27:53, Nitin Gupta wrote: > [...] > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote: > > > > For some applications we need to allocate almost all memory as > > > > hugepages. > > > > However, on a running system, higher order allocations can fail if the > > > > memory is fragmented. Linux kernel currently does on-demand > > > > compaction > > > > as we request more hugepages but this style of compaction incurs very > > > > high latency. Experiments with one-time full memory compaction > > > > (followed by hugepage allocations) shows that kernel is able to > > > > restore a highly fragmented memory state to a fairly compacted memory > > > > state within <1 sec for a 32G system. Such data suggests that a more > > > > proactive compaction can help us allocate a large fraction of memory > > > > as hugepages keeping allocation latencies low. > > > > > > > > In general, compaction can introduce unexpected latencies for > > > > applications that don't even have strong requirements for contiguous > > > > allocations. > > Could you expand on this a bit please? Gfp flags allow to express how > much the allocator try and compact for a high order allocations. Hugetlb > allocations tend to require retrying and heavy compaction to succeed and > the success rate tends to be pretty high from my experience. Why that > is not case in your case? > Yes, I have the same observation: with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM allocated as hugepages). However, what I'm trying to point out is that this high success rate comes with high allocation latencies (90th percentile latency of 2206us). On the same system, the same high-order allocations which hit the fast path have latency <5us. > > > > It is also hard to efficiently determine if the current > > > > system state can be easily compacted due to mixing of unmovable > > > > memory. 
> > > > Due to these reasons, automatic background compaction by the
> > > > kernel itself is hard to get right in a way which does not hurt
> > > > unsuspecting applications or waste CPU cycles.
> > >
> > > We do trigger background compaction on a high order pressure from the
> > > page allocator by waking up kcompactd. Why is that not sufficient?
> > >
> > Whenever kcompactd is woken up, it does just enough work to create
> > one free page of the given order (compaction_control.order) or higher.
>
> This is an implementation detail IMHO. I am pretty sure we can do a
> better auto tuning when there is an indication of a constant flow of
> high order requests. This is no different from the memory reclaim in
> principle. Just because the kswapd autotuning is not fitting with your
> particular workload you wouldn't want to export direct reclaim
> functionality and call it from a random module. That is just doomed to
> fail because different subsystems in control just leads to decisions
> going against each other.
>

I don't want to go the route of adding any auto-tuning/prediction code to control compaction in the kernel. I'm more inclined towards extending existing interfaces to allow compaction behavior to be controlled either from userspace or a kernel driver. Letting a random module control compaction or a root process pumping new tunables from sysfs is the same in principle.

This patch is in the spirit of a simple extension to the existing compact_zone_order() which allows either a kernel driver or userspace (through sysfs) to control compaction.

Also, we should avoid drawing hard parallels between reclaim and compaction: the former is often necessary for forward progress while the latter is often an optimization. Since contiguous allocations are mostly optimizations, it's good to expose hooks from the kernel that let the user (through a driver or userspace) control compaction using their own heuristics.

I thought hard about what's lacking in the current userspace interface (sysfs):

 - /proc/sys/vm/compact_memory: full system compaction is not an option
   as a viable proactive compaction strategy.
 - Possibly expose [low, high] threshold values for each node and let
   kcompactd act on them. This was my approach in the original patch I
   linked earlier. The problem here is that it introduces too many
   tunables.

Considering the above, I came up with this callback approach, which makes it trivial to introduce user-specific policies for compaction. It puts the onus of system stability and responsiveness in the hands of the user without burdening admins with more tunables or adding cry
RE: [PATCH] mm: Add callback for defining compaction completion
> -Original Message- > From: owner-linux...@kvack.org On Behalf > Of Michal Hocko > Sent: Tuesday, September 10, 2019 1:19 PM > To: Nitin Gupta > Cc: a...@linux-foundation.org; vba...@suse.cz; > mgor...@techsingularity.net; dan.j.willi...@intel.com; > khalid.a...@oracle.com; Matthew Wilcox ; Yu Zhao > ; Qian Cai ; Andrey Ryabinin > ; Allison Randal ; Mike > Rapoport ; Thomas Gleixner > ; Arun KS ; Wei Yang > ; linux-kernel@vger.kernel.org; linux- > m...@kvack.org > Subject: Re: [PATCH] mm: Add callback for defining compaction completion > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote: > > For some applications we need to allocate almost all memory as > hugepages. > > However, on a running system, higher order allocations can fail if the > > memory is fragmented. Linux kernel currently does on-demand > compaction > > as we request more hugepages but this style of compaction incurs very > > high latency. Experiments with one-time full memory compaction > > (followed by hugepage allocations) shows that kernel is able to > > restore a highly fragmented memory state to a fairly compacted memory > > state within <1 sec for a 32G system. Such data suggests that a more > > proactive compaction can help us allocate a large fraction of memory > > as hugepages keeping allocation latencies low. > > > > In general, compaction can introduce unexpected latencies for > > applications that don't even have strong requirements for contiguous > > allocations. It is also hard to efficiently determine if the current > > system state can be easily compacted due to mixing of unmovable > > memory. Due to these reasons, automatic background compaction by the > > kernel itself is hard to get right in a way which does not hurt unsuspecting > applications or waste CPU cycles. > > We do trigger background compaction on a high order pressure from the > page allocator by waking up kcompactd. Why is that not sufficient? 
>

Whenever kcompactd is woken up, it does just enough work to create one free page of the given order (compaction_control.order) or higher. Such a design causes very high latency for workloads where we want to allocate lots of hugepages in a short period of time. With pro-active compaction we can hide much of this latency. For some more background discussion and data, please see this thread:
https://patchwork.kernel.org/patch/11098289/

> > Even with these caveats, pro-active compaction can still be very
> > useful in certain scenarios to reduce hugepage allocation latencies.
> > This callback interface allows drivers to drive compaction based on
> > their own policies like the current level of external fragmentation
> > for a particular order, system load etc.
>
> So we do not trust the core MM to make a reasonable decision while we give
> a free ticket to modules. How does this make any sense at all? How is a
> random module going to make a more informed decision when it has less
> visibility on the overall MM situation.
>

Embedding any specific policy (like: keep external fragmentation for order-9 between 30-40%) within the MM core looks like a bad idea. As a driver, we can easily measure parameters like system load and the current fragmentation level for any order in any zone to make an informed decision. See the thread I referred to above for more background discussion.

> If you need to control compaction from the userspace you have an interface
> for that. It is also completely unexplained why you need a completion
> callback.
>

/proc/sys/vm/compact_memory does whole-system compaction, which is often too much as a proactive compaction strategy. To get more control over how much compaction work to do, I have added a compaction callback which controls how much work is done in one compaction cycle.

For example, as a test for this patch, I have a small test driver which defines [low, high] external fragmentation thresholds for HPAGE_ORDER. Whenever extfrag is within this range, I run compact_zone_order() with a callback which returns COMPACT_CONTINUE while extfrag > low threshold and returns COMPACT_PARTIAL_SKIPPED when extfrag <= low.

Here's the code for this sample driver:
https://gitlab.com/nigupta/memstress/snippets/1893847

Maybe this code can be added to Documentation/...

Thanks,
Nitin

> > Signed-off-by: Nitin Gupta
> > ---
> >  include/linux/compaction.h | 10 ++
> >  mm/compaction.c            | 20 ++--
> >  mm/internal.h              |  2 ++
> >  3 files changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..1ea828450fa2 100644
> > --- a/include/linux/compaction.
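The sample policy described above can be sketched as follows. The stand-ins here (the trimmed compact_result enum, the opaque struct zone, and extfrag_for_order() as a stub for however the driver measures external fragmentation) are assumptions so the sketch compiles outside the kernel tree:

```c
/* Minimal stand-ins so this sketch compiles outside the kernel tree. */
struct zone;
enum compact_result { COMPACT_PARTIAL_SKIPPED, COMPACT_CONTINUE };

/* [low, high] extfrag thresholds the sample driver exposes via debugfs. */
static int extfrag_low = 25;
static int extfrag_high = 30;

/* Stub: the real driver derives external fragmentation for 'order'
 * from the zone's free_area statistics; here it just returns a value
 * set by the caller so the policy can be exercised. */
static int test_extfrag;
static int extfrag_for_order(struct zone *zone, int order)
{
	(void)zone;
	(void)order;
	return test_extfrag;
}

/* Wakeup check: a compaction cycle starts once extfrag > high. */
static int should_start_compaction(struct zone *zone, int order)
{
	return extfrag_for_order(zone, order) > extfrag_high;
}

/* Policy callback passed to compact_zone_order(): keep compacting
 * until external fragmentation drops to the low threshold. */
static enum compact_result hpage_compact_finished(struct zone *zone, int order)
{
	if (extfrag_for_order(zone, order) > extfrag_low)
		return COMPACT_CONTINUE;
	return COMPACT_PARTIAL_SKIPPED;
}
```

The gap between the two thresholds provides hysteresis: compaction is triggered above 30% fragmentation but runs down to 25%, so the driver does not thrash around a single cutoff.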
[PATCH] mm: Add callback for defining compaction completion
For some applications we need to allocate almost all memory as hugepages. However, on a running system, higher order allocations can fail if the memory is fragmented. Linux kernel currently does on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency. Experiments with one-time full memory compaction (followed by hugepage allocations) show that the kernel is able to restore a highly fragmented memory state to a fairly compacted memory state within <1 sec for a 32G system. Such data suggests that a more proactive compaction can help us allocate a large fraction of memory as hugepages keeping allocation latencies low.

In general, compaction can introduce unexpected latencies for applications that don't even have strong requirements for contiguous allocations. It is also hard to efficiently determine if the current system state can be easily compacted due to mixing of unmovable memory. Due to these reasons, automatic background compaction by the kernel itself is hard to get right in a way which does not hurt unsuspecting applications or waste CPU cycles.

Even with these caveats, pro-active compaction can still be very useful in certain scenarios to reduce hugepage allocation latencies. This callback interface allows drivers to drive compaction based on their own policies, like the current level of external fragmentation for a particular order, system load etc.

Signed-off-by: Nitin Gupta
---
 include/linux/compaction.h | 10 ++
 mm/compaction.c            | 20 ++--
 mm/internal.h              |  2 ++
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..1ea828450fa2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,16 @@ enum compact_result {
 	COMPACT_SUCCESS,
 };
 
+/* Callback function to determine if compaction is finished. */
+typedef enum compact_result (*compact_finished_cb)(
+	struct zone *zone, int order);
+
+enum compact_result compact_zone_order(struct zone *zone, int order,
+		gfp_t gfp_mask, enum compact_priority prio,
+		unsigned int alloc_flags, int classzone_idx,
+		struct page **capture,
+		compact_finished_cb compact_finished_cb);
+
 struct alloc_context; /* in mm/internal.h */
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..73e2e9246bc4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1872,6 +1872,9 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 		return COMPACT_PARTIAL_SKIPPED;
 	}
 
+	if (cc->compact_finished_cb)
+		return cc->compact_finished_cb(cc->zone, cc->order);
+
 	if (is_via_compact_memory(cc->order))
 		return COMPACT_CONTINUE;
 
@@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	return ret;
 }
 
-static enum compact_result compact_zone_order(struct zone *zone, int order,
+enum compact_result compact_zone_order(struct zone *zone, int order,
 		gfp_t gfp_mask, enum compact_priority prio,
 		unsigned int alloc_flags, int classzone_idx,
-		struct page **capture)
+		struct page **capture,
+		compact_finished_cb compact_finished_cb)
 {
 	enum compact_result ret;
 	struct compact_control cc = {
@@ -2293,10 +2297,11 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
-		.direct_compaction = true,
+		.direct_compaction = !compact_finished_cb,
 		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
 		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
-		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
+		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY),
+		.compact_finished_cb = compact_finished_cb
 	};
 	struct capture_control capc = {
 		.cc = &cc,
@@ -2313,11 +2318,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
-	*capture = capc.page;
+	if (capture)
+		*capture = capc.page;
 	current->capture_control = NULL;
 
 	return ret;
 }
+EXPORT_SYMBOL(compact_zone_order);
 
 int sysctl_extfrag_threshold = 500;
 
@@ -2361,7 +2368,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		}
 
 		status = compact_zone_order(zone, order, gfp_mask, prio,
-				allo
Re: [RFC] mm: Proactive compaction
On Mon, 2019-08-26 at 12:47 +0100, Mel Gorman wrote: > On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote: > > > Note that proactive compaction may reduce allocation latency but > > > it is not > > > free either. Even though the scanning and migration may happen in > > > a kernel > > > thread, tasks can incur faults while waiting for compaction to > > > complete if the > > > task accesses data being migrated. This means that costs are > > > incurred by > > > applications on a system that may never care about high-order > > > allocation > > > latency -- particularly if the allocations typically happen at > > > application > > > initialisation time. I recognise that kcompactd makes a bit of > > > effort to > > > compact memory out-of-band but it also is typically triggered in > > > response to > > > reclaim that was triggered by a high-order allocation request. > > > i.e. the work > > > done by the thread is triggered by an allocation request that hit > > > the slow > > > paths and not a preemptive measure. > > > > > > > Hitting the slow path for every higher-order allocation is a > > signification > > performance/latency issue for applications that requires a large > > number of > > these allocations to succeed in bursts. To get some concrete > > numbers, I > > made a small driver that allocates as many hugepages as possible > > and > > measures allocation latency: > > > > Every higher-order allocation does not necessarily hit the slow path > nor > does it incur equal latency. I did not mean *every* hugepage allocation in a literal sense. I meant to say: higher order allocation *tend* to hit slow path with a high probability under reasonably fragmented memory state and when they do, they incur high latency. 
> > > The driver first tries to allocate hugepage using > > GFP_TRANSHUGE_LIGHT > > (referred to as "Light" in the table below) and if that fails, > > tries to > > allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as > > "Fallback" in table below). We stop the allocation loop if both > > methods > > fail. > > > > Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All > > latencies > > are in microsec. > > > > > GFP/Stat |Any | Light | Fallback | > > > : | -: | --: | -: | > > >count | 9908 | 788 | 9120 | > > > min |0.0 | 0.0 | 1726.0 | > > > max | 135387.0 | 142.0 | 135387.0 | > > > mean |5494.66 |1.83 |5969.26 | > > > stddev | 21624.04 |7.58 | 22476.06 | > > Given that it is expected that there would be significant tail > latencies, > it would be better to analyse this in terms of percentiles. A very > small > number of high latency allocations would skew the mean significantly > which is hinted by the stddev. > Here is the same data in terms of percentiles: - with vanilla kernel 5.3.0-rc5: percentile latency –– ––– 5 1 10179 0 251829 301838 401854 5018 71 601890 751924 801945 902 206 952302 - Now with kernel 5.3.0-rc5 + this patch: percentile latency –– ––– 5 3 10 4 25 4 30 4 40 4 50 4 60 4 75 5 80 5 90 9 951 154 > > As you can see, the mean and stddev of allocation is extremely high > > with > > the current approach of on-demand compaction. > > > > The system was fragmented from a userspace program as I described > > in this > > patch description. The workload is mainly anonymous userspace pages > > which > > as easy to move around. I intentionally avoided unmovable pages in > > this > > test to see how much latency do we incur just by hitting the slow > > path for > > a majority of allocations. > > > > Even though, the penalty for proactive compaction is that > applications > that may have no interest in higher-order pages may still stall while > their data is migrated if the data is hot. 
> This is why I think the focus should be on reducing the latency of compaction -- it benefits applications that require higher-order pages without increasing the overhead for unrelated applications.

Sure, reducing compaction latency would help b
Re: [RFC] mm: Proactive compaction
> -Original Message- > From: owner-linux...@kvack.org On Behalf > Of Mel Gorman > Sent: Thursday, August 22, 2019 1:52 AM > To: Nitin Gupta > Cc: a...@linux-foundation.org; vba...@suse.cz; mho...@suse.com; > dan.j.willi...@intel.com; Yu Zhao ; Matthew Wilcox > ; Qian Cai ; Andrey Ryabinin > ; Roman Gushchin ; Greg Kroah- > Hartman ; Kees Cook > ; Jann Horn ; Johannes > Weiner ; Arun KS ; Janne > Huttunen ; Konstantin Khlebnikov > ; linux-kernel@vger.kernel.org; linux- > m...@kvack.org > Subject: Re: [RFC] mm: Proactive compaction > > On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote: > > For some applications we need to allocate almost all memory as > > hugepages. However, on a running system, higher order allocations can > > fail if the memory is fragmented. Linux kernel currently does > > on-demand compaction as we request more hugepages but this style of > > compaction incurs very high latency. Experiments with one-time full > > memory compaction (followed by hugepage allocations) shows that kernel > > is able to restore a highly fragmented memory state to a fairly > > compacted memory state within <1 sec for a 32G system. Such data > > suggests that a more proactive compaction can help us allocate a large > > fraction of memory as hugepages keeping allocation latencies low. > > > > Note that proactive compaction may reduce allocation latency but it is not > free either. Even though the scanning and migration may happen in a kernel > thread, tasks can incur faults while waiting for compaction to complete if the > task accesses data being migrated. This means that costs are incurred by > applications on a system that may never care about high-order allocation > latency -- particularly if the allocations typically happen at application > initialisation time. I recognise that kcompactd makes a bit of effort to > compact memory out-of-band but it also is typically triggered in response to > reclaim that was triggered by a high-order allocation request. 
> i.e. the work done by the thread is triggered by an allocation request that hit the slow paths and not a preemptive measure.

Hitting the slow path for every higher-order allocation is a significant performance/latency issue for applications that require a large number of these allocations to succeed in bursts. To get some concrete numbers, I made a small driver that allocates as many hugepages as possible and measures allocation latency:

The driver first tries to allocate a hugepage using GFP_TRANSHUGE_LIGHT (referred to as "Light" in the table below) and if that fails, tries to allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as "Fallback" in the table below). We stop the allocation loop if both methods fail.

Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies are in microsec.

| GFP/Stat |      Any |  Light | Fallback |
| -------: | -------: | -----: | -------: |
|    count |     9908 |    788 |     9120 |
|      min |      0.0 |    0.0 |   1726.0 |
|      max | 135387.0 |  142.0 | 135387.0 |
|     mean |  5494.66 |   1.83 |  5969.26 |
|   stddev | 21624.04 |   7.58 | 22476.06 |

As you can see, the mean and stddev of allocation latency are extremely high with the current approach of on-demand compaction.

The system was fragmented from a userspace program as I described in this patch description. The workload is mainly anonymous userspace pages, which are easy to move around. I intentionally avoided unmovable pages in this test to see how much latency we incur just by hitting the slow path for a majority of allocations.

> > For a more proactive compaction, the approach taken here is to define per page-order external fragmentation thresholds and let kcompactd threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
>
> These will be difficult to tune for an admin who is not extremely familiar with how external fragmentation is defined.
> If an admin asked "how much will stalls be reduced by setting this to a different value?", the answer will always be "I don't know, maybe some, maybe not".

Yes, this is my main worry. These values can be set to empirically determined values on highly specialized systems like database appliances. However, on a generic system, there is no real reasonable value. Still, at the very least, I would like an interface that allows compacting the system to a reasonable state. Something like:

    compact_extfrag(node, zone, order, high, low)

which starts compaction if extfrag > hi
RE: [RFC] mm: Proactive compaction
> -Original Message- > From: owner-linux...@kvack.org On Behalf > Of Matthew Wilcox > Sent: Tuesday, August 20, 2019 3:21 PM > To: Nitin Gupta > Cc: a...@linux-foundation.org; vba...@suse.cz; > mgor...@techsingularity.net; mho...@suse.com; > dan.j.willi...@intel.com; Yu Zhao ; Qian Cai > ; Andrey Ryabinin ; Roman > Gushchin ; Greg Kroah-Hartman > ; Kees Cook ; Jann > Horn ; Johannes Weiner ; Arun > KS ; Janne Huttunen > ; Konstantin Khlebnikov > ; linux-kernel@vger.kernel.org; linux- > m...@kvack.org > Subject: Re: [RFC] mm: Proactive compaction > > On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote: > > Testing done (on x86): > > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30} > > respectively. > > - Use a test program to fragment memory: the program allocates all > > memory and then for each 2M aligned section, frees 3/4 of base pages > > using munmap. > > - kcompactd0 detects fragmentation for order-9 > extfrag_high and > > starts compaction till extfrag < extfrag_low for order-9. > > Your test program is a good idea, but I worry it may produce unrealistically > optimistic outcomes. Page cache is readily reclaimable, so you're setting up > a situation where 2MB pages can once again be produced. > > How about this: > > One program which creates a file several times the size of memory (or > several files which total the same amount). Then read the file(s). Maybe by > mmap(), and just do nice easy sequential accesses. > > A second program which causes slab allocations. eg > > for (;;) { > for (i = 0; i < n * 1000 * 1000; i++) { > char fname[64]; > > sprintf(fname, "/tmp/missing.%d", i); > open(fname, O_RDWR); > } > } > > The first program should thrash the pagecache, causing pages to > continuously be allocated, reclaimed and freed. The second will create > millions of dentries, causing the slab allocator to allocate a lot of > order-0 pages which are harder to free. 
> If you really want to make it work hard, mix in opening some files which actually exist, preventing the pages which contain those dentries from being evicted.
>
> This feels like it's simulating a more normal workload than your test. What do you think?

This combination of workloads for mixing movable and unmovable pages sounds good. I coded up these two and here's what I observed:
- kernel: 5.3.0-rc5 + this patch, x86_64, 32G RAM.
- Set extfrag_{low,high} = {25,30} for order-9
- Run pagecache and dentry thrash test programs as you described
- for pagecache test: mmap and sequentially read a 128G file on a 32G system.
- for dentry test: set n=100. I created /tmp/missing.[0-1] so these dentries stay allocated.
- Start a linux kernel compile for further pagecache thrashing.

With the above workload, fragmentation for order-9 stayed at 80-90%, which kept kcompactd0 working but it couldn't make progress due to unmovable pages from dentries. As expected, we keep hitting compaction_deferred() as compaction attempts fail.

After a manual `echo 3 > /proc/sys/vm/drop_caches` and stopping the dentry thrasher, kcompactd succeeded in bringing extfrag below the set thresholds.

With unmovable pages spread across memory, there is little compaction can do. Maybe we should have a knob like 'compactness' (like swappiness) which defines how aggressive compaction can be. For high values, maybe allow freeing dentries too? This way hugepage-sensitive applications can trade with higher I/O latencies.

Thanks,
Nitin
RE: [RFC] mm: Proactive compaction
> -Original Message- > From: Vlastimil Babka > Sent: Tuesday, August 20, 2019 1:46 AM > To: Nitin Gupta ; a...@linux-foundation.org; > mgor...@techsingularity.net; mho...@suse.com; > dan.j.willi...@intel.com > Cc: Yu Zhao ; Matthew Wilcox ; > Qian Cai ; Andrey Ryabinin ; Roman > Gushchin ; Greg Kroah-Hartman > ; Kees Cook ; Jann > Horn ; Johannes Weiner ; Arun > KS ; Janne Huttunen > ; Konstantin Khlebnikov > ; linux-kernel@vger.kernel.org; linux- > m...@kvack.org; Khalid Aziz > Subject: Re: [RFC] mm: Proactive compaction > > +CC Khalid Aziz who proposed a different approach: > https://lore.kernel.org/linux-mm/20190813014012.30232-1- > khalid.a...@oracle.com/T/#u > > On 8/16/19 11:43 PM, Nitin Gupta wrote: > > For some applications we need to allocate almost all memory as > > hugepages. However, on a running system, higher order allocations can > > fail if the memory is fragmented. Linux kernel currently does > > on-demand compaction as we request more hugepages but this style of > > compaction incurs very high latency. Experiments with one-time full > > memory compaction (followed by hugepage allocations) shows that kernel > > is able to restore a highly fragmented memory state to a fairly > > compacted memory state within <1 sec for a 32G system. Such data > > suggests that a more proactive compaction can help us allocate a large > > fraction of memory as hugepages keeping allocation latencies low. > > > > For a more proactive compaction, the approach taken here is to define > > per page-order external fragmentation thresholds and let kcompactd > > threads act on these thresholds. 
> > > > The low and high thresholds are defined per page-order and exposed > > through sysfs: > > > > /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high} > > > > Per-node kcompactd thread is woken up every few seconds to check if > > any zone on its node has extfrag above the extfrag_high threshold for > > any order, in which case the thread starts compaction in the backgrond > > till all zones are below extfrag_low level for all orders. By default > > both these thresolds are set to 100 for all orders which essentially > > disables kcompactd. > > Could you define what exactly extfrag is, in the changelog? > extfrag for order-n = ((total free pages) - (free pages for order >= n)) / (total free pages) * 100; I will add this to v2 changelog. > > To avoid wasting CPU cycles when compaction cannot help, such as when > > memory is full, we check both, extfrag > extfrag_high and > > compaction_suitable(zone). This allows kcomapctd thread to stays > > inactive even if extfrag thresholds are not met. > > How does it translate to e.g. the number of free pages of order? > Watermarks are checked as follows (see: __compaction_suitable) watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ? low_wmark_pages(zone) : min_wmark_pages(zone); If a zone does not satisfy this watermark, we don't start compaction. > > This patch is largely based on ideas from Michal Hocko posted here: > > https://lore.kernel.org/linux- > mm/20161230131412.gi13...@dhcp22.suse.cz > > / > > > > Testing done (on x86): > > - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30} > > respectively. > > - Use a test program to fragment memory: the program allocates all > > memory and then for each 2M aligned section, frees 3/4 of base pages > > using munmap. > > - kcompactd0 detects fragmentation for order-9 > extfrag_high and > > starts compaction till extfrag < extfrag_low for order-9. 
> > > > The patch has plenty of rough edges but posting it early to see if I'm > > going in the right direction and to get some early feedback. > > That's a lot of control knobs - how is an admin supposed to tune them to > their needs? I expect that a workload would typically care for just a particular page order (say, order-9 on x86 for the default hugepage size). An admin can set extfrag_{low,high} for just that order (say, low=25, high=30) and leave these thresholds to their default value (low=100, high=100) for all other orders. Thanks, Nitin > > (keeping the rest for reference) > > > Signed-off-by: Nitin Gupta > > --- > > include/linux/compaction.h | 12 ++ > > mm/compaction.c| 250 ++--- > > mm/vmstat.c| 12 ++ > > 3 files changed, 228 insertions(+), 46 deletions(-) > > > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h > > index 9569e7c786d3..2
[RFC] mm: Proactive compaction
For some applications we need to allocate almost all memory as hugepages. However, on a running system, higher-order allocations can fail if the memory is fragmented. The Linux kernel currently does on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency. Experiments with one-time full memory compaction (followed by hugepage allocations) show that the kernel is able to restore a highly fragmented memory state to a fairly compacted memory state within <1 sec for a 32G system. Such data suggests that a more proactive compaction can help us allocate a large fraction of memory as hugepages while keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define per page-order external fragmentation thresholds and let kcompactd threads act on these thresholds.

The low and high thresholds are defined per page-order and exposed through sysfs:

  /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}

A per-node kcompactd thread is woken up every few seconds to check if any zone on its node has extfrag above the extfrag_high threshold for any order, in which case the thread starts compaction in the background till all zones are below the extfrag_low level for all orders. By default both these thresholds are set to 100 for all orders, which essentially disables kcompactd.

To avoid wasting CPU cycles when compaction cannot help, such as when memory is full, we check both extfrag > extfrag_high and compaction_suitable(zone). This allows the kcompactd thread to stay inactive even when extfrag thresholds are exceeded.

This patch is largely based on ideas from Michal Hocko posted here:
https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/

Testing done (on x86):
- Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30} respectively.
- Use a test program to fragment memory: the program allocates all memory and then for each 2M aligned section, frees 3/4 of base pages using munmap.
- kcompactd0 detects fragmentation for order-9 > extfrag_high and starts compaction till extfrag < extfrag_low for order-9. The patch has plenty of rough edges but posting it early to see if I'm going in the right direction and to get some early feedback. Signed-off-by: Nitin Gupta --- include/linux/compaction.h | 12 ++ mm/compaction.c| 250 ++--- mm/vmstat.c| 12 ++ 3 files changed, 228 insertions(+), 46 deletions(-) diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 9569e7c786d3..26bfedbbc64b 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -60,6 +60,17 @@ enum compact_result { struct alloc_context; /* in mm/internal.h */ +// "order-%d" +#define COMPACTION_ORDER_STATE_NAME_LEN 16 +// Per-order compaction state +struct compaction_order_state { + unsigned int order; + unsigned int extfrag_low; + unsigned int extfrag_high; + unsigned int extfrag_curr; + char name[COMPACTION_ORDER_STATE_NAME_LEN]; +}; + /* * Number of free order-0 pages that should be available above given watermark * to make sure compaction has reasonable chance of not running out of free @@ -90,6 +101,7 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write, extern int sysctl_extfrag_threshold; extern int sysctl_compact_unevictable_allowed; +extern int extfrag_for_order(struct zone *zone, unsigned int order); extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, diff --git a/mm/compaction.c b/mm/compaction.c index 952dc2fb24e5..21866b1ad249 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -25,6 +25,10 @@ #include #include "internal.h" +#ifdef CONFIG_COMPACTION +struct compaction_order_state compaction_order_states[MAX_ORDER+1]; +#endif + #ifdef CONFIG_COMPACTION static inline void count_compact_event(enum vm_event_item item) { @@ -1846,6 +1850,49 @@ static inline bool is_via_compact_memory(int order) 
return order == -1; } +static int extfrag_wmark_high(struct zone *zone) +{ + int order; + + for (order = 1; order <= MAX_ORDER; order++) { + int extfrag = extfrag_for_order(zone, order); + int threshold = compaction_order_states[order].extfrag_high; + + if (extfrag > threshold) + return order; + } + return 0; +} + +static bool node_should_compact(pg_data_t *pgdat) +{ + struct zone *zone; + + for_each_populated_zone(zone) { + int order = extfrag_wmark_high(zone); + + if (order && compaction_suitable(zone, order, + 0, zone_idx(zone)) == COMPACT_CONTINUE) { + return true; +
Re: [PATCH v2] mm: Reduce memory bloat with THP
On 01/25/2018 01:13 PM, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
>>>> It's not really about memory scarcity but a more efficient use of it. Applications may want hugepage benefits without requiring any changes to app code which is what THP is supposed to provide, while still avoiding memory bloat.
>>>>
>>> I read these links and find that there are mainly two complaints:
>>> 1. THP causes latency spikes, because direct compaction slows down THP allocation,
>>> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than THP size and fails because of THP.
>>>
>>> The first complaint is not related to this patch.
>>
>> I'm trying to address many different THP issues and memory bloat is first among them.
>
> Expecting userspace to get this right is probably going to go sideways. It'll be screwed up and be sub-optimal or have odd semantics for existing madvise flags. The fact is that an application may not even know if it's going to be sparsely using memory in advance if it's a computation load modelling from unknown input data.
>
> I suggest you read the old Talluri paper "Surpassing the TLB Performance of Superpages with Less Operating System Support" and pay attention to Section 4. There it discusses a page reservation scheme whereby on fault a naturally aligned set of base pages is reserved and only one correctly placed base page is inserted into the faulting address. It was tied into a hypothetical piece of hardware that doesn't exist to give best-effort support for superpages so it does not directly help you but the initial idea is sound. There are holes in the paper from today's perspective but it was written in the 90's.
>
> From there, read "Transparent operating system support for superpages" by Navarro, particularly chapter 4, paying attention to the parts where it talks about opportunism and promotion threshold.
> Superficially, it goes like this
>
> 1. On fault, reserve a THP in the allocator and use one base page that is correctly-aligned for the faulting address. By correctly-aligned, I mean that you use a base page whose offset would be naturally contiguous if it ever was part of a huge page.
> 2. On subsequent faults, attempt to use a base page that is naturally aligned to be a THP.
> 3. When a "threshold" of base pages are inserted, allocate the remaining pages and promote it to a THP.
> 4. If there is memory pressure, spill "reserved" pages into the main allocation pool and lose the opportunity to promote (which will need khugepaged to recover).
>
> By definition, a promotion threshold of 1 would be the existing scheme of allocating a THP on the first fault and some users will want that. It also should be the default to avoid unexpected overhead. For workloads where memory is being sparsely addressed and the increased overhead of THP is unwelcome then the threshold should be tuned higher with a maximum possible value of HPAGE_PMD_NR.
>
> It's non-trivial to do this because at minimum a page fault has to check if there is a potential promotion candidate by checking the PTEs around the faulting address searching for a correctly-aligned base page that is already inserted. If there is, then check if the correctly aligned base page for the current faulting address is free and if so use it. It'll also then need to check the remaining PTEs to see if both the promotion threshold has been reached and if so, promote it to a THP (or else teach khugepaged to do an in-place promotion if possible). In other words, implementing the promotion threshold is both hard and it's not free.
>
> However, if it did exist then the only tunable would be the "promotion threshold" and applications would not need any special awareness of their address space.
I went through both references you mentioned and I really like the idea of reservation-based hugepage allocation. Navarro also extends the idea to allow multiple hugepage sizes to be used (as supported by the underlying hardware), which was next in order of what I wanted to do in THP. So, please ignore this patch and I will work towards implementing the ideas in these papers.

Thanks for the feedback.
Nitin
Re: [PATCH v2] mm: Reduce memory bloat with THP
On 01/24/2018 04:47 PM, Zi Yan wrote: With this change, whenever an application issues MADV_DONTNEED on a memory region, the region is marked as "space-efficient". For such regions, a hugepage is not immediately allocated on first write. >>> Kirill didn't like it in the previous version and I do not like this >>> either. You are adding a very subtle side effect which might completely >>> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED >>> to free up unused memory. Now you have put it out of THP usage >>> basically. >>> >> Userpsace may want a region to be considered by khugepaged while opting >> out of hugepage allocation on first touch. Asking userspace memory >> allocators to have to track and reclaim unused parts of a THP allocated >> hugepage does not seems right, as the kernel can use simple userspace >> hints to avoid allocating extra memory in the first place. >> >> I agree that this patch is adding a subtle side-effect which may take >> some applications by surprise. However, I often see the opposite too: >> for many workloads, disabling THP is the first advise as this aggressive >> allocation of hugepages on first touch is unexpected and is too >> wasteful. For e.g.: >> >> 1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB) >> http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/ >> >> 2) Disable THP on MongoDB >> https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/ >> >> 3) Disable THP for Couchbase Server >> https://blog.couchbase.com/often-overlooked-linux-os-tweaks/ >> >> 4) Redis >> http://antirez.com/news/84 >> >> >>> If the memory is used really scarce then we have MADV_NOHUGEPAGE. >>> >> It's not really about memory scarcity but a more efficient use of it. >> Applications may want hugepage benefits without requiring any changes to >> app code which is what THP is supposed to provide, while still avoiding >> memory bloat. 
> I read these links and find that there are mainly two complaints:
> 1. THP causes latency spikes, because direct compaction slows down THP allocation,
> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return memory ranges smaller than THP size and fails because of THP.
>
> The first complaint is not related to this patch.

I'm trying to address many different THP issues and memory bloat is first among them.

> For the second one, at least with recent kernels, MADV_DONTNEED splits THPs and returns the memory range you specified in madvise(). Am I missing anything?

Yes, MADV_DONTNEED splits THPs and releases the requested range, but this does not solve the issue of the aggressive alloc-hugepage-on-first-touch policy of THP=madvise on MADV_HUGEPAGE regions. Sure, some workloads may prefer that policy, but for applications that don't, this patch gives them an option to hint the kernel to go for gradual hugepage promotion via khugepaged only (and not on first touch).

It's not good if an application has to track which parts of their (implicitly allocated) hugepage are in use and which sub-parts are free so they can issue MADV_DONTNEED calls on them. This approach really does not make THP "transparent" and requires a lot of mm tracking code in userspace.

Nitin
Re: [PATCH v2] mm: Reduce memory bloat with THP
On 1/19/18 4:49 AM, Michal Hocko wrote: > On Thu 18-01-18 15:33:16, Nitin Gupta wrote: >> From: Nitin Gupta >> >> Currently, if the THP enabled policy is "always", or the mode >> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage >> is allocated on a page fault if the pud or pmd is empty. This >> yields the best VA translation performance, but increases memory >> consumption if some small page ranges within the huge page are >> never accessed. > > Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always > users. > Yes, allocating hugepage on first touch is the current behavior for above two cases. However, I see issues with this current behavior. Firstly, THP=always mode is often too aggressive/wasteful to be useful for any realistic workloads. For THP=madvise, users may want to back active parts of memory region with hugepages while avoiding aggressive hugepage allocation on first touch. Or, they may really want the current behavior. With this patch, users would have the option to pick what behavior they want by passing hints to the kernel in the form of MADV_HUGEPAGE and MADV_DONTNEED madvise calls. >> An alternate behavior for such page faults is to install a >> hugepage only when a region is actually found to be (almost) >> fully mapped and active. This is a compromise between >> translation performance and memory consumption. Currently there >> is no way for an application to choose this compromise for the >> page fault conditions above. > > Is that really true? We have > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none > This is not reflected during the PF of course but you can control the > behavior there as well. Either by the global setting or a per proces > prctl. > I think this part of patch description needs some rewording. This patch is to change *only* the page fault behavior. Once pages are installed, khugepaged does its job as usual, using max_ptes_none and other config values. 
I'm not trying to change any khugepaged behavior here.

>> With this change, whenever an application issues MADV_DONTNEED on a memory region, the region is marked as "space-efficient". For such regions, a hugepage is not immediately allocated on first write.
>
> Kirill didn't like it in the previous version and I do not like this either. You are adding a very subtle side effect which might be completely unexpected. Consider a userspace memory allocator which uses MADV_DONTNEED to free up unused memory. Now you have put it out of THP usage basically.

Userspace may want a region to be considered by khugepaged while opting out of hugepage allocation on first touch. Asking userspace memory allocators to have to track and reclaim unused parts of a THP-allocated hugepage does not seem right, as the kernel can use simple userspace hints to avoid allocating extra memory in the first place.

I agree that this patch is adding a subtle side-effect which may take some applications by surprise. However, I often see the opposite too: for many workloads, disabling THP is the first advice as this aggressive allocation of hugepages on first touch is unexpected and is too wasteful. For example:

1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/

2) Disable THP on MongoDB
https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

3) Disable THP for Couchbase Server
https://blog.couchbase.com/often-overlooked-linux-os-tweaks/

4) Redis
http://antirez.com/news/84

> If the memory is really scarce then we have MADV_NOHUGEPAGE.

It's not really about memory scarcity but a more efficient use of it. Applications may want hugepage benefits without requiring any changes to app code which is what THP is supposed to provide, while still avoiding memory bloat.

-Nitin
Re: [PATCH] mm: Reduce memory bloat with THP
On 12/15/17 2:01 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 14, 2017 at 05:28:52PM -0800, Nitin Gupta wrote:
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 751e97a..b2ec07b 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -508,6 +508,7 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>> 		unsigned long start, unsigned long end)
>> {
>> 	zap_page_range(vma, start, end - start);
>> +	vma->space_efficient = true;
>> 	return 0;
>> }
>
> And this modifies vma without down_write(mmap_sem).

I thought this function was always called with mmap_sem write-locked. I will check again.

- Nitin
Re: [PATCH] mm: Reduce memory bloat with THP
On 12/15/17 2:00 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 14, 2017 at 05:28:52PM -0800, Nitin Gupta wrote:
>> Currently, if the THP enabled policy is "always", or the mode is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage is allocated on a page fault if the pud or pmd is empty. This yields the best VA translation performance, but increases memory consumption if some small page ranges within the huge page are never accessed.
>>
>> An alternate behavior for such page faults is to install a hugepage only when a region is actually found to be (almost) fully mapped and active. This is a compromise between translation performance and memory consumption. Currently there is no way for an application to choose this compromise for the page fault conditions above.
>>
>> With this change, when an application issues MADV_DONTNEED on a memory region, the region is marked as "space-efficient". For such regions, a hugepage is not immediately allocated on first write. Instead, it is left to the khugepaged thread to do delayed hugepage promotion depending on whether the region is actually mapped and active. When the application issues MADV_HUGEPAGE, the region is marked again as non-space-efficient, wherein a hugepage is allocated on first touch.
>
> I think this would be NAK. At least in this form.
>
> What performance testing have you done? Any numbers?

I wrote a throw-away program which mmaps a 128G area and writes to a random address in a loop. Together with the writes, madvise(MADV_DONTNEED) calls are issued at other random addresses. Writes are issued with 70% probability and DONTNEED with 30%. With this test, I'm trying to emulate the workload of a large in-memory hash table. With the patch, I see that memory bloat is much less severe.
I've uploaded the test program with the memory usage plot here:
https://gist.github.com/nitingupta910/42ddf969e17556d74a14fbd84640ddb3

THP was set to 'always' mode in both cases but the result would be the same if madvise mode was used instead.

> Making the whole vma "space_efficient" just because somebody freed one page from it is just wrong. And there's no way back after this.

I'm using MADV_DONTNEED as a hint that the user wants to transparently use hugepages while at the same time being more conservative with respect to memory usage. If MADV_HUGEPAGE is issued for a VMA range after any DONTNEEDs, the space_efficient bit is cleared again, so we revert to allocating a hugepage on fault on an empty pud/pmd.

>> Orabug: 26910556
>
> Wat?

It's an Oracle-internal identifier used to track this work.

Thanks,
Nitin
[PATCH] sparc64: Fix page table walk for PUD hugepages
For a PUD hugepage entry, we need to propagate bits [32:22] from the
virtual address to resolve at 4M granularity. However, the current code
was incorrectly propagating bits [29:19]. This bug can cause incorrect
data to be returned for pages backed with 16G hugepages.

Signed-off-by: Nitin Gupta
Reported-by: Al Viro
Cc: Al Viro

diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index acf55063aa3d..ca0de1646f1e 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -216,7 +216,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
 	 sllx		REG2, 32, REG2;			\
 	 andcc		REG1, REG2, %g0;		\
 	 be,pt		%xcc, 700f;			\
-	  sethi		%hi(0x1ffc), REG2;		\
+	  sethi		%hi(0xffe0), REG2;		\
 	 sllx		REG2, 1, REG2;			\
 	 brgez,pn	REG1, FAIL_LABEL;		\
 	 andn		REG1, REG2, REG1;		\
--
2.13.1
Re: [PATCH 1/4] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
On Sun, Oct 22, 2017 at 8:10 PM, Minchan Kim wrote:
> On Fri, Oct 20, 2017 at 10:59:31PM +0300, Kirill A. Shutemov wrote:
>> With boot-time switching between paging mode we will have variable
>> MAX_PHYSMEM_BITS.
>>
>> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
>> configuration to define zsmalloc data structures.
>>
>> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
>> It also suits well to handle PAE special case.
>>
>> Signed-off-by: Kirill A. Shutemov
>> Cc: Minchan Kim
>> Cc: Nitin Gupta
>> Cc: Sergey Senozhatsky
>
> Acked-by: Minchan Kim
>
> Nitin:
>
> I think this patch works and it would be best for Kirill to be able to do.
> So if you have better idea to clean it up, let's make it as another patch
> regardless of this patch series.

I was looking into dynamically allocating the size_class array to avoid
that compile error, but yes, that can be done in a future patch.

So, for this patch:

Reviewed-by: Nitin Gupta
Re: [PATCH 1/4] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
On Fri, Oct 20, 2017 at 12:59 PM, Kirill A. Shutemov wrote:
> With boot-time switching between paging mode we will have variable
> MAX_PHYSMEM_BITS.
>
> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
> configuration to define zsmalloc data structures.
>
> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
> It also suits well to handle PAE special case.

I see that with your upcoming patch, MAX_PHYSMEM_BITS is turned into a
variable for the x86_64 case as: (pgtable_l5_enabled ? 52 : 46). Even
with this change, I don't see a need for the new
MAX_POSSIBLE_PHYSMEM_BITS constant.

> -#ifndef MAX_PHYSMEM_BITS
> -#ifdef CONFIG_HIGHMEM64G
> -#define MAX_PHYSMEM_BITS 36
> -#else /* !CONFIG_HIGHMEM64G */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> +#ifdef MAX_PHYSMEM_BITS
> +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> +#else

This ifdef on HIGHMEM64G is redundant, as x86 already defines
MAX_PHYSMEM_BITS = 36 in the PAE case. So, all that zsmalloc should do
is:

#ifndef MAX_PHYSMEM_BITS
#define MAX_PHYSMEM_BITS BITS_PER_LONG
#endif

... and then no change is needed for the rest of the derived constants,
like _PFN_BITS. It is up to every arch to define a correct
MAX_PHYSMEM_BITS (variable or constant) based on whatever configurations
the arch supports. If not defined, zsmalloc picks a reasonable default
of BITS_PER_LONG.

I will send a patch which makes the change to remove the ifdef on
CONFIG_HIGHMEM64G.

Thanks,
Nitin
Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
On Mon, Oct 16, 2017 at 7:44 AM, Kirill A. Shutemov wrote: > On Fri, Oct 13, 2017 at 05:00:12PM -0700, Nitin Gupta wrote: >> On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov >> wrote: >> > With boot-time switching between paging mode we will have variable >> > MAX_PHYSMEM_BITS. >> > >> > Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y >> > configuration to define zsmalloc data structures. >> > >> > The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case. >> > It also suits well to handle PAE special case. >> > >> > Signed-off-by: Kirill A. Shutemov >> > Cc: Minchan Kim >> > Cc: Nitin Gupta >> > Cc: Sergey Senozhatsky >> > --- >> > arch/x86/include/asm/pgtable-3level_types.h | 1 + >> > arch/x86/include/asm/pgtable_64_types.h | 2 ++ >> > mm/zsmalloc.c | 13 +++-- >> > 3 files changed, 10 insertions(+), 6 deletions(-) >> > >> > diff --git a/arch/x86/include/asm/pgtable-3level_types.h >> > b/arch/x86/include/asm/pgtable-3level_types.h >> > index b8a4341faafa..3fe1d107a875 100644 >> > --- a/arch/x86/include/asm/pgtable-3level_types.h >> > +++ b/arch/x86/include/asm/pgtable-3level_types.h >> > @@ -43,5 +43,6 @@ typedef union { >> > */ >> > #define PTRS_PER_PTE 512 >> > >> > +#define MAX_POSSIBLE_PHYSMEM_BITS 36 >> > >> > #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */ >> > diff --git a/arch/x86/include/asm/pgtable_64_types.h >> > b/arch/x86/include/asm/pgtable_64_types.h >> > index 06470da156ba..39075df30b8a 100644 >> > --- a/arch/x86/include/asm/pgtable_64_types.h >> > +++ b/arch/x86/include/asm/pgtable_64_types.h >> > @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t; >> > #define P4D_SIZE (_AC(1, UL) << P4D_SHIFT) >> > #define P4D_MASK (~(P4D_SIZE - 1)) >> > >> > +#define MAX_POSSIBLE_PHYSMEM_BITS 52 >> > + >> > #else /* CONFIG_X86_5LEVEL */ >> > >> > /* >> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c >> > index 7c38e850a8fc..7bde01c55c90 100644 >> > --- a/mm/zsmalloc.c >> > +++ b/mm/zsmalloc.c >> > @@ -82,18 +82,19 @@ >> > * This is 
>> > made more complicated by various memory models and PAE.
>> >  */
>> >
>> > -#ifndef MAX_PHYSMEM_BITS
>> > -#ifdef CONFIG_HIGHMEM64G
>> > -#define MAX_PHYSMEM_BITS 36
>> > -#else /* !CONFIG_HIGHMEM64G */
>> > +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
>> > +#ifdef MAX_PHYSMEM_BITS
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
>> > +#else
>> >  /*
>> >  * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
>> >  * be PAGE_SHIFT
>> >  */
>> > -#define MAX_PHYSMEM_BITS BITS_PER_LONG
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>> > #endif
>> > #endif
>> > -#define _PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
>> > +
>> > +#define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>> >
>>
>> I think we can avoid using this new constant in zsmalloc.
>>
>> The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
>> bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
>> for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
>> would remain 32 bytes.
>>
>> So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
>> thus OBJ_INDEX_BITS = PAGE_SHIFT.
>
> As you understand the topic better than me, could you prepare the patch?

Actually, no changes are necessary. As long as physical address bits <=
BITS_PER_LONG, setting _PFN_BITS to the most conservative value of
BITS_PER_LONG is fine. AFAIK, this condition does not hold on x86 PAE,
where PA bits (36) > BITS_PER_LONG (32), so only that case needs special
handling to make sure PFN bits are not lost when encoding an allocated
object's location in an unsigned long.

Thanks,
Nitin
Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov wrote: > With boot-time switching between paging mode we will have variable > MAX_PHYSMEM_BITS. > > Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y > configuration to define zsmalloc data structures. > > The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case. > It also suits well to handle PAE special case. > > Signed-off-by: Kirill A. Shutemov > Cc: Minchan Kim > Cc: Nitin Gupta > Cc: Sergey Senozhatsky > --- > arch/x86/include/asm/pgtable-3level_types.h | 1 + > arch/x86/include/asm/pgtable_64_types.h | 2 ++ > mm/zsmalloc.c | 13 +++-- > 3 files changed, 10 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/include/asm/pgtable-3level_types.h > b/arch/x86/include/asm/pgtable-3level_types.h > index b8a4341faafa..3fe1d107a875 100644 > --- a/arch/x86/include/asm/pgtable-3level_types.h > +++ b/arch/x86/include/asm/pgtable-3level_types.h > @@ -43,5 +43,6 @@ typedef union { > */ > #define PTRS_PER_PTE 512 > > +#define MAX_POSSIBLE_PHYSMEM_BITS 36 > > #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */ > diff --git a/arch/x86/include/asm/pgtable_64_types.h > b/arch/x86/include/asm/pgtable_64_types.h > index 06470da156ba..39075df30b8a 100644 > --- a/arch/x86/include/asm/pgtable_64_types.h > +++ b/arch/x86/include/asm/pgtable_64_types.h > @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t; > #define P4D_SIZE (_AC(1, UL) << P4D_SHIFT) > #define P4D_MASK (~(P4D_SIZE - 1)) > > +#define MAX_POSSIBLE_PHYSMEM_BITS 52 > + > #else /* CONFIG_X86_5LEVEL */ > > /* > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c > index 7c38e850a8fc..7bde01c55c90 100644 > --- a/mm/zsmalloc.c > +++ b/mm/zsmalloc.c > @@ -82,18 +82,19 @@ > * This is made more complicated by various memory models and PAE. 
> */ > > -#ifndef MAX_PHYSMEM_BITS > -#ifdef CONFIG_HIGHMEM64G > -#define MAX_PHYSMEM_BITS 36 > -#else /* !CONFIG_HIGHMEM64G */ > +#ifndef MAX_POSSIBLE_PHYSMEM_BITS > +#ifdef MAX_PHYSMEM_BITS > +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS > +#else > /* > * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just > * be PAGE_SHIFT > */ > -#define MAX_PHYSMEM_BITS BITS_PER_LONG > +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG > #endif > #endif > -#define _PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT) > + > +#define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) > I think we can avoid using this new constant in zsmalloc. The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However, for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size would remain 32 bytes. So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and thus OBJ_INDEX_BITS = PAGE_SHIFT. - Nitin
[PATCH v6 3/3] sparc64: Cleanup hugepage table walk functions
Flatten out nested code structure in huge_pte_offset() and huge_pte_alloc(). Signed-off-by: Nitin Gupta --- arch/sparc/mm/hugetlbpage.c | 54 + 1 file changed, 20 insertions(+), 34 deletions(-) diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index 7acb84d..bcd8cdb 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, pgd_t *pgd; pud_t *pud; pmd_t *pmd; - pte_t *pte = NULL; pgd = pgd_offset(mm, addr); pud = pud_alloc(mm, pgd, addr); if (!pud) return NULL; - if (sz >= PUD_SIZE) - pte = (pte_t *)pud; - else { - pmd = pmd_alloc(mm, pud, addr); - if (!pmd) - return NULL; - - if (sz >= PMD_SIZE) - pte = (pte_t *)pmd; - else - pte = pte_alloc_map(mm, pmd, addr); - } - - return pte; + return (pte_t *)pud; + pmd = pmd_alloc(mm, pud, addr); + if (!pmd) + return NULL; + if (sz >= PMD_SIZE) + return (pte_t *)pmd; + return pte_alloc_map(mm, pmd, addr); } pte_t *huge_pte_offset(struct mm_struct *mm, @@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, pgd_t *pgd; pud_t *pud; pmd_t *pmd; - pte_t *pte = NULL; pgd = pgd_offset(mm, addr); - if (!pgd_none(*pgd)) { - pud = pud_offset(pgd, addr); - if (!pud_none(*pud)) { - if (is_hugetlb_pud(*pud)) - pte = (pte_t *)pud; - else { - pmd = pmd_offset(pud, addr); - if (!pmd_none(*pmd)) { - if (is_hugetlb_pmd(*pmd)) - pte = (pte_t *)pmd; - else - pte = pte_offset_map(pmd, addr); - } - } - } - } - - return pte; + if (pgd_none(*pgd)) + return NULL; + pud = pud_offset(pgd, addr); + if (pud_none(*pud)) + return NULL; + if (is_hugetlb_pud(*pud)) + return (pte_t *)pud; + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return NULL; + if (is_hugetlb_pmd(*pmd)) + return (pte_t *)pmd; + return pte_offset_map(pmd, addr); } void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, -- 2.9.2
[PATCH v6 1/3] sparc64: Support huge PUD case in get_user_pages
get_user_pages() is used to do direct IO. It already handles the case where the address range is backed by PMD huge pages. This patch now adds the case where the range could be backed by PUD huge pages. Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/pgtable_64.h | 15 +++-- arch/sparc/mm/gup.c | 45 - 2 files changed, 57 insertions(+), 3 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 6fbd931..2579f5a 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd) return pte_write(pte); } +#define pud_write(pud) pte_write(__pte(pud_val(pud))) + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline unsigned long pmd_dirty(pmd_t pmd) { @@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd) return ((unsigned long) __va(pfn << PAGE_SHIFT)); } + +static inline unsigned long pud_page_vaddr(pud_t pud) +{ + pte_t pte = __pte(pud_val(pud)); + unsigned long pfn; + + pfn = pte_pfn(pte); + + return ((unsigned long) __va(pfn << PAGE_SHIFT)); +} + #define pmd_page(pmd) virt_to_page((void *)__pmd_page(pmd)) -#define pud_page_vaddr(pud)\ - ((unsigned long) __va(pud_val(pud))) #define pud_page(pud) virt_to_page((void *)pud_page_vaddr(pud)) #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL) #define pud_present(pud) (pud_val(pud) != 0U) diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c index f80cfc6..d809099 100644 --- a/arch/sparc/mm/gup.c +++ b/arch/sparc/mm/gup.c @@ -103,6 +103,45 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, return 1; } +static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr, + unsigned long end, int write, struct page **pages, + int *nr) +{ + struct page *head, *page; + int refs; + + if (!(pud_val(pud) & _PAGE_VALID)) + return 0; + + if (write && !pud_write(pud)) + return 0; + + refs = 0; + page = pud_page(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); + head = 
compound_head(page); + do { + VM_BUG_ON(compound_head(page) != head); + pages[*nr] = page; + (*nr)++; + page++; + refs++; + } while (addr += PAGE_SIZE, addr != end); + + if (!page_cache_add_speculative(head, refs)) { + *nr -= refs; + return 0; + } + + if (unlikely(pud_val(pud) != pud_val(*pudp))) { + *nr -= refs; + while (refs--) + put_page(head); + return 0; + } + + return 1; +} + static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -141,7 +180,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, next = pud_addr_end(addr, end); if (pud_none(pud)) return 0; - if (!gup_pmd_range(pud, addr, next, write, pages, nr)) + if (unlikely(pud_large(pud))) { + if (!gup_huge_pud(pudp, pud, addr, next, + write, pages, nr)) + return 0; + } else if (!gup_pmd_range(pud, addr, next, write, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); -- 2.9.2
[PATCH v6 2/3] sparc64: Add 16GB hugepage support
Adds support for 16GB hugepage size. To use this page size use kernel parameters as: default_hugepagesz=16G hugepagesz=16G hugepages=10 Testing: Tested with the stream benchmark which allocates 48G of arrays backed by 16G hugepages and does RW operation on them in parallel. Orabug: 25362942 Cc: Anthony Yznaga Reviewed-by: Bob Picco Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/hugetlb.h| 7 arch/sparc/include/asm/page_64.h| 3 +- arch/sparc/include/asm/pgtable_64.h | 5 +++ arch/sparc/include/asm/tsb.h| 36 ++ arch/sparc/kernel/head_64.S | 2 +- arch/sparc/kernel/tsb.S | 2 +- arch/sparc/kernel/vmlinux.lds.S | 5 +++ arch/sparc/mm/hugetlbpage.c | 74 ++--- arch/sparc/mm/init_64.c | 54 +++ 9 files changed, 157 insertions(+), 31 deletions(-) diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h index d1f837d..0ca7caa 100644 --- a/arch/sparc/include/asm/hugetlb.h +++ b/arch/sparc/include/asm/hugetlb.h @@ -4,6 +4,13 @@ #include #include +#ifdef CONFIG_HUGETLB_PAGE +struct pud_huge_patch_entry { + unsigned int addr; + unsigned int insn; +}; +extern struct pud_huge_patch_entry __pud_huge_patch, __pud_huge_patch_end; +#endif void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte); diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h index 5961b2d..8ee1f97 100644 --- a/arch/sparc/include/asm/page_64.h +++ b/arch/sparc/include/asm/page_64.h @@ -17,6 +17,7 @@ #define HPAGE_SHIFT23 #define REAL_HPAGE_SHIFT 22 +#define HPAGE_16GB_SHIFT 34 #define HPAGE_2GB_SHIFT31 #define HPAGE_256MB_SHIFT 28 #define HPAGE_64K_SHIFT16 @@ -28,7 +29,7 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT)) -#define HUGE_MAX_HSTATE4 +#define HUGE_MAX_HSTATE5 #endif #ifndef __ASSEMBLY__ diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 2579f5a..4fefe37 
100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd) return !!(pmd_val(pmd) & _PAGE_PMD_HUGE); } +static inline bool is_hugetlb_pud(pud_t pud) +{ + return !!(pud_val(pud) & _PAGE_PUD_HUGE); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline pmd_t pmd_mkhuge(pmd_t pmd) { diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h index 32258e0..acf5506 100644 --- a/arch/sparc/include/asm/tsb.h +++ b/arch/sparc/include/asm/tsb.h @@ -195,6 +195,41 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; nop; \ 699: + /* PUD has been loaded into REG1, interpret the value, seeing +* if it is a HUGE PUD or a normal one. If it is not valid +* then jump to FAIL_LABEL. If it is a HUGE PUD, and it +* translates to a valid PTE, branch to PTE_LABEL. +* +* We have to propagate bits [32:22] from the virtual address +* to resolve at 4M granularity. +*/ +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ +700: ba 700f;\ +nop; \ + .section.pud_huge_patch, "ax"; \ + .word 700b; \ + nop;\ + .previous; \ + brz,pn REG1, FAIL_LABEL; \ +sethi %uhi(_PAGE_PUD_HUGE), REG2; \ + sllxREG2, 32, REG2; \ + andcc REG1, REG2, %g0;\ + be,pt %xcc, 700f; \ +sethi %hi(0x1ffc), REG2; \ + sllxREG2, 1, REG2; \ + brgez,pnREG1, FAIL_LABEL; \ +andn REG1, REG2, REG1; \ + and VADDR, REG2, REG2; \ + brlz,pt REG1, PTE_LABEL;\ +or REG1, REG2, REG1; \ +700: +#else +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +nop; +#endif + /* PMD has been loaded into REG1, interpret the value, seeing * if it is a HUGE PMD or a normal one. If it is not valid * then jump to FAIL_LABEL. If it is a HUGE PMD, and it @@
[PATCH v5 1/3] sparc64: Support huge PUD case in get_user_pages
get_user_pages() is used to do direct IO. It already handles the case where the address range is backed by PMD huge pages. This patch now adds the case where the range could be backed by PUD huge pages. Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/pgtable_64.h | 15 +++-- arch/sparc/mm/gup.c | 45 - 2 files changed, 57 insertions(+), 3 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 6fbd931..2579f5a 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd) return pte_write(pte); } +#define pud_write(pud) pte_write(__pte(pud_val(pud))) + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline unsigned long pmd_dirty(pmd_t pmd) { @@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd) return ((unsigned long) __va(pfn << PAGE_SHIFT)); } + +static inline unsigned long pud_page_vaddr(pud_t pud) +{ + pte_t pte = __pte(pud_val(pud)); + unsigned long pfn; + + pfn = pte_pfn(pte); + + return ((unsigned long) __va(pfn << PAGE_SHIFT)); +} + #define pmd_page(pmd) virt_to_page((void *)__pmd_page(pmd)) -#define pud_page_vaddr(pud)\ - ((unsigned long) __va(pud_val(pud))) #define pud_page(pud) virt_to_page((void *)pud_page_vaddr(pud)) #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL) #define pud_present(pud) (pud_val(pud) != 0U) diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c index f80cfc6..d809099 100644 --- a/arch/sparc/mm/gup.c +++ b/arch/sparc/mm/gup.c @@ -103,6 +103,45 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, return 1; } +static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr, + unsigned long end, int write, struct page **pages, + int *nr) +{ + struct page *head, *page; + int refs; + + if (!(pud_val(pud) & _PAGE_VALID)) + return 0; + + if (write && !pud_write(pud)) + return 0; + + refs = 0; + page = pud_page(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); + head = 
compound_head(page); + do { + VM_BUG_ON(compound_head(page) != head); + pages[*nr] = page; + (*nr)++; + page++; + refs++; + } while (addr += PAGE_SIZE, addr != end); + + if (!page_cache_add_speculative(head, refs)) { + *nr -= refs; + return 0; + } + + if (unlikely(pud_val(pud) != pud_val(*pudp))) { + *nr -= refs; + while (refs--) + put_page(head); + return 0; + } + + return 1; +} + static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -141,7 +180,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, next = pud_addr_end(addr, end); if (pud_none(pud)) return 0; - if (!gup_pmd_range(pud, addr, next, write, pages, nr)) + if (unlikely(pud_large(pud))) { + if (!gup_huge_pud(pudp, pud, addr, next, + write, pages, nr)) + return 0; + } else if (!gup_pmd_range(pud, addr, next, write, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); -- 2.9.2
[PATCH v5 2/3] sparc64: Add 16GB hugepage support
Adds support for 16GB hugepage size. To use this page size use kernel parameters as: default_hugepagesz=16G hugepagesz=16G hugepages=10 Testing: Tested with the stream benchmark which allocates 48G of arrays backed by 16G hugepages and does RW operation on them in parallel. Orabug: 25362942 Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/hugetlb.h| 7 arch/sparc/include/asm/page_64.h| 3 +- arch/sparc/include/asm/pgtable_64.h | 5 +++ arch/sparc/include/asm/tsb.h| 36 ++ arch/sparc/kernel/tsb.S | 2 +- arch/sparc/kernel/vmlinux.lds.S | 5 +++ arch/sparc/mm/hugetlbpage.c | 74 ++--- arch/sparc/mm/init_64.c | 54 +++ 8 files changed, 156 insertions(+), 30 deletions(-) diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h index d1f837d..0ca7caa 100644 --- a/arch/sparc/include/asm/hugetlb.h +++ b/arch/sparc/include/asm/hugetlb.h @@ -4,6 +4,13 @@ #include #include +#ifdef CONFIG_HUGETLB_PAGE +struct pud_huge_patch_entry { + unsigned int addr; + unsigned int insn; +}; +extern struct pud_huge_patch_entry __pud_huge_patch, __pud_huge_patch_end; +#endif void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte); diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h index 5961b2d..8ee1f97 100644 --- a/arch/sparc/include/asm/page_64.h +++ b/arch/sparc/include/asm/page_64.h @@ -17,6 +17,7 @@ #define HPAGE_SHIFT23 #define REAL_HPAGE_SHIFT 22 +#define HPAGE_16GB_SHIFT 34 #define HPAGE_2GB_SHIFT31 #define HPAGE_256MB_SHIFT 28 #define HPAGE_64K_SHIFT16 @@ -28,7 +29,7 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT)) -#define HUGE_MAX_HSTATE4 +#define HUGE_MAX_HSTATE5 #endif #ifndef __ASSEMBLY__ diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 2579f5a..4fefe37 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ 
b/arch/sparc/include/asm/pgtable_64.h @@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd) return !!(pmd_val(pmd) & _PAGE_PMD_HUGE); } +static inline bool is_hugetlb_pud(pud_t pud) +{ + return !!(pud_val(pud) & _PAGE_PUD_HUGE); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline pmd_t pmd_mkhuge(pmd_t pmd) { diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h index 32258e0..acf5506 100644 --- a/arch/sparc/include/asm/tsb.h +++ b/arch/sparc/include/asm/tsb.h @@ -195,6 +195,41 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; nop; \ 699: + /* PUD has been loaded into REG1, interpret the value, seeing +* if it is a HUGE PUD or a normal one. If it is not valid +* then jump to FAIL_LABEL. If it is a HUGE PUD, and it +* translates to a valid PTE, branch to PTE_LABEL. +* +* We have to propagate bits [32:22] from the virtual address +* to resolve at 4M granularity. +*/ +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ +700: ba 700f;\ +nop; \ + .section.pud_huge_patch, "ax"; \ + .word 700b; \ + nop;\ + .previous; \ + brz,pn REG1, FAIL_LABEL; \ +sethi %uhi(_PAGE_PUD_HUGE), REG2; \ + sllxREG2, 32, REG2; \ + andcc REG1, REG2, %g0;\ + be,pt %xcc, 700f; \ +sethi %hi(0x1ffc), REG2; \ + sllxREG2, 1, REG2; \ + brgez,pnREG1, FAIL_LABEL; \ +andn REG1, REG2, REG1; \ + and VADDR, REG2, REG2; \ + brlz,pt REG1, PTE_LABEL;\ +or REG1, REG2, REG1; \ +700: +#else +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +nop; +#endif + /* PMD has been loaded into REG1, interpret the value, seeing * if it is a HUGE PMD or a normal one. If it is not valid * then jump to FAIL_LABEL. If it is a HUGE PMD, and it @@ -242,6 +277,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys
[PATCH v5 3/3] sparc64: Cleanup hugepage table walk functions
Flatten out nested code structure in huge_pte_offset() and huge_pte_alloc(). Signed-off-by: Nitin Gupta --- arch/sparc/mm/hugetlbpage.c | 54 + 1 file changed, 20 insertions(+), 34 deletions(-) diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index 7acb84d..bcd8cdb 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, pgd_t *pgd; pud_t *pud; pmd_t *pmd; - pte_t *pte = NULL; pgd = pgd_offset(mm, addr); pud = pud_alloc(mm, pgd, addr); if (!pud) return NULL; - if (sz >= PUD_SIZE) - pte = (pte_t *)pud; - else { - pmd = pmd_alloc(mm, pud, addr); - if (!pmd) - return NULL; - - if (sz >= PMD_SIZE) - pte = (pte_t *)pmd; - else - pte = pte_alloc_map(mm, pmd, addr); - } - - return pte; + return (pte_t *)pud; + pmd = pmd_alloc(mm, pud, addr); + if (!pmd) + return NULL; + if (sz >= PMD_SIZE) + return (pte_t *)pmd; + return pte_alloc_map(mm, pmd, addr); } pte_t *huge_pte_offset(struct mm_struct *mm, @@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, pgd_t *pgd; pud_t *pud; pmd_t *pmd; - pte_t *pte = NULL; pgd = pgd_offset(mm, addr); - if (!pgd_none(*pgd)) { - pud = pud_offset(pgd, addr); - if (!pud_none(*pud)) { - if (is_hugetlb_pud(*pud)) - pte = (pte_t *)pud; - else { - pmd = pmd_offset(pud, addr); - if (!pmd_none(*pmd)) { - if (is_hugetlb_pmd(*pmd)) - pte = (pte_t *)pmd; - else - pte = pte_offset_map(pmd, addr); - } - } - } - } - - return pte; + if (pgd_none(*pgd)) + return NULL; + pud = pud_offset(pgd, addr); + if (pud_none(*pud)) + return NULL; + if (is_hugetlb_pud(*pud)) + return (pte_t *)pud; + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) + return NULL; + if (is_hugetlb_pmd(*pmd)) + return (pte_t *)pmd; + return pte_offset_map(pmd, addr); } void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, -- 2.9.2
Re: [PATCH 2/3] sparc64: Add 16GB hugepage support
On 07/20/2017 01:04 PM, David Miller wrote: > From: Nitin Gupta > Date: Thu, 13 Jul 2017 14:53:24 -0700 > >> Testing: >> >> Tested with the stream benchmark which allocates 48G of >> arrays backed by 16G hugepages and does RW operation on >> them in parallel. > > It would be great if we started adding tests under > tools/testing/selftests so that other people can recreate > your tests/benchmarks. > Yes, I would like to add the stream benchmark to selftests too. I will check if our internal version of stream can be released. >> diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h >> index 32258e0..7b240a3 100644 >> --- a/arch/sparc/include/asm/tsb.h >> +++ b/arch/sparc/include/asm/tsb.h >> @@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, >> __tsb_phys_patch_end; >> nop; \ >> 699: >> >> +/* PUD has been loaded into REG1, interpret the value, seeing >> + * if it is a HUGE PUD or a normal one. If it is not valid >> + * then jump to FAIL_LABEL. If it is a HUGE PUD, and it >> + * translates to a valid PTE, branch to PTE_LABEL. >> + * >> + * We have to propagate bits [32:22] from the virtual address >> + * to resolve at 4M granularity. >> + */ >> +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) >> +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, >> PTE_LABEL) \ >> +brz,pn REG1, FAIL_LABEL; \ >> + sethi %uhi(_PAGE_PUD_HUGE), REG2; \ >> +sllxREG2, 32, REG2; \ >> +andcc REG1, REG2, %g0;\ >> +be,pt %xcc, 700f; \ >> + sethi %hi(0x1ffc), REG2; \ >> +sllxREG2, 1, REG2; \ >> +brgez,pnREG1, FAIL_LABEL; \ >> + andn REG1, REG2, REG1; \ >> +and VADDR, REG2, REG2; \ >> +brlz,pt REG1, PTE_LABEL;\ >> + or REG1, REG2, REG1; \ >> +700: >> +#else >> +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, >> PTE_LABEL) \ >> +brz,pn REG1, FAIL_LABEL; \ >> + nop; >> +#endif >> + >> /* PMD has been loaded into REG1, interpret the value, seeing >> * if it is a HUGE PMD or a normal one. 
If it is not valid >> * then jump to FAIL_LABEL. If it is a HUGE PMD, and it >> @@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, >> __tsb_phys_patch_end; >> srlxREG2, 64 - PAGE_SHIFT, REG2; \ >> andnREG2, 0x7, REG2; \ >> ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \ >> +USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \ >> brz,pn REG1, FAIL_LABEL; \ >> sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \ >> srlxREG2, 64 - PAGE_SHIFT, REG2; \ > > This macro is getting way out of control, every TLB/TSB miss is > going to invoke this sequence of code. > > Yes, it's just a two cycle constant load, a test modifying the > condition codes, and an easy to predict branch. > > But every machine will eat this overhead, even if they don't use > hugepages or don't set the 16GB knob. > > I think we can do better, using code patching or similar. > > Once the knob is set, you can know for sure that this code path > will never actually be taken. The simplest way I can think of is to add CONFIG_SPARC_16GB_HUGEPAGE and exclude PUD check if not enabled. Would this be okay? Thanks, Nitin
[PATCH] sparc64: Register hugepages during arch init
Add hstate for each supported hugepage size using arch initcall.

This change fixes some hugepage parameter parsing inconsistencies:

case 1: no hugepage parameters

 Without hugepage parameters, only a hugepages-8192kB entry is visible
 in sysfs. It's different from x86_64 where both 2M and 1G hugepage
 sizes are available.

case 2: default_hugepagesz=[64K|256M|2G]

 When specifying only a default_hugepagesz parameter, the default
 hugepage size isn't really changed and it stays at 8M. This is again
 different from x86_64.

Orabug: 25869946

Reviewed-by: Bob Picco
Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/init_64.c | 25 +++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 3c40ebd..fed73f1 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -325,6 +325,29 @@ static void __update_mmu_tsb_insert(struct mm_struct *mm, unsigned long tsb_inde
 }

 #ifdef CONFIG_HUGETLB_PAGE
+static void __init add_huge_page_size(unsigned long size)
+{
+	unsigned int order;
+
+	if (size_to_hstate(size))
+		return;
+
+	order = ilog2(size) - PAGE_SHIFT;
+	hugetlb_add_hstate(order);
+}
+
+static int __init hugetlbpage_init(void)
+{
+	add_huge_page_size(1UL << HPAGE_64K_SHIFT);
+	add_huge_page_size(1UL << HPAGE_SHIFT);
+	add_huge_page_size(1UL << HPAGE_256MB_SHIFT);
+	add_huge_page_size(1UL << HPAGE_2GB_SHIFT);
+
+	return 0;
+}
+
+arch_initcall(hugetlbpage_init);
+
 static int __init setup_hugepagesz(char *string)
 {
 	unsigned long long hugepage_size;
@@ -364,7 +387,7 @@ static int __init setup_hugepagesz(char *string)
 		goto out;
 	}

-	hugetlb_add_hstate(hugepage_shift - PAGE_SHIFT);
+	add_huge_page_size(hugepage_size);
 	rc = 1;

 out:
--
2.9.2
[PATCH 2/3] sparc64: Add 16GB hugepage support
Adds support for 16GB hugepage size. To use this page size use kernel parameters as: default_hugepagesz=16G hugepagesz=16G hugepages=10 Testing: Tested with the stream benchmark which allocates 48G of arrays backed by 16G hugepages and does RW operation on them in parallel. Orabug: 25362942 Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/page_64.h| 3 +- arch/sparc/include/asm/pgtable_64.h | 5 +++ arch/sparc/include/asm/tsb.h| 30 +++ arch/sparc/kernel/tsb.S | 2 +- arch/sparc/mm/hugetlbpage.c | 74 ++--- arch/sparc/mm/init_64.c | 41 6 files changed, 125 insertions(+), 30 deletions(-) diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h index 5961b2d..8ee1f97 100644 --- a/arch/sparc/include/asm/page_64.h +++ b/arch/sparc/include/asm/page_64.h @@ -17,6 +17,7 @@ #define HPAGE_SHIFT23 #define REAL_HPAGE_SHIFT 22 +#define HPAGE_16GB_SHIFT 34 #define HPAGE_2GB_SHIFT31 #define HPAGE_256MB_SHIFT 28 #define HPAGE_64K_SHIFT16 @@ -28,7 +29,7 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT)) -#define HUGE_MAX_HSTATE4 +#define HUGE_MAX_HSTATE5 #endif #ifndef __ASSEMBLY__ diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 2579f5a..4fefe37 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd) return !!(pmd_val(pmd) & _PAGE_PMD_HUGE); } +static inline bool is_hugetlb_pud(pud_t pud) +{ + return !!(pud_val(pud) & _PAGE_PUD_HUGE); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline pmd_t pmd_mkhuge(pmd_t pmd) { diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h index 32258e0..7b240a3 100644 --- a/arch/sparc/include/asm/tsb.h +++ b/arch/sparc/include/asm/tsb.h @@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; nop; \ 699: 
+ /* PUD has been loaded into REG1, interpret the value, seeing +* if it is a HUGE PUD or a normal one. If it is not valid +* then jump to FAIL_LABEL. If it is a HUGE PUD, and it +* translates to a valid PTE, branch to PTE_LABEL. +* +* We have to propagate bits [32:22] from the virtual address +* to resolve at 4M granularity. +*/ +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +sethi %uhi(_PAGE_PUD_HUGE), REG2; \ + sllxREG2, 32, REG2; \ + andcc REG1, REG2, %g0;\ + be,pt %xcc, 700f; \ +sethi %hi(0x1ffc), REG2; \ + sllxREG2, 1, REG2; \ + brgez,pnREG1, FAIL_LABEL; \ +andn REG1, REG2, REG1; \ + and VADDR, REG2, REG2; \ + brlz,pt REG1, PTE_LABEL;\ +or REG1, REG2, REG1; \ +700: +#else +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +nop; +#endif + /* PMD has been loaded into REG1, interpret the value, seeing * if it is a HUGE PMD or a normal one. If it is not valid * then jump to FAIL_LABEL. If it is a HUGE PMD, and it @@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; srlxREG2, 64 - PAGE_SHIFT, REG2; \ andnREG2, 0x7, REG2; \ ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \ + USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \ brz,pn REG1, FAIL_LABEL; \ sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \ srlxREG2, 64 - PAGE_SHIFT, REG2; \ diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S index 07c0df9..5f42ac0 100644 --- a/arch/sparc/kernel/tsb.S +++ b/arch/sparc/kernel/tsb.S @@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath: /* Valid PTE is now in %g5. 
*/ #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) - sethi %uhi(_PAGE_PMD_HUGE), %g7 + sethi %uhi(_PAGE_PMD_HUGE | _PAGE_PUD_HUGE), %g7 sllx%g7, 32, %g7 andcc %g5, %g7, %g0 diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
[PATCH 3/3] sparc64: Cleanup hugepage table walk functions
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/hugetlbpage.c | 54 ++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7acb84d..bcd8cdb 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
 	if (!pud)
 		return NULL;
-	if (sz >= PUD_SIZE)
-		pte = (pte_t *)pud;
-	else {
-		pmd = pmd_alloc(mm, pud, addr);
-		if (!pmd)
-			return NULL;
-
-		if (sz >= PMD_SIZE)
-			pte = (pte_t *)pmd;
-		else
-			pte = pte_alloc_map(mm, pmd, addr);
-	}
-
-	return pte;
+	if (sz >= PUD_SIZE)
+		return (pte_t *)pud;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return NULL;
+	if (sz >= PMD_SIZE)
+		return (pte_t *)pmd;
+	return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	if (!pgd_none(*pgd)) {
-		pud = pud_offset(pgd, addr);
-		if (!pud_none(*pud)) {
-			if (is_hugetlb_pud(*pud))
-				pte = (pte_t *)pud;
-			else {
-				pmd = pmd_offset(pud, addr);
-				if (!pmd_none(*pmd)) {
-					if (is_hugetlb_pmd(*pmd))
-						pte = (pte_t *)pmd;
-					else
-						pte = pte_offset_map(pmd, addr);
-				}
-			}
-		}
-	}
-
-	return pte;
+	if (pgd_none(*pgd))
+		return NULL;
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return NULL;
+	if (is_hugetlb_pud(*pud))
+		return (pte_t *)pud;
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+	if (is_hugetlb_pmd(*pmd))
+		return (pte_t *)pmd;
+	return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2
[PATCH 1/3] sparc64: Support huge PUD case in get_user_pages
get_user_pages() is used to do direct IO. It already handles the case where the address range is backed by PMD huge pages. This patch now adds the case where the range could be backed by PUD huge pages. Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/pgtable_64.h | 15 ++-- arch/sparc/mm/gup.c | 47 - 2 files changed, 59 insertions(+), 3 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 6fbd931..2579f5a 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd) return pte_write(pte); } +#define pud_write(pud) pte_write(__pte(pud_val(pud))) + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline unsigned long pmd_dirty(pmd_t pmd) { @@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd) return ((unsigned long) __va(pfn << PAGE_SHIFT)); } + +static inline unsigned long pud_page_vaddr(pud_t pud) +{ + pte_t pte = __pte(pud_val(pud)); + unsigned long pfn; + + pfn = pte_pfn(pte); + + return ((unsigned long) __va(pfn << PAGE_SHIFT)); +} + #define pmd_page(pmd) virt_to_page((void *)__pmd_page(pmd)) -#define pud_page_vaddr(pud)\ - ((unsigned long) __va(pud_val(pud))) #define pud_page(pud) virt_to_page((void *)pud_page_vaddr(pud)) #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL) #define pud_present(pud) (pud_val(pud) != 0U) diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c index f80cfc6..d777594 100644 --- a/arch/sparc/mm/gup.c +++ b/arch/sparc/mm/gup.c @@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, return 1; } +static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr, + unsigned long end, int write, struct page **pages, + int *nr) +{ + struct page *head, *page; + int refs; + + if (!(pud_val(pud) & _PAGE_VALID)) + return 0; + + if (write && !pud_write(pud)) + return 0; + + refs = 0; + head = pud_page(pud); + page = head + ((addr & ~PUD_MASK) >> 
PAGE_SHIFT); + if (PageTail(head)) + head = compound_head(head); + do { + VM_BUG_ON(compound_head(page) != head); + pages[*nr] = page; + (*nr)++; + page++; + refs++; + } while (addr += PAGE_SIZE, addr != end); + + if (!page_cache_add_speculative(head, refs)) { + *nr -= refs; + return 0; + } + + if (unlikely(pud_val(pud) != pud_val(*pudp))) { + *nr -= refs; + while (refs--) + put_page(head); + return 0; + } + + return 1; +} + static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, next = pud_addr_end(addr, end); if (pud_none(pud)) return 0; - if (!gup_pmd_range(pud, addr, next, write, pages, nr)) + if (unlikely(pud_large(pud))) { + if (!gup_huge_pud(pudp, pud, addr, next, + write, pages, nr)) + return 0; + } else if (!gup_pmd_range(pud, addr, next, write, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); -- 2.9.2
[PATCH v2] sparc64: Fix gup_huge_pmd
The function assumes that each PMD points to the head of a huge
page. This is not correct as a PMD can point to the start of any
8M region within a, say, 256M hugepage. The fix ensures that it
points to the correct head of any PMD huge page.

Cc: Julian Calaby
Signed-off-by: Nitin Gupta
---
Changes since v1:
 - Clarify use of 'head' variable (Julian Calaby)

 arch/sparc/mm/gup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..f80cfc6 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -78,8 +78,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 
 	refs = 0;
-	head = pmd_page(pmd);
-	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	head = compound_head(page);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-- 
2.9.2
Re: [PATCH] sparc64: Fix gup_huge_pmd
Hi Julian,

On 6/22/17 3:53 AM, Julian Calaby wrote:
> On Thu, Jun 22, 2017 at 7:50 AM, Nitin Gupta wrote:
>> The function assumes that each PMD points to head of a huge page.
>> This is not correct as a PMD can point to start of any 8M region
>> with a, say 256M, hugepage. The fix ensures that it points to the
>> correct head of any PMD huge page.
>>
>> Signed-off-by: Nitin Gupta
>> ---
>>  arch/sparc/mm/gup.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
>> index cd0e32b..9116a6f 100644
>> --- a/arch/sparc/mm/gup.c
>> +++ b/arch/sparc/mm/gup.c
>> @@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>>         refs = 0;
>>         head = pmd_page(pmd);
>>         page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
>> +       if (PageTail(head))
>> +               head = compound_head(head);
>
> Stupid question: shouldn't this go before the page calculation?

No, it should be after the page calculation:

First, 'head' points to the base page of the PMD region, then 'page'
points to an offset within that region. Finally, we make sure that
'head' points to the head of the compound page which contains addr.

I think the confusion comes from using 'head' to point at a non-head
page. So maybe it would be clearer to write that part of the function
this way:

	page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
	head = compound_head(page);

Thanks,
Nitin
[PATCH] sparc64: Fix gup_huge_pmd
The function assumes that each PMD points to the head of a huge
page. This is not correct as a PMD can point to the start of any
8M region within a, say, 256M hugepage. The fix ensures that it
points to the correct head of any PMD huge page.

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..9116a6f 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	if (PageTail(head))
+		head = compound_head(head);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-- 
2.9.2
[PATCH 3/4] sparc64: Fix gup_huge_pmd
The function assumes that each PMD points to the head of a huge
page. This is not correct as a PMD can point to the start of any
8M region within a, say, 256M hugepage. The fix ensures that it
points to the correct head of any PMD huge page.

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	if (PageTail(head))
+		head = compound_head(head);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-- 
2.9.2
[PATCH 2/4] sparc64: Support huge PUD case in get_user_pages
get_user_pages() is used to do direct IO. It already handles the case where the address range is backed by PMD huge pages. This patch now adds the case where the range could be backed by PUD huge pages. Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/pgtable_64.h | 15 ++-- arch/sparc/mm/gup.c | 47 - 2 files changed, 59 insertions(+), 3 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 2444b02..4fefe37 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd) return pte_write(pte); } +#define pud_write(pud) pte_write(__pte(pud_val(pud))) + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline unsigned long pmd_dirty(pmd_t pmd) { @@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd) return ((unsigned long) __va(pfn << PAGE_SHIFT)); } + +static inline unsigned long pud_page_vaddr(pud_t pud) +{ + pte_t pte = __pte(pud_val(pud)); + unsigned long pfn; + + pfn = pte_pfn(pte); + + return ((unsigned long) __va(pfn << PAGE_SHIFT)); +} + #define pmd_page(pmd) virt_to_page((void *)__pmd_page(pmd)) -#define pud_page_vaddr(pud)\ - ((unsigned long) __va(pud_val(pud))) #define pud_page(pud) virt_to_page((void *)pud_page_vaddr(pud)) #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL) #define pud_present(pud) (pud_val(pud) != 0U) diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c index cd0e32b..7cfa9c5 100644 --- a/arch/sparc/mm/gup.c +++ b/arch/sparc/mm/gup.c @@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, return 1; } +static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr, + unsigned long end, int write, struct page **pages, + int *nr) +{ + struct page *head, *page; + int refs; + + if (!(pud_val(pud) & _PAGE_VALID)) + return 0; + + if (write && !pud_write(pud)) + return 0; + + refs = 0; + head = pud_page(pud); + page = head + ((addr & ~PUD_MASK) >> 
PAGE_SHIFT); + if (PageTail(head)) + head = compound_head(head); + do { + VM_BUG_ON(compound_head(page) != head); + pages[*nr] = page; + (*nr)++; + page++; + refs++; + } while (addr += PAGE_SIZE, addr != end); + + if (!page_cache_add_speculative(head, refs)) { + *nr -= refs; + return 0; + } + + if (unlikely(pud_val(pud) != pud_val(*pudp))) { + *nr -= refs; + while (refs--) + put_page(head); + return 0; + } + + return 1; +} + static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, next = pud_addr_end(addr, end); if (pud_none(pud)) return 0; - if (!gup_pmd_range(pud, addr, next, write, pages, nr)) + if (unlikely(pud_large(pud))) { + if (!gup_huge_pud(pudp, pud, addr, next, + write, pages, nr)) + return 0; + } else if (!gup_pmd_range(pud, addr, next, write, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); -- 2.9.2
[PATCH 1/4] sparc64: Add 16GB hugepage support
Adds support for 16GB hugepage size. To use this page size use kernel parameters as: default_hugepagesz=16G hugepagesz=16G hugepages=10 Testing: Tested with the stream benchmark which allocates 48G of arrays backed by 16G hugepages and does RW operation on them in parallel. Orabug: 25362942 Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/page_64.h| 3 +- arch/sparc/include/asm/pgtable_64.h | 5 +++ arch/sparc/include/asm/tsb.h| 30 +++ arch/sparc/kernel/tsb.S | 2 +- arch/sparc/mm/hugetlbpage.c | 74 ++--- arch/sparc/mm/init_64.c | 41 6 files changed, 125 insertions(+), 30 deletions(-) diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h index 5961b2d..8ee1f97 100644 --- a/arch/sparc/include/asm/page_64.h +++ b/arch/sparc/include/asm/page_64.h @@ -17,6 +17,7 @@ #define HPAGE_SHIFT23 #define REAL_HPAGE_SHIFT 22 +#define HPAGE_16GB_SHIFT 34 #define HPAGE_2GB_SHIFT31 #define HPAGE_256MB_SHIFT 28 #define HPAGE_64K_SHIFT16 @@ -28,7 +29,7 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT)) -#define HUGE_MAX_HSTATE4 +#define HUGE_MAX_HSTATE5 #endif #ifndef __ASSEMBLY__ diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 6fbd931..2444b02 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd) return !!(pmd_val(pmd) & _PAGE_PMD_HUGE); } +static inline bool is_hugetlb_pud(pud_t pud) +{ + return !!(pud_val(pud) & _PAGE_PUD_HUGE); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline pmd_t pmd_mkhuge(pmd_t pmd) { diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h index 32258e0..7b240a3 100644 --- a/arch/sparc/include/asm/tsb.h +++ b/arch/sparc/include/asm/tsb.h @@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; nop; \ 699: 
+ /* PUD has been loaded into REG1, interpret the value, seeing +* if it is a HUGE PUD or a normal one. If it is not valid +* then jump to FAIL_LABEL. If it is a HUGE PUD, and it +* translates to a valid PTE, branch to PTE_LABEL. +* +* We have to propagate bits [32:22] from the virtual address +* to resolve at 4M granularity. +*/ +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +sethi %uhi(_PAGE_PUD_HUGE), REG2; \ + sllxREG2, 32, REG2; \ + andcc REG1, REG2, %g0;\ + be,pt %xcc, 700f; \ +sethi %hi(0x1ffc), REG2; \ + sllxREG2, 1, REG2; \ + brgez,pnREG1, FAIL_LABEL; \ +andn REG1, REG2, REG1; \ + and VADDR, REG2, REG2; \ + brlz,pt REG1, PTE_LABEL;\ +or REG1, REG2, REG1; \ +700: +#else +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +nop; +#endif + /* PMD has been loaded into REG1, interpret the value, seeing * if it is a HUGE PMD or a normal one. If it is not valid * then jump to FAIL_LABEL. If it is a HUGE PMD, and it @@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; srlxREG2, 64 - PAGE_SHIFT, REG2; \ andnREG2, 0x7, REG2; \ ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \ + USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \ brz,pn REG1, FAIL_LABEL; \ sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \ srlxREG2, 64 - PAGE_SHIFT, REG2; \ diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S index 07c0df9..5f42ac0 100644 --- a/arch/sparc/kernel/tsb.S +++ b/arch/sparc/kernel/tsb.S @@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath: /* Valid PTE is now in %g5. 
*/ #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) - sethi %uhi(_PAGE_PMD_HUGE), %g7 + sethi %uhi(_PAGE_PMD_HUGE | _PAGE_PUD_HUGE), %g7 sllx%g7, 32, %g7 andcc %g5, %g7, %g0 diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
[PATCH 4/4] sparc64: Cleanup hugepage table walk functions
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/hugetlbpage.c | 54 ++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
 	if (!pud)
 		return NULL;
-	if (sz >= PUD_SIZE)
-		pte = (pte_t *)pud;
-	else {
-		pmd = pmd_alloc(mm, pud, addr);
-		if (!pmd)
-			return NULL;
-
-		if (sz >= PMD_SIZE)
-			pte = (pte_t *)pmd;
-		else
-			pte = pte_alloc_map(mm, pmd, addr);
-	}
-
-	return pte;
+	if (sz >= PUD_SIZE)
+		return (pte_t *)pud;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return NULL;
+	if (sz >= PMD_SIZE)
+		return (pte_t *)pmd;
+	return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	if (!pgd_none(*pgd)) {
-		pud = pud_offset(pgd, addr);
-		if (!pud_none(*pud)) {
-			if (is_hugetlb_pud(*pud))
-				pte = (pte_t *)pud;
-			else {
-				pmd = pmd_offset(pud, addr);
-				if (!pmd_none(*pmd)) {
-					if (is_hugetlb_pmd(*pmd))
-						pte = (pte_t *)pmd;
-					else
-						pte = pte_offset_map(pmd, addr);
-				}
-			}
-		}
-	}
-
-	return pte;
+	if (pgd_none(*pgd))
+		return NULL;
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return NULL;
+	if (is_hugetlb_pud(*pud))
+		return (pte_t *)pud;
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+	if (is_hugetlb_pmd(*pmd))
+		return (pte_t *)pmd;
+	return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2
[PATCH v3 3/4] sparc64: Fix gup_huge_pmd
The function assumes that each PMD points to the head of a huge
page. This is not correct as a PMD can point to the start of any
8M region within a, say, 256M hugepage. The fix ensures that it
points to the correct head of any PMD huge page.

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	if (PageTail(head))
+		head = compound_head(head);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-- 
2.9.2
[PATCH v3 4/4] sparc64: Cleanup hugepage table walk functions
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/hugetlbpage.c | 54 ++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
 	if (!pud)
 		return NULL;
-	if (sz >= PUD_SIZE)
-		pte = (pte_t *)pud;
-	else {
-		pmd = pmd_alloc(mm, pud, addr);
-		if (!pmd)
-			return NULL;
-
-		if (sz >= PMD_SIZE)
-			pte = (pte_t *)pmd;
-		else
-			pte = pte_alloc_map(mm, pmd, addr);
-	}
-
-	return pte;
+	if (sz >= PUD_SIZE)
+		return (pte_t *)pud;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return NULL;
+	if (sz >= PMD_SIZE)
+		return (pte_t *)pmd;
+	return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	if (!pgd_none(*pgd)) {
-		pud = pud_offset(pgd, addr);
-		if (!pud_none(*pud)) {
-			if (is_hugetlb_pud(*pud))
-				pte = (pte_t *)pud;
-			else {
-				pmd = pmd_offset(pud, addr);
-				if (!pmd_none(*pmd)) {
-					if (is_hugetlb_pmd(*pmd))
-						pte = (pte_t *)pmd;
-					else
-						pte = pte_offset_map(pmd, addr);
-				}
-			}
-		}
-	}
-
-	return pte;
+	if (pgd_none(*pgd))
+		return NULL;
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return NULL;
+	if (is_hugetlb_pud(*pud))
+		return (pte_t *)pud;
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+	if (is_hugetlb_pmd(*pmd))
+		return (pte_t *)pmd;
+	return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2
[PATCH v3 2/4] sparc64: Support huge PUD case in get_user_pages
get_user_pages() is used to do direct IO. It already handles the case where the address range is backed by PMD huge pages. This patch now adds the case where the range could be backed by PUD huge pages. Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/pgtable_64.h | 15 ++-- arch/sparc/mm/gup.c | 47 - 2 files changed, 59 insertions(+), 3 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 2444b02..4fefe37 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd) return pte_write(pte); } +#define pud_write(pud) pte_write(__pte(pud_val(pud))) + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline unsigned long pmd_dirty(pmd_t pmd) { @@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd) return ((unsigned long) __va(pfn << PAGE_SHIFT)); } + +static inline unsigned long pud_page_vaddr(pud_t pud) +{ + pte_t pte = __pte(pud_val(pud)); + unsigned long pfn; + + pfn = pte_pfn(pte); + + return ((unsigned long) __va(pfn << PAGE_SHIFT)); +} + #define pmd_page(pmd) virt_to_page((void *)__pmd_page(pmd)) -#define pud_page_vaddr(pud)\ - ((unsigned long) __va(pud_val(pud))) #define pud_page(pud) virt_to_page((void *)pud_page_vaddr(pud)) #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL) #define pud_present(pud) (pud_val(pud) != 0U) diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c index cd0e32b..7cfa9c5 100644 --- a/arch/sparc/mm/gup.c +++ b/arch/sparc/mm/gup.c @@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr, return 1; } +static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr, + unsigned long end, int write, struct page **pages, + int *nr) +{ + struct page *head, *page; + int refs; + + if (!(pud_val(pud) & _PAGE_VALID)) + return 0; + + if (write && !pud_write(pud)) + return 0; + + refs = 0; + head = pud_page(pud); + page = head + ((addr & ~PUD_MASK) >> 
PAGE_SHIFT); + if (PageTail(head)) + head = compound_head(head); + do { + VM_BUG_ON(compound_head(page) != head); + pages[*nr] = page; + (*nr)++; + page++; + refs++; + } while (addr += PAGE_SIZE, addr != end); + + if (!page_cache_add_speculative(head, refs)) { + *nr -= refs; + return 0; + } + + if (unlikely(pud_val(pud) != pud_val(*pudp))) { + *nr -= refs; + while (refs--) + put_page(head); + return 0; + } + + return 1; +} + static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { @@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, next = pud_addr_end(addr, end); if (pud_none(pud)) return 0; - if (!gup_pmd_range(pud, addr, next, write, pages, nr)) + if (unlikely(pud_large(pud))) { + if (!gup_huge_pud(pudp, pud, addr, next, + write, pages, nr)) + return 0; + } else if (!gup_pmd_range(pud, addr, next, write, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); -- 2.9.2
[PATCH v3 1/4] sparc64: Add 16GB hugepage support
Adds support for 16GB hugepage size. To use this page size use kernel parameters as: default_hugepagesz=16G hugepagesz=16G hugepages=10 Testing: Tested with the stream benchmark which allocates 48G of arrays backed by 16G hugepages and does RW operation on them in parallel. Orabug: 25362942 Signed-off-by: Nitin Gupta --- Changelog v3 vs v2: - Fixed email headers so the subject shows up correctly Changelog v2 vs v1: - Remove redundant brgez,pn (Bob Picco) - Remove unncessary label rename from 700 to 701 (Rob Gardner) - Add patch description (Paul) - Add 16G case to get_user_pages() arch/sparc/include/asm/page_64.h| 3 +- arch/sparc/include/asm/pgtable_64.h | 5 +++ arch/sparc/include/asm/tsb.h| 30 +++ arch/sparc/kernel/tsb.S | 2 +- arch/sparc/mm/hugetlbpage.c | 74 ++--- arch/sparc/mm/init_64.c | 41 6 files changed, 125 insertions(+), 30 deletions(-) diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h index 5961b2d..8ee1f97 100644 --- a/arch/sparc/include/asm/page_64.h +++ b/arch/sparc/include/asm/page_64.h @@ -17,6 +17,7 @@ #define HPAGE_SHIFT23 #define REAL_HPAGE_SHIFT 22 +#define HPAGE_16GB_SHIFT 34 #define HPAGE_2GB_SHIFT31 #define HPAGE_256MB_SHIFT 28 #define HPAGE_64K_SHIFT16 @@ -28,7 +29,7 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT)) -#define HUGE_MAX_HSTATE4 +#define HUGE_MAX_HSTATE5 #endif #ifndef __ASSEMBLY__ diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 6fbd931..2444b02 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd) return !!(pmd_val(pmd) & _PAGE_PMD_HUGE); } +static inline bool is_hugetlb_pud(pud_t pud) +{ + return !!(pud_val(pud) & _PAGE_PUD_HUGE); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline pmd_t pmd_mkhuge(pmd_t pmd) { diff --git 
a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h index 32258e0..7b240a3 100644 --- a/arch/sparc/include/asm/tsb.h +++ b/arch/sparc/include/asm/tsb.h @@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; nop; \ 699: + /* PUD has been loaded into REG1, interpret the value, seeing +* if it is a HUGE PUD or a normal one. If it is not valid +* then jump to FAIL_LABEL. If it is a HUGE PUD, and it +* translates to a valid PTE, branch to PTE_LABEL. +* +* We have to propagate bits [32:22] from the virtual address +* to resolve at 4M granularity. +*/ +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +sethi %uhi(_PAGE_PUD_HUGE), REG2; \ + sllxREG2, 32, REG2; \ + andcc REG1, REG2, %g0;\ + be,pt %xcc, 700f; \ +sethi %hi(0x1ffc), REG2; \ + sllxREG2, 1, REG2; \ + brgez,pnREG1, FAIL_LABEL; \ +andn REG1, REG2, REG1; \ + and VADDR, REG2, REG2; \ + brlz,pt REG1, PTE_LABEL;\ +or REG1, REG2, REG1; \ +700: +#else +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +nop; +#endif + /* PMD has been loaded into REG1, interpret the value, seeing * if it is a HUGE PMD or a normal one. If it is not valid * then jump to FAIL_LABEL. 
If it is a HUGE PMD, and it @@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; srlxREG2, 64 - PAGE_SHIFT, REG2; \ andnREG2, 0x7, REG2; \ ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \ + USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \ brz,pn REG1, FAIL_LABEL; \ sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \ srlxREG2, 64 - PAGE_SHIFT, REG2; \ diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S index 07c0df9..5f42ac0 100644 --- a/arch/sparc/kernel/tsb.S +++ b/arch/sparc/kernel/tsb.S @@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath: /* Valid PTE is now in %g5. */ #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
Re: From: Nitin Gupta
Please ignore this patch series. I will resend again with correct email headers.

Nitin
[PATCH v2 3/4] sparc64: Fix gup_huge_pmd
The function assumes that each PMD points to the head of a huge page.
This is not correct, as a PMD can point to the start of any 8M region
within a larger (say, 256M) hugepage. The fix ensures that it points to
the correct head page of any PMD huge page.

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	if (PageTail(head))
+		head = compound_head(head);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-- 
2.9.2
[PATCH v2 1/4] sparc64: Add 16GB hugepage support
From: Nitin Gupta
Adds support for 16GB hugepage size. To use this page size, specify the
kernel parameters as:

	default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of arrays backed
by 16G hugepages and does RW operation on them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta
---
Changelog v2 vs v1:
 - Remove redundant brgez,pn (Bob Picco)
 - Remove unnecessary label rename from 700 to 701 (Rob Gardner)
 - Add patch description (Paul)
 - Add 16G case to get_user_pages()

 arch/sparc/include/asm/page_64.h    |  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h        | 30 +++
 arch/sparc/kernel/tsb.S             |  2 +-
 arch/sparc/mm/hugetlbpage.c         | 74 ++---
 arch/sparc/mm/init_64.c             | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 #define HPAGE_SHIFT		23
 #define REAL_HPAGE_SHIFT	22
+#define HPAGE_16GB_SHIFT	34
 #define HPAGE_2GB_SHIFT		31
 #define HPAGE_256MB_SHIFT	28
 #define HPAGE_64K_SHIFT		16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE	(_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE		4
+#define HUGE_MAX_HSTATE		5
 #endif

 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
 	return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }

+static inline bool is_hugetlb_pud(pud_t pud)
+{
+	return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
	 nop; \
699:

+	/* PUD has been loaded into REG1, interpret the value, seeing
+	 * if it is a HUGE PUD or a normal one.  If it is not valid
+	 * then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+	 * translates to a valid PTE, branch to PTE_LABEL.
+	 *
+	 * We have to propagate bits [32:22] from the virtual address
+	 * to resolve at 4M granularity.
+	 */
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+	brz,pn REG1, FAIL_LABEL; \
+	 sethi %uhi(_PAGE_PUD_HUGE), REG2; \
+	sllx REG2, 32, REG2; \
+	andcc REG1, REG2, %g0; \
+	be,pt %xcc, 700f; \
+	 sethi %hi(0x1ffc0000), REG2; \
+	sllx REG2, 1, REG2; \
+	brgez,pn REG1, FAIL_LABEL; \
+	 andn REG1, REG2, REG1; \
+	and VADDR, REG2, REG2; \
+	brlz,pt REG1, PTE_LABEL; \
+	 or REG1, REG2, REG1; \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+	brz,pn REG1, FAIL_LABEL; \
+	 nop;
+#endif
+
 	/* PMD has been loaded into REG1, interpret the value, seeing
 	 * if it is a HUGE PMD or a normal one.  If it is not valid
 	 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
 	srlx REG2, 64 - PAGE_SHIFT, REG2; \
 	andn REG2, 0x7, REG2; \
 	ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+	USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
 	brz,pn REG1, FAIL_LABEL; \
 	 sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
 	srlx REG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
 	/* Valid PTE is now in %g5. */
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-	sethi %uhi(_PAGE_PMD_HUGE), %g7
+	sethi
[PATCH v2 4/4] sparc64: Cleanup hugepage table walk functions
Flatten out nested code structure in huge_pte_offset() and
huge_pte_alloc().

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;

 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
 	if (!pud)
 		return NULL;
-	if (sz >= PUD_SIZE)
-		pte = (pte_t *)pud;
-	else {
-		pmd = pmd_alloc(mm, pud, addr);
-		if (!pmd)
-			return NULL;
-
-		if (sz >= PMD_SIZE)
-			pte = (pte_t *)pmd;
-		else
-			pte = pte_alloc_map(mm, pmd, addr);
-	}
-
-	return pte;
+	if (sz >= PUD_SIZE)
+		return (pte_t *)pud;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return NULL;
+	if (sz >= PMD_SIZE)
+		return (pte_t *)pmd;
+	return pte_alloc_map(mm, pmd, addr);
 }

 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte = NULL;

 	pgd = pgd_offset(mm, addr);
-	if (!pgd_none(*pgd)) {
-		pud = pud_offset(pgd, addr);
-		if (!pud_none(*pud)) {
-			if (is_hugetlb_pud(*pud))
-				pte = (pte_t *)pud;
-			else {
-				pmd = pmd_offset(pud, addr);
-				if (!pmd_none(*pmd)) {
-					if (is_hugetlb_pmd(*pmd))
-						pte = (pte_t *)pmd;
-					else
-						pte = pte_offset_map(pmd, addr);
-				}
-			}
-		}
-	}
-
-	return pte;
+	if (pgd_none(*pgd))
+		return NULL;
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return NULL;
+	if (is_hugetlb_pud(*pud))
+		return (pte_t *)pud;
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+	if (is_hugetlb_pmd(*pmd))
+		return (pte_t *)pmd;
+	return pte_offset_map(pmd, addr);
 }

 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2
[PATCH v2 2/4] sparc64: Support huge PUD case in get_user_pages
get_user_pages() is used to do direct IO. It already handles the case
where the address range is backed by PMD huge pages. This patch now
adds the case where the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c                 | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 2444b02..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
 	return pte_write(pte);
 }

+#define pud_write(pud)	pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)

 	return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+	pte_t pte = __pte(pud_val(pud));
+	unsigned long pfn;
+
+	pfn = pte_pfn(pte);
+
+	return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)			virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)		\
-	((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)			virt_to_page((void *)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)			(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)		(pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..7cfa9c5 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	return 1;
 }

+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+			unsigned long end, int write, struct page **pages,
+			int *nr)
+{
+	struct page *head, *page;
+	int refs;
+
+	if (!(pud_val(pud) & _PAGE_VALID))
+		return 0;
+
+	if (write && !pud_write(pud))
+		return 0;
+
+	refs = 0;
+	head = pud_page(pud);
+	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	if (PageTail(head))
+		head = compound_head(head);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
 			return 0;
-		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+		if (unlikely(pud_large(pud))) {
+			if (!gup_huge_pud(pudp, pud, addr, next,
+					  write, pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
-- 
2.9.2
Re: [PATCH] sparc64: Add 16GB hugepage support
On 5/24/17 8:45 PM, David Miller wrote:
> From: Paul Gortmaker
> Date: Wed, 24 May 2017 23:34:42 -0400
>
>> [[PATCH] sparc64: Add 16GB hugepage support] On 24/05/2017 (Wed 17:29) Nitin Gupta wrote:
>>
>>> Orabug: 25362942
>>>
>>> Signed-off-by: Nitin Gupta
>>
>> If this wasn't an accidental git send-email misfire, then there should
>> be a long log indicating the use case, the performance increase, the
>> testing that was done, etc. etc.
>>
>> Normally I'd not notice but since I was Cc'd I figured it was worth a
>> mention -- for example the vendor ID above doesn't mean a thing to
>> all the rest of us, hence why I suspect it was a git send-email misfire;
>> sadly, I think we've all accidentally done that at least once.
>
> Agreed.
>
> No commit message whatsoever is basically unacceptable for something
> like this.

Ok, I will include usage, testing notes, performance numbers etc., in
the v2 patch. Still, I do try to include "Orabug" for better tracking
of bugs internally; I hope that's okay.

Thanks,
Nitin
[PATCH] sparc64: Add 16GB hugepage support
Orabug: 25362942 Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/page_64.h| 3 +- arch/sparc/include/asm/pgtable_64.h | 5 +++ arch/sparc/include/asm/tsb.h| 35 +- arch/sparc/kernel/tsb.S | 2 +- arch/sparc/mm/hugetlbpage.c | 74 ++--- arch/sparc/mm/init_64.c | 41 6 files changed, 128 insertions(+), 32 deletions(-) diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h index 5961b2d..8ee1f97 100644 --- a/arch/sparc/include/asm/page_64.h +++ b/arch/sparc/include/asm/page_64.h @@ -17,6 +17,7 @@ #define HPAGE_SHIFT23 #define REAL_HPAGE_SHIFT 22 +#define HPAGE_16GB_SHIFT 34 #define HPAGE_2GB_SHIFT31 #define HPAGE_256MB_SHIFT 28 #define HPAGE_64K_SHIFT16 @@ -28,7 +29,7 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT)) -#define HUGE_MAX_HSTATE4 +#define HUGE_MAX_HSTATE5 #endif #ifndef __ASSEMBLY__ diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 6fbd931..2444b02 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd) return !!(pmd_val(pmd) & _PAGE_PMD_HUGE); } +static inline bool is_hugetlb_pud(pud_t pud) +{ + return !!(pud_val(pud) & _PAGE_PUD_HUGE); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline pmd_t pmd_mkhuge(pmd_t pmd) { diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h index 32258e0..fbd8da7 100644 --- a/arch/sparc/include/asm/tsb.h +++ b/arch/sparc/include/asm/tsb.h @@ -195,6 +195,36 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; nop; \ 699: + /* PUD has been loaded into REG1, interpret the value, seeing +* if it is a HUGE PUD or a normal one. If it is not valid +* then jump to FAIL_LABEL. If it is a HUGE PUD, and it +* translates to a valid PTE, branch to PTE_LABEL. 
+* +* We have to propagate bits [32:22] from the virtual address +* to resolve at 4M granularity. +*/ +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +sethi %uhi(_PAGE_PUD_HUGE), REG2; \ + sllxREG2, 32, REG2; \ + andcc REG1, REG2, %g0;\ + be,pt %xcc, 700f; \ +sethi %hi(0x1ffc), REG2; \ + brgez,pnREG1, FAIL_LABEL; \ +sllx REG2, 1, REG2; \ + brgez,pnREG1, FAIL_LABEL; \ +andn REG1, REG2, REG1; \ + and VADDR, REG2, REG2; \ + brlz,pt REG1, PTE_LABEL;\ +or REG1, REG2, REG1; \ +700: +#else +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ + brz,pn REG1, FAIL_LABEL; \ +nop; +#endif + /* PMD has been loaded into REG1, interpret the value, seeing * if it is a HUGE PMD or a normal one. If it is not valid * then jump to FAIL_LABEL. If it is a HUGE PMD, and it @@ -209,14 +239,14 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; sethi %uhi(_PAGE_PMD_HUGE), REG2; \ sllxREG2, 32, REG2; \ andcc REG1, REG2, %g0;\ - be,pt %xcc, 700f; \ + be,pt %xcc, 701f; \ sethi %hi(4 * 1024 * 1024), REG2; \ brgez,pnREG1, FAIL_LABEL; \ andn REG1, REG2, REG1; \ and VADDR, REG2, REG2; \ brlz,pt REG1, PTE_LABEL;\ or REG1, REG2, REG1; \ -700: +701: #else #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \ brz,pn REG1, FAIL_LABEL; \ @@ -242,6 +272,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end; srlxREG2, 64 - PAGE_SHIFT, REG2; \ andnREG2, 0x7, REG2; \ ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \ + USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \ brz,pn REG1, FAIL_LABEL; \ sllx VADDR, 64 - (PMD_SHIFT + P
[PATCH] sparc64: Fix mapping of 64k pages with MAP_FIXED
An incorrect huge page alignment check caused mmap failure for 64K
pages when MAP_FIXED is used with an address not aligned to HPAGE_SIZE.

Orabug: 25885991

Signed-off-by: Nitin Gupta
---
 arch/sparc/include/asm/hugetlb.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index dcbf985..d1f837d 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -24,9 +24,11 @@ static inline int is_hugepage_only_range(struct mm_struct *mm,
 static inline int prepare_hugepage_range(struct file *file,
 			unsigned long addr, unsigned long len)
 {
-	if (len & ~HPAGE_MASK)
+	struct hstate *h = hstate_file(file);
+
+	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (addr & ~HPAGE_MASK)
+	if (addr & ~huge_page_mask(h))
 		return -EINVAL;
 	return 0;
 }
-- 
2.9.2
[PATCH] sparc64: Fix hugepage page table free
Make sure the start address is aligned to a PMD_SIZE boundary when
freeing the page table backing a hugepage region. The issue was causing
segfaults when a region backed by 64K pages was unmapped, since such a
region is in general not PMD_SIZE aligned.

Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/hugetlbpage.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ee5273a..7c29d38 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -461,6 +461,22 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;

+	addr &= PMD_MASK;
+	if (addr < floor) {
+		addr += PMD_SIZE;
+		if (!addr)
+			return;
+	}
+	if (ceiling) {
+		ceiling &= PMD_MASK;
+		if (!ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		end -= PMD_SIZE;
+	if (addr > end - 1)
+		return;
+
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
-- 
2.9.2
[PATCH] sparc64: Fix memory corruption when THP is enabled
The memory corruption was happening due to incorrect TLB/TSB flushing
of hugepages.

Reported-by: David S. Miller
Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/tlb.c | 6 +++---
 arch/sparc/mm/tsb.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index afda3bb..ee8066c 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -154,7 +154,7 @@ static void tlb_batch_pmd_scan(struct mm_struct *mm, unsigned long vaddr,
 		if (pte_val(*pte) & _PAGE_VALID) {
 			bool exec = pte_exec(*pte);

-			tlb_batch_add_one(mm, vaddr, exec, false);
+			tlb_batch_add_one(mm, vaddr, exec, PAGE_SHIFT);
 		}
 		pte++;
 		vaddr += PAGE_SIZE;
@@ -209,9 +209,9 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			pte_t orig_pte = __pte(pmd_val(orig));
 			bool exec = pte_exec(orig_pte);

-			tlb_batch_add_one(mm, addr, exec, true);
+			tlb_batch_add_one(mm, addr, exec, REAL_HPAGE_SHIFT);
 			tlb_batch_add_one(mm, addr + REAL_HPAGE_SIZE, exec,
-					  true);
+					  REAL_HPAGE_SHIFT);
 		} else {
 			tlb_batch_pmd_scan(mm, addr, orig);
 		}
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index 0a04811..bedf08b 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -122,7 +122,7 @@ void flush_tsb_user(struct tlb_batch *tb)

 	spin_lock_irqsave(&mm->context.lock, flags);

-	if (tb->hugepage_shift < HPAGE_SHIFT) {
+	if (tb->hugepage_shift < REAL_HPAGE_SHIFT) {
 		base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb;
 		nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
@@ -155,7 +155,7 @@ void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr,

 	spin_lock_irqsave(&mm->context.lock, flags);

-	if (hugepage_shift < HPAGE_SHIFT) {
+	if (hugepage_shift < REAL_HPAGE_SHIFT) {
 		base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb;
 		nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
 		if (tlb_type == cheetah_plus || tlb_type == hypervisor)
-- 
2.9.2
[PATCH] sparc64: Add support for 2G hugepages
Signed-off-by: Nitin Gupta
---
 arch/sparc/include/asm/page_64.h | 3 ++-
 arch/sparc/mm/hugetlbpage.c      | 7 +++
 arch/sparc/mm/init_64.c          | 4 
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index f294dd4..5961b2d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 #define HPAGE_SHIFT		23
 #define REAL_HPAGE_SHIFT	22
+#define HPAGE_2GB_SHIFT		31
 #define HPAGE_256MB_SHIFT	28
 #define HPAGE_64K_SHIFT		16
 #define REAL_HPAGE_SIZE		(_AC(1,UL) << REAL_HPAGE_SHIFT)
@@ -27,7 +28,7 @@
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE	(_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE		3
+#define HUGE_MAX_HSTATE		4
 #endif

 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 3016850..ee5273a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -143,6 +143,10 @@ static pte_t sun4v_hugepage_shift_to_tte(pte_t entry, unsigned int shift)
 	pte_val(entry) = pte_val(entry) & ~_PAGE_SZALL_4V;

 	switch (shift) {
+	case HPAGE_2GB_SHIFT:
+		hugepage_size = _PAGE_SZ2GB_4V;
+		pte_val(entry) |= _PAGE_PMD_HUGE;
+		break;
 	case HPAGE_256MB_SHIFT:
 		hugepage_size = _PAGE_SZ256MB_4V;
 		pte_val(entry) |= _PAGE_PMD_HUGE;
@@ -183,6 +187,9 @@ static unsigned int sun4v_huge_tte_to_shift(pte_t entry)
 	unsigned int shift;

 	switch (tte_szbits) {
+	case _PAGE_SZ2GB_4V:
+		shift = HPAGE_2GB_SHIFT;
+		break;
 	case _PAGE_SZ256MB_4V:
 		shift = HPAGE_256MB_SHIFT;
 		break;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index ccd4553..3328043 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -337,6 +337,10 @@ static int __init setup_hugepagesz(char *string)
 	hugepage_shift = ilog2(hugepage_size);

 	switch (hugepage_shift) {
+	case HPAGE_2GB_SHIFT:
+		hv_pgsz_mask = HV_PGSZ_MASK_2GB;
+		hv_pgsz_idx = HV_PGSZ_IDX_2GB;
+		break;
 	case HPAGE_256MB_SHIFT:
 		hv_pgsz_mask = HV_PGSZ_MASK_256MB;
 		hv_pgsz_idx = HV_PGSZ_IDX_256MB;
-- 
2.9.2
[PATCH] sparc64: Fix size check in huge_pte_alloc
Signed-off-by: Nitin Gupta
---
 arch/sparc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 323bc6b..3016850 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -261,7 +261,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	if (!pmd)
 		return NULL;

-	if (sz == PMD_SHIFT)
+	if (sz >= PMD_SIZE)
 		pte = (pte_t *)pmd;
 	else
 		pte = pte_alloc_map(mm, pmd, addr);
-- 
2.9.2
[PATCH] sparc64: Fix build error in flush_tsb_user_page
Patch "sparc64: Add 64K page size support" unconditionally used __flush_huge_tsb_one_entry() which is available only when hugetlb support is enabled. Another issue was incorrect TSB flushing for 64K pages in flush_tsb_user(). Signed-off-by: Nitin Gupta --- arch/sparc/mm/hugetlbpage.c | 5 +++-- arch/sparc/mm/tsb.c | 20 2 files changed, 19 insertions(+), 6 deletions(-) diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index 605bfce..e98a3f2 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -309,7 +309,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, addr &= ~(size - 1); orig = *ptep; - orig_shift = pte_none(orig) ? PAGE_SIZE : huge_tte_to_shift(orig); + orig_shift = pte_none(orig) ? PAGE_SHIFT : huge_tte_to_shift(orig); for (i = 0; i < nptes; i++) ptep[i] = __pte(pte_val(entry) + (i << shift)); @@ -335,7 +335,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, else nptes = size >> PAGE_SHIFT; - hugepage_shift = pte_none(entry) ? PAGE_SIZE : huge_tte_to_shift(entry); + hugepage_shift = pte_none(entry) ? 
PAGE_SHIFT : + huge_tte_to_shift(entry); if (pte_present(entry)) mm->context.hugetlb_pte_count -= nptes; diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c index e39fc57..23479c3 100644 --- a/arch/sparc/mm/tsb.c +++ b/arch/sparc/mm/tsb.c @@ -120,12 +120,18 @@ void flush_tsb_user(struct tlb_batch *tb) spin_lock_irqsave(&mm->context.lock, flags); - if (tb->hugepage_shift == PAGE_SHIFT) { + if (tb->hugepage_shift < HPAGE_SHIFT) { base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb; nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries; if (tlb_type == cheetah_plus || tlb_type == hypervisor) base = __pa(base); - __flush_tsb_one(tb, PAGE_SHIFT, base, nentries); + if (tb->hugepage_shift == PAGE_SHIFT) + __flush_tsb_one(tb, PAGE_SHIFT, base, nentries); +#if defined(CONFIG_HUGETLB_PAGE) + else + __flush_huge_tsb_one(tb, PAGE_SHIFT, base, nentries, +tb->hugepage_shift); +#endif } #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) else if (mm->context.tsb_block[MM_TSB_HUGE].tsb) { @@ -152,8 +158,14 @@ void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr, nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries; if (tlb_type == cheetah_plus || tlb_type == hypervisor) base = __pa(base); - __flush_huge_tsb_one_entry(base, vaddr, PAGE_SHIFT, nentries, - hugepage_shift); + if (hugepage_shift == PAGE_SHIFT) + __flush_tsb_one_entry(base, vaddr, PAGE_SHIFT, + nentries); +#if defined(CONFIG_HUGETLB_PAGE) + else + __flush_huge_tsb_one_entry(base, vaddr, PAGE_SHIFT, + nentries, hugepage_shift); +#endif } #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) else if (mm->context.tsb_block[MM_TSB_HUGE].tsb) { -- 2.9.2
[PATCH] sparc64: Add 64K page size support
This patch depends on: [v6] sparc64: Multi-page size support - Testing Tested on Sonoma by running stream benchmark instance which allocated 48G worth of 64K pages. boot params: default_hugepagesz=64K hugepagesz=64K hugepages=1310720 Signed-off-by: Nitin Gupta --- arch/sparc/include/asm/page_64.h | 3 ++- arch/sparc/mm/hugetlbpage.c | 54 arch/sparc/mm/init_64.c | 4 +++ arch/sparc/mm/tsb.c | 5 ++-- 4 files changed, 52 insertions(+), 14 deletions(-) diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h index d76f38d..f294dd4 100644 --- a/arch/sparc/include/asm/page_64.h +++ b/arch/sparc/include/asm/page_64.h @@ -18,6 +18,7 @@ #define HPAGE_SHIFT23 #define REAL_HPAGE_SHIFT 22 #define HPAGE_256MB_SHIFT 28 +#define HPAGE_64K_SHIFT16 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT) #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) @@ -26,7 +27,7 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT)) -#define HUGE_MAX_HSTATE2 +#define HUGE_MAX_HSTATE3 #endif #ifndef __ASSEMBLY__ diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index 618a568..605bfce 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -149,6 +149,9 @@ static pte_t sun4v_hugepage_shift_to_tte(pte_t entry, unsigned int shift) case HPAGE_SHIFT: pte_val(entry) |= _PAGE_PMD_HUGE; break; + case HPAGE_64K_SHIFT: + hugepage_size = _PAGE_SZ64K_4V; + break; default: WARN_ONCE(1, "unsupported hugepage shift=%u\n", shift); } @@ -185,6 +188,9 @@ static unsigned int sun4v_huge_tte_to_shift(pte_t entry) case _PAGE_SZ4MB_4V: shift = REAL_HPAGE_SHIFT; break; + case _PAGE_SZ64K_4V: + shift = HPAGE_64K_SHIFT; + break; default: shift = PAGE_SHIFT; break; @@ -204,6 +210,9 @@ static unsigned int sun4u_huge_tte_to_shift(pte_t entry) case _PAGE_SZ4MB_4U: shift = REAL_HPAGE_SHIFT; break; + case _PAGE_SZ64K_4U: + 
shift = HPAGE_64K_SHIFT; + break; default: shift = PAGE_SHIFT; break; @@ -241,12 +250,21 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, { pgd_t *pgd; pud_t *pud; + pmd_t *pmd; pte_t *pte = NULL; pgd = pgd_offset(mm, addr); pud = pud_alloc(mm, pgd, addr); - if (pud) - pte = (pte_t *)pmd_alloc(mm, pud, addr); + if (pud) { + pmd = pmd_alloc(mm, pud, addr); + if (!pmd) + return NULL; + + if (sz == PMD_SHIFT) + pte = (pte_t *)pmd; + else + pte = pte_alloc_map(mm, pmd, addr); + } return pte; } @@ -255,42 +273,52 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) { pgd_t *pgd; pud_t *pud; + pmd_t *pmd; pte_t *pte = NULL; pgd = pgd_offset(mm, addr); if (!pgd_none(*pgd)) { pud = pud_offset(pgd, addr); - if (!pud_none(*pud)) - pte = (pte_t *)pmd_offset(pud, addr); + if (!pud_none(*pud)) { + pmd = pmd_offset(pud, addr); + if (!pmd_none(*pmd)) { + if (is_hugetlb_pmd(*pmd)) + pte = (pte_t *)pmd; + else + pte = pte_offset_map(pmd, addr); + } + } } + return pte; } void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t entry) { - unsigned int i, nptes, hugepage_shift; + unsigned int i, nptes, orig_shift, shift; unsigned long size; pte_t orig; size = huge_tte_to_size(entry); - nptes = size >> PMD_SHIFT; + shift = size >= HPAGE_SIZE ? PMD_SHIFT : PAGE_SHIFT; + nptes = size >> shift; if (!pte_present(*ptep) && pte_present(entry)) mm->context.hugetlb_pte_count += nptes; addr &= ~(size - 1); orig = *ptep; - hugepage_shift = pte_none(orig) ? PAGE_SIZE : huge_tte_to_shift(orig); + orig_shift = pte_none(orig) ? PAGE_SIZE : huge_tte_to_shift(orig); for (i = 0; i < nptes; i++) - ptep[i] = __pte(pte_val(entry) + (i << PMD_SHIFT));
[PATCH v6] sparc64: Multi-page size support
Add support for using multiple hugepage sizes simultaneously on
mainline. Currently, support for 256M has been added which can be used
along with 8M pages.

Page tables are set like this (e.g. for 256M page):
	VA + (8M * x) -> PA + (8M * x)	(sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
	VA + (4M * x) -> PA + (4M * x)	(sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each. Boot params used:

	default_hugepagesz=256M hugepagesz=256M hugepages=300
	hugepagesz=8M hugepages=1

Signed-off-by: Nitin Gupta
---
Changelog v6 vs v5:
 - Fix _flush_huge_tsb_one_entry: add correct offset to base vaddr
Changelog v4 vs v5:
 - Enable hugepage initialization on sun4u
Changelog v3 vs v4:
 - Remove incorrect WARN_ON in __flush_huge_tsb_one_entry()
Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)
Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when CONFIG_HUGETLB
   is not defined.
---
 arch/sparc/include/asm/page_64.h     |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 ++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S              |  21 +
 arch/sparc/mm/hugetlbpage.c          | 160 +++
 arch/sparc/mm/init_64.c              |  42 -
 arch/sparc/mm/tlb.c                  |  17 ++--
 arch/sparc/mm/tsb.c                  |  44 --
 8 files changed, 253 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 #define HPAGE_SHIFT        23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE    (_AC(1,UL) << REAL_HPAGE_SHIFT)

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE    2
 #endif

 #ifndef __ASSEMBLY__

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 314b668..7932a4a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+                                struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
     unsigned long mask;

@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)

 static inline pte_t pte_mkhuge(pte_t pte)
 {
-    return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+    return __pte(pte_val(pte) | __pte_default_huge_mask());
 }

-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-    return !!(pte_val(pte) & __pte_huge_mask());
+    unsigned long mask = __pte_default_huge_mask();
+
+    return (pte_val(pte) & mask) == mask;
 }

 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)

 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-           pte_t *ptep, pte_t orig, int fullmm);
+           pte_t *ptep, pte_t orig, int fullmm,
+           unsigned int hugepage_shift);

 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-                pte_t *ptep, pte_t orig, int fullmm)
+                pte_t *ptep, pte_t orig, int fullmm,
+                unsigned int hugepage_shift)
 {
     /* It is more efficient to let flush_tlb_kernel_range()
      * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
      * and SUN4V pte layout, so this inline test is fine.
      */
     if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-        tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+        tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }

 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, pte
[PATCH v5] sparc64: Multi-page size support
Add support for using multiple hugepage sizes simultaneously on
mainline. Currently, support for 256M has been added which can be used
along with 8M pages.

Page tables are set like this (e.g. for 256M page):
    VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
    VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream benchmark
instances in parallel: one instance uses 8M pages and another uses 256M
pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta
---
Changelog v4 vs v5:
 - Enable hugepage initialization on sun4u (this patch has been tested
   only on sun4v).
Changelog v3 vs v4:
 - Remove incorrect WARN_ON in __flush_huge_tsb_one_entry()
Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)
Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when CONFIG_HUGETLB
   is not defined.
---
 arch/sparc/include/asm/page_64.h     |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 ++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S              |  21 +
 arch/sparc/mm/hugetlbpage.c          | 160 +++
 arch/sparc/mm/init_64.c              |  42 -
 arch/sparc/mm/tlb.c                  |  17 ++--
 arch/sparc/mm/tsb.c                  |  44 --
 8 files changed, 253 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 #define HPAGE_SHIFT        23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE    (_AC(1,UL) << REAL_HPAGE_SHIFT)

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE    2
 #endif

 #ifndef __ASSEMBLY__

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 314b668..7932a4a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+                                struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
     unsigned long mask;

@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)

 static inline pte_t pte_mkhuge(pte_t pte)
 {
-    return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+    return __pte(pte_val(pte) | __pte_default_huge_mask());
 }

-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-    return !!(pte_val(pte) & __pte_huge_mask());
+    unsigned long mask = __pte_default_huge_mask();
+
+    return (pte_val(pte) & mask) == mask;
 }

 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)

 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-           pte_t *ptep, pte_t orig, int fullmm);
+           pte_t *ptep, pte_t orig, int fullmm,
+           unsigned int hugepage_shift);

 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-                pte_t *ptep, pte_t orig, int fullmm)
+                pte_t *ptep, pte_t orig, int fullmm,
+                unsigned int hugepage_shift)
 {
     /* It is more efficient to let flush_tlb_kernel_range()
      * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
      * and SUN4V pte layout, so this inline test is fine.
      */
     if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-        tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+        tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }

 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
     pte_t orig = *ptep;

     *ptep = pte;
-    ma
Re: [PATCH v4] sparc64: Multi-page size support
On 12/27/2016 09:34 AM, David Miller wrote:
> From: Nitin Gupta
> Date: Tue, 13 Dec 2016 10:03:18 -0800
>
>> +static unsigned int sun4u_huge_tte_to_shift(pte_t entry)
>> +{
>> +    unsigned long tte_szbits = pte_val(entry) & _PAGE_SZALL_4V;
>> +    unsigned int shift;
>> +
>> +    switch (tte_szbits) {
>> +    case _PAGE_SZ256MB_4U:
>> +        shift = HPAGE_256MB_SHIFT;
>> +        break;
>
> You added all the code necessary to do this on the sun4u chips that
> support 256MB TTEs, so you might as well enable it in the
> initialization code.
>
> I'm pretty sure this is an UltraSPARC-IV and later feature.
>

I added the sun4u related changes just for completeness' sake. I don't
have access to a sun4u machine, so I can't be sure if sun4u would work.
That's why that _PAGE_SZALL_4V typo escaped my notice.

I will enable setup_hugepagesz() for the non-hypervisor case and send
a v5.

Thanks,
Nitin
[PATCH v4] sparc64: Multi-page size support
Add support for using multiple hugepage sizes simultaneously on
mainline. Currently, support for 256M has been added which can be used
along with 8M pages.

Page tables are set like this (e.g. for 256M page):
    VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
    VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream benchmark
instances in parallel: one instance uses 8M pages and another uses 256M
pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta
---
Changelog v3 vs v4:
 - Remove incorrect WARN_ON in __flush_huge_tsb_one_entry()
Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)
Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when CONFIG_HUGETLB
   is not defined.
---
 arch/sparc/include/asm/page_64.h     |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 ++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S              |  21 +
 arch/sparc/mm/hugetlbpage.c          | 160 +++
 arch/sparc/mm/init_64.c              |  45 +-
 arch/sparc/mm/tlb.c                  |  17 ++--
 arch/sparc/mm/tsb.c                  |  44 --
 8 files changed, 256 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 #define HPAGE_SHIFT        23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE    (_AC(1,UL) << REAL_HPAGE_SHIFT)

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE    2
 #endif

 #ifndef __ASSEMBLY__

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 314b668..7932a4a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+                                struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
     unsigned long mask;

@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)

 static inline pte_t pte_mkhuge(pte_t pte)
 {
-    return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+    return __pte(pte_val(pte) | __pte_default_huge_mask());
 }

-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-    return !!(pte_val(pte) & __pte_huge_mask());
+    unsigned long mask = __pte_default_huge_mask();
+
+    return (pte_val(pte) & mask) == mask;
 }

 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)

 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-           pte_t *ptep, pte_t orig, int fullmm);
+           pte_t *ptep, pte_t orig, int fullmm,
+           unsigned int hugepage_shift);

 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-                pte_t *ptep, pte_t orig, int fullmm)
+                pte_t *ptep, pte_t orig, int fullmm,
+                unsigned int hugepage_shift)
 {
     /* It is more efficient to let flush_tlb_kernel_range()
      * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
      * and SUN4V pte layout, so this inline test is fine.
      */
     if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-        tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+        tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }

 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
     pte_t orig = *ptep;

     *ptep = pte;
-    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PA
Re: [PATCH v3] sparc64: Multi-page size support
On 12/11/2016 06:14 PM, David Miller wrote:
> From: David Miller
> Date: Sun, 11 Dec 2016 21:06:30 -0500 (EST)
>
>> Applied.
>
> Actually, I'm reverting.
>
> Just doing a simple "make -s -j128" kernel build on a T4-2 I'm
> getting kernel log warnings:
>
> [2024810.925975] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [2024909.011397] random: crng init done
> [2024970.860642] [ cut here ]
> [2024970.869970] WARNING: CPU: 85 PID: 19335 at arch/sparc/mm/tsb.c:99 __flush_huge_tsb_one_entry.constprop.0+0x30/0x74
> [2024970.890947] Modules linked in: ipv6 usb_storage loop ehci_pci sg sr_mod igb ptp pps_core ehci_hcd n2_rng rng_core
> [2024970.911785] CPU: 85 PID: 19335 Comm: ld Not tainted 4.9.0+ #9
> [2024970.923588] Call Trace:
> [2024970.928807]  [00463a2c] __warn+0xb0/0xc8
> [2024970.938349]  [0044efa4] __flush_huge_tsb_one_entry.constprop.0+0x30/0x74
> [2024970.953463]  [0044f224] flush_tsb_user_page+0x88/0x9c
> [2024970.965268]  [0044eabc] tlb_batch_add_one+0x5c/0xd4
> [2024970.976722]  [0044ed84] set_pmd_at+0x104/0x184
> [2024970.987329]  [00552594] zap_huge_pmd+0x30/0x244
> [2024970.998097]  [0052a7a8] unmap_page_range+0x18c/0x794
> [2024971.009721]  [0052b05c] unmap_vmas+0x18/0x44
> [2024971.019976]  [005315f8] exit_mmap+0x94/0x114
> [2024971.030207]  [00461930] mmput+0x50/0xf8
> [2024971.039593]  [004674ac] do_exit+0x310/0x904
> [2024971.049651]  [00467c10] do_group_exit+0x80/0xbc
> [2024971.060415]  [00467c60] SyS_exit_group+0x14/0x20
> [2024971.071363]  [00406194] linux_sparc_syscall32+0x34/0x60
>
> which is:
>
> #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
> static void __flush_huge_tsb_one_entry(unsigned long tsb, unsigned long v,
>                                        unsigned long hash_shift,
>                                        unsigned long nentries,
>                                        unsigned int hugepage_shift)
> {
>     unsigned int hpage_entries;
>     unsigned int i;
>
>     hpage_entries = 1 << (hugepage_shift - REAL_HPAGE_SHIFT);
>     WARN_ON(v & ((1ul << hugepage_shift) - 1));
>     ^^^

The warning is getting triggered since we do 'vaddr |= 1' in
tlb_batch_add_one() in case the addr is executable. I originally tested
the patch with stream, which allocates pages only for the heap. Somehow,
I cannot reproduce it (on a Sonoma) even when I back text segments with
hugepages using:

hugectl --force-preload --heap=8m --text=8m make -s -j128

I added this WARN_ON during debugging and it can simply be removed. Do
you want to see a v4 with this warning removed, or can you reapply with
just this change?

Thanks,
Nitin
[PATCH v3] sparc64: Multi-page size support
Add support for using multiple hugepage sizes simultaneously on
mainline. Currently, support for 256M has been added which can be used
along with 8M pages.

Page tables are set like this (e.g. for 256M page):
    VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
    VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream benchmark
instances in parallel: one instance uses 8M pages and another uses 256M
pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta
---
Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)
Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when CONFIG_HUGETLB
   is not defined.
---
 arch/sparc/include/asm/page_64.h     |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 ++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S              |  21 +
 arch/sparc/mm/hugetlbpage.c          | 160 +++
 arch/sparc/mm/init_64.c              |  45 +-
 arch/sparc/mm/tlb.c                  |  17 ++--
 arch/sparc/mm/tsb.c                  |  45 --
 8 files changed, 257 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 #define HPAGE_SHIFT        23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE    (_AC(1,UL) << REAL_HPAGE_SHIFT)

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE    2
 #endif

 #ifndef __ASSEMBLY__

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1fb317f..96005b0 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+                                struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
     unsigned long mask;

@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)

 static inline pte_t pte_mkhuge(pte_t pte)
 {
-    return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+    return __pte(pte_val(pte) | __pte_default_huge_mask());
 }

-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-    return !!(pte_val(pte) & __pte_huge_mask());
+    unsigned long mask = __pte_default_huge_mask();
+
+    return (pte_val(pte) & mask) == mask;
 }

 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)

 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-           pte_t *ptep, pte_t orig, int fullmm);
+           pte_t *ptep, pte_t orig, int fullmm,
+           unsigned int hugepage_shift);

 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-                pte_t *ptep, pte_t orig, int fullmm)
+                pte_t *ptep, pte_t orig, int fullmm,
+                unsigned int hugepage_shift)
 {
     /* It is more efficient to let flush_tlb_kernel_range()
      * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
      * and SUN4V pte layout, so this inline test is fine.
      */
     if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-        tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+        tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }

 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
     pte_t orig = *ptep;

     *ptep = pte;
-    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }

 #define set_pte_at(mm,addr,ptep,pte)    \

diff --git a/arch/s
[PATCH] sparc64: Make SLUB the default allocator
SLUB has better debugging support.

Signed-off-by: Nitin Gupta
---
 arch/sparc/configs/sparc64_defconfig | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/configs/sparc64_defconfig b/arch/sparc/configs/sparc64_defconfig
index 3583d67..0a615b0 100644
--- a/arch/sparc/configs/sparc64_defconfig
+++ b/arch/sparc/configs/sparc64_defconfig
@@ -7,7 +7,9 @@ CONFIG_LOG_BUF_SHIFT=18
 CONFIG_BLK_DEV_INITRD=y
 CONFIG_PERF_EVENTS=y
 # CONFIG_COMPAT_BRK is not set
-CONFIG_SLAB=y
+CONFIG_SLUB_DEBUG=y
+CONFIG_SLUB=y
+CONFIG_SLUB_CPU_PARTIAL=y
 CONFIG_PROFILING=y
 CONFIG_OPROFILE=m
 CONFIG_KPROBES=y
--
2.9.2
[PATCH v2] sparc64: Multi-page size support
Add support for using multiple hugepage sizes simultaneously on
mainline. Currently, support for 256M has been added which can be used
along with 8M pages.

Page tables are set like this (e.g. for 256M page):
    VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
    VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream benchmark
instances in parallel: one instance uses 8M pages and another uses 256M
pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta
---
Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when CONFIG_HUGETLB
   is not defined.

 arch/sparc/include/asm/page_64.h     |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 ++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S              |  21 +
 arch/sparc/mm/hugetlbpage.c          | 158 +++
 arch/sparc/mm/init_64.c              |  45 +-
 arch/sparc/mm/tlb.c                  |  17 ++--
 arch/sparc/mm/tsb.c                  |  44 --
 8 files changed, 254 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 #define HPAGE_SHIFT        23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE    (_AC(1,UL) << REAL_HPAGE_SHIFT)

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE    2
 #endif

 #ifndef __ASSEMBLY__

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1fb317f..96005b0 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+                                struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
     unsigned long mask;

@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)

 static inline pte_t pte_mkhuge(pte_t pte)
 {
-    return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+    return __pte(pte_val(pte) | __pte_default_huge_mask());
 }

-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-    return !!(pte_val(pte) & __pte_huge_mask());
+    unsigned long mask = __pte_default_huge_mask();
+
+    return (pte_val(pte) & mask) == mask;
 }

 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)

 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-           pte_t *ptep, pte_t orig, int fullmm);
+           pte_t *ptep, pte_t orig, int fullmm,
+           unsigned int hugepage_shift);

 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-                pte_t *ptep, pte_t orig, int fullmm)
+                pte_t *ptep, pte_t orig, int fullmm,
+                unsigned int hugepage_shift)
 {
     /* It is more efficient to let flush_tlb_kernel_range()
      * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
      * and SUN4V pte layout, so this inline test is fine.
      */
     if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-        tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+        tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }

 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
     pte_t orig = *ptep;

     *ptep = pte;
-    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }

 #define set_pte_at(mm,addr,ptep,pte)    \

diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index a8e192e..54be88a 100644
--- a/arch/sparc/include/asm/
[PATCH] sparc64: Multi-page size support
Add support for using multiple hugepage sizes simultaneously on
mainline. Currently, support for 256M has been added which can be used
along with 8M pages.

Page tables are set like this (e.g. for 256M page):
    VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
    VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream benchmark
instances in parallel: one instance uses 8M pages and another uses 256M
pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta
---
 arch/sparc/include/asm/page_64.h     |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 ++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S              |  21 +
 arch/sparc/mm/hugetlbpage.c          | 158 +++
 arch/sparc/mm/init_64.c              |  45 +-
 arch/sparc/mm/tlb.c                  |  17 ++--
 arch/sparc/mm/tsb.c                  |  42 --
 8 files changed, 252 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 #define HPAGE_SHIFT        23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE    (_AC(1,UL) << REAL_HPAGE_SHIFT)

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE    2
 #endif

 #ifndef __ASSEMBLY__

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1fb317f..96005b0 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached

 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+                                struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
     unsigned long mask;

@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)

 static inline pte_t pte_mkhuge(pte_t pte)
 {
-    return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+    return __pte(pte_val(pte) | __pte_default_huge_mask());
 }

-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-    return !!(pte_val(pte) & __pte_huge_mask());
+    unsigned long mask = __pte_default_huge_mask();
+
+    return (pte_val(pte) & mask) == mask;
 }

 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)

 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-           pte_t *ptep, pte_t orig, int fullmm);
+           pte_t *ptep, pte_t orig, int fullmm,
+           unsigned int hugepage_shift);

 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-                pte_t *ptep, pte_t orig, int fullmm)
+                pte_t *ptep, pte_t orig, int fullmm,
+                unsigned int hugepage_shift)
 {
     /* It is more efficient to let flush_tlb_kernel_range()
      * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
      * and SUN4V pte layout, so this inline test is fine.
      */
     if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-        tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+        tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }

 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
     pte_t orig = *ptep;

     *ptep = pte;
-    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+    maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }

 #define set_pte_at(mm,addr,ptep,pte)    \

diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index a8e192e..54be88a 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -8,7 +8,7 @@

 #define TLB_BATCH_NR 192
[PATCH v2] sparc64: Trim page tables for 8M hugepages
For PMD aligned (8M) hugepages, we currently allocate all four page
table levels which is wasteful. We now allocate till PMD level only,
which saves memory usage from page tables.

Also, when freeing page table for 8M hugepage backed region, make sure
we don't try to access non-existent PTE level.

Orabug: 22630259

Signed-off-by: Nitin Gupta
---
Changelog v2 vs v1:
 - Combine fix for page table freeing with the main trimming patch (Dave)

 arch/sparc/include/asm/hugetlb.h    |  12 +--
 arch/sparc/include/asm/pgtable_64.h |   7 ++-
 arch/sparc/include/asm/tsb.h        |   2 +-
 arch/sparc/mm/fault_64.c            |   4 +-
 arch/sparc/mm/hugetlbpage.c         | 166 +++---
 arch/sparc/mm/init_64.c             |   5 +-
 6 files changed, 129 insertions(+), 67 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index 139e711..dcbf985 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -31,14 +31,6 @@ static inline int prepare_hugepage_range(struct file *file,
     return 0;
 }

-static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
-                      unsigned long addr, unsigned long end,
-                      unsigned long floor,
-                      unsigned long ceiling)
-{
-    free_pgd_range(tlb, addr, end, floor, ceiling);
-}
-
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
                      unsigned long addr, pte_t *ptep)
 {
@@ -82,4 +74,8 @@ static inline void arch_clear_hugepage_flags(struct page *page)
 {
 }

+void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
+               unsigned long end, unsigned long floor,
+               unsigned long ceiling);
+
 #endif /* _ASM_SPARC64_HUGETLB_H */

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index e7d8280..1fb317f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -395,7 +395,7 @@ static inline unsigned long __pte_huge_mask(void)

 static inline pte_t pte_mkhuge(pte_t pte)
 {
-    return __pte(pte_val(pte) | __pte_huge_mask());
+    return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
 }

 static inline bool is_hugetlb_pte(pte_t pte)
@@ -403,6 +403,11 @@ static inline bool is_hugetlb_pte(pte_t pte)
     return !!(pte_val(pte) & __pte_huge_mask());
 }

+static inline bool is_hugetlb_pmd(pmd_t pmd)
+{
+    return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {

diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index c6a155c..32258e0 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -203,7 +203,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
  * We have to propagate the 4MB bit of the virtual address
  * because we are fabricating 8MB pages using 4MB hw pages.
  */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
 #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
     brz,pn REG1, FAIL_LABEL; \
      sethi %uhi(_PAGE_PMD_HUGE), REG2; \

diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 6c43b92..575ecfe 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -111,8 +111,8 @@ static unsigned int get_user_insn(unsigned long tpc)
     if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
         goto out_irq_enable;

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-    if (pmd_trans_huge(*pmdp)) {
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+    if (is_hugetlb_pmd(*pmdp)) {
         pa  = pmd_pfn(*pmdp) << PAGE_SHIFT;
         pa += tpc & ~HPAGE_MASK;

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ba52e64..494c390 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -131,23 +132,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 {
     pgd_t *pgd;
     pud_t *pud;
-    pmd_t *pmd;
     pte_t *pte = NULL;

-    /* We must align the address, because our caller will run
-     * set_huge_pte_at() on whatever we return, which writes out
-     * all of the sub-ptes for the hugepage range.  So we have
-     * to give it the first such sub-pte.
-     */
-    addr &= HPAGE_MASK;
-
     pgd = pgd_offset(mm, addr);
     pud = pud_alloc(mm, pgd, addr);
-    if (pud) {
-        pmd = pmd_alloc(mm, pud, addr);
-        if (pmd)
-            pte =
[PATCH v2 2/2] sparc64: Fix pagetable freeing for hugepage regions
8M pages now allocate page tables till PMD level only. So, when freeing
page table for 8M hugepage backed region, make sure we don't try to
access non-existent PTE level.

Signed-off-by: Nitin Gupta
---
 arch/sparc/include/asm/hugetlb.h | 12 ++---
 arch/sparc/mm/hugetlbpage.c      | 98 ++
 2 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index 139e711..dcbf985 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -31,14 +31,6 @@ static inline int prepare_hugepage_range(struct file *file,
     return 0;
 }

-static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
-                      unsigned long addr, unsigned long end,
-                      unsigned long floor,
-                      unsigned long ceiling)
-{
-    free_pgd_range(tlb, addr, end, floor, ceiling);
-}
-
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
                      unsigned long addr, pte_t *ptep)
 {
@@ -82,4 +74,8 @@ static inline void arch_clear_hugepage_flags(struct page *page)
 {
 }

+void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
+               unsigned long end, unsigned long floor,
+               unsigned long ceiling);
+
 #endif /* _ASM_SPARC64_HUGETLB_H */

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index cafb5ca..494c390 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -202,3 +203,100 @@ int pud_huge(pud_t pud)
 {
     return 0;
 }
+
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+                   unsigned long addr)
+{
+    pgtable_t token = pmd_pgtable(*pmd);
+
+    pmd_clear(pmd);
+    pte_free_tlb(tlb, token, addr);
+    atomic_long_dec(&tlb->mm->nr_ptes);
+}
+
+static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+                   unsigned long addr, unsigned long end,
+                   unsigned long floor, unsigned long ceiling)
+{
+    pmd_t *pmd;
+    unsigned long next;
+    unsigned long start;
+
+    start = addr;
+    pmd = pmd_offset(pud, addr);
+    do {
+        next = pmd_addr_end(addr, end);
+        if (pmd_none(*pmd))
+            continue;
+        if (is_hugetlb_pmd(*pmd))
+            pmd_clear(pmd);
+        else
+            hugetlb_free_pte_range(tlb, pmd, addr);
+    } while (pmd++, addr = next, addr != end);
+
+    start &= PUD_MASK;
+    if (start < floor)
+        return;
+    if (ceiling) {
+        ceiling &= PUD_MASK;
+        if (!ceiling)
+            return;
+    }
+    if (end - 1 > ceiling - 1)
+        return;
+
+    pmd = pmd_offset(pud, start);
+    pud_clear(pud);
+    pmd_free_tlb(tlb, pmd, start);
+    mm_dec_nr_pmds(tlb->mm);
+}
+
+static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+                   unsigned long addr, unsigned long end,
+                   unsigned long floor, unsigned long ceiling)
+{
+    pud_t *pud;
+    unsigned long next;
+    unsigned long start;
+
+    start = addr;
+    pud = pud_offset(pgd, addr);
+    do {
+        next = pud_addr_end(addr, end);
+        if (pud_none_or_clear_bad(pud))
+            continue;
+        hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
+                       ceiling);
+    } while (pud++, addr = next, addr != end);
+
+    start &= PGDIR_MASK;
+    if (start < floor)
+        return;
+    if (ceiling) {
+        ceiling &= PGDIR_MASK;
+        if (!ceiling)
+            return;
+    }
+    if (end - 1 > ceiling - 1)
+        return;
+
+    pud = pud_offset(pgd, start);
+    pgd_clear(pgd);
+    pud_free_tlb(tlb, pud, start);
+}
+
+void hugetlb_free_pgd_range(struct mmu_gather *tlb,
+                unsigned long addr, unsigned long end,
+                unsigned long floor, unsigned long ceiling)
+{
+    pgd_t *pgd;
+    unsigned long next;
+
+    pgd = pgd_offset(tlb->mm, addr);
+    do {
+        next = pgd_addr_end(addr, end);
+        if (pgd_none_or_clear_bad(pgd))
+            continue;
+        hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+    } while (pgd++, addr = next, addr != end);
+}
--
1.7.1
[PATCH v2 1/2] sparc64: Trim page tables for 8M hugepages
For PMD aligned (8M) hugepages, we currently allocate all four page
table levels which is wasteful. We now allocate till PMD level only,
which saves memory usage from page tables.

Orabug: 22630259

Signed-off-by: Nitin Gupta
---
Changelog v2 vs v1:
 - Move sparc specific declaration of hugetlb_free_pgd_range
   to arch specific hugetlb.h header.

 arch/sparc/include/asm/pgtable_64.h |  7 +++-
 arch/sparc/include/asm/tsb.h        |  2 +-
 arch/sparc/mm/fault_64.c            |  4 +-
 arch/sparc/mm/hugetlbpage.c         | 68 +++---
 arch/sparc/mm/init_64.c             |  5 ++-
 5 files changed, 27 insertions(+), 59 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index e7d8280..1fb317f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -395,7 +395,7 @@ static inline unsigned long __pte_huge_mask(void)
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-	return __pte(pte_val(pte) | __pte_huge_mask());
+	return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
 }
 
 static inline bool is_hugetlb_pte(pte_t pte)
@@ -403,6 +403,11 @@ static inline bool is_hugetlb_pte(pte_t pte)
 	return !!(pte_val(pte) & __pte_huge_mask());
 }
 
+static inline bool is_hugetlb_pmd(pmd_t pmd)
+{
+	return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index c6a155c..32258e0 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -203,7 +203,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
  * We have to propagate the 4MB bit of the virtual address
  * because we are fabricating 8MB pages using 4MB hw pages.
  */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
 #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
 	brz,pn REG1, FAIL_LABEL; \
 	sethi %uhi(_PAGE_PMD_HUGE), REG2; \
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a3..ff3f9f9 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -111,8 +111,8 @@ static unsigned int get_user_insn(unsigned long tpc)
 	if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
 		goto out_irq_enable;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pmd_trans_huge(*pmdp)) {
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+	if (is_hugetlb_pmd(*pmdp)) {
 		pa = pmd_pfn(*pmdp) << PAGE_SHIFT;
 		pa += tpc & ~HPAGE_MASK;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ba52e64..cafb5ca 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -131,23 +131,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 {
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd;
 	pte_t *pte = NULL;
 
-	/* We must align the address, because our caller will run
-	 * set_huge_pte_at() on whatever we return, which writes out
-	 * all of the sub-ptes for the hugepage range.  So we have
-	 * to give it the first such sub-pte.
-	 */
-	addr &= HPAGE_MASK;
-
 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
-	if (pud) {
-		pmd = pmd_alloc(mm, pud, addr);
-		if (pmd)
-			pte = pte_alloc_map(mm, pmd, addr);
-	}
+	if (pud)
+		pte = (pte_t *)pmd_alloc(mm, pud, addr);
+
 	return pte;
 }
 
@@ -155,19 +145,13 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd;
 	pte_t *pte = NULL;
 
-	addr &= HPAGE_MASK;
-
 	pgd = pgd_offset(mm, addr);
 	if (!pgd_none(*pgd)) {
 		pud = pud_offset(pgd, addr);
-		if (!pud_none(*pud)) {
-			pmd = pmd_offset(pud, addr);
-			if (!pmd_none(*pmd))
-				pte = pte_offset_map(pmd, addr);
-		}
+		if (!pud_none(*pud))
+			pte = (pte_t *)pmd_offset(pud, addr);
 	}
 	return pte;
 }
 
@@ -175,67 +159,43 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, pte_t entry)
 {
-	int i;
-	pte_t orig[2];
-	unsigned long nptes;
+	pte_t orig;
 
 	if (!pte_present(*ptep) && pte_present(entry))
 		mm->context.huge_pte_count++;
 
 	addr &= HPAGE_MASK;
-
-	nptes = 1 << HUGETLB_PAGE_ORDER;
-	orig[0] = *ptep;
-	orig[
[PATCH 2/2] sparc64: Fix pagetable freeing for hugepage regions
8M pages now allocate page tables till PMD level only. So, when freeing
page tables for an 8M hugepage backed region, make sure we don't try to
access the non-existent PTE level.

Signed-off-by: Nitin Gupta
---
 arch/sparc/include/asm/hugetlb.h |  8
 arch/sparc/mm/hugetlbpage.c      | 98
 include/linux/hugetlb.h          |  4 ++
 3 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index 139e711..1a6708c 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -31,14 +31,6 @@ static inline int prepare_hugepage_range(struct file *file,
 	return 0;
 }
 
-static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
-					  unsigned long addr, unsigned long end,
-					  unsigned long floor,
-					  unsigned long ceiling)
-{
-	free_pgd_range(tlb, addr, end, floor, ceiling);
-}
-
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index cafb5ca..494c390 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -202,3 +203,100 @@ int pud_huge(pud_t pud)
 {
 	return 0;
 }
+
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+				   unsigned long addr)
+{
+	pgtable_t token = pmd_pgtable(*pmd);
+
+	pmd_clear(pmd);
+	pte_free_tlb(tlb, token, addr);
+	atomic_long_dec(&tlb->mm->nr_ptes);
+}
+
+static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+				   unsigned long addr, unsigned long end,
+				   unsigned long floor, unsigned long ceiling)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long start;
+
+	start = addr;
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+		if (is_hugetlb_pmd(*pmd))
+			pmd_clear(pmd);
+		else
+			hugetlb_free_pte_range(tlb, pmd, addr);
+	} while (pmd++, addr = next, addr != end);
+
+	start &= PUD_MASK;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= PUD_MASK;
+		if (!ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
+
+	pmd = pmd_offset(pud, start);
+	pud_clear(pud);
+	pmd_free_tlb(tlb, pmd, start);
+	mm_dec_nr_pmds(tlb->mm);
+}
+
+static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+				   unsigned long addr, unsigned long end,
+				   unsigned long floor, unsigned long ceiling)
+{
+	pud_t *pud;
+	unsigned long next;
+	unsigned long start;
+
+	start = addr;
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
+				       ceiling);
+	} while (pud++, addr = next, addr != end);
+
+	start &= PGDIR_MASK;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= PGDIR_MASK;
+		if (!ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
+
+	pud = pud_offset(pgd, start);
+	pgd_clear(pgd);
+	pud_free_tlb(tlb, pud, start);
+}
+
+void hugetlb_free_pgd_range(struct mmu_gather *tlb,
+			    unsigned long addr, unsigned long end,
+			    unsigned long floor, unsigned long ceiling)
+{
+	pgd_t *pgd;
+	unsigned long next;
+
+	pgd = pgd_offset(tlb->mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+	} while (pgd++, addr = next, addr != end);
+}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c26d463..4461309 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -120,6 +120,10 @@ int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot);
 
+void hugetlb_free_pgd
[PATCH 1/2] sparc64: Trim page tables for 8M hugepages
For PMD aligned (8M) hugepages, we currently allocate all four page
table levels which is wasteful. We now allocate till PMD level only,
which saves memory usage from page tables.

Orabug: 22630259

Signed-off-by: Nitin Gupta
---
 arch/sparc/include/asm/pgtable_64.h |  7 +++-
 arch/sparc/include/asm/tsb.h        |  2 +-
 arch/sparc/mm/fault_64.c            |  4 +--
 arch/sparc/mm/hugetlbpage.c         | 68 -
 arch/sparc/mm/init_64.c             |  5 ++-
 5 files changed, 27 insertions(+), 59 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index e7d8280..1fb317f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -395,7 +395,7 @@ static inline unsigned long __pte_huge_mask(void)
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-	return __pte(pte_val(pte) | __pte_huge_mask());
+	return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
 }
 
 static inline bool is_hugetlb_pte(pte_t pte)
@@ -403,6 +403,11 @@ static inline bool is_hugetlb_pte(pte_t pte)
 	return !!(pte_val(pte) & __pte_huge_mask());
 }
 
+static inline bool is_hugetlb_pmd(pmd_t pmd)
+{
+	return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index c6a155c..32258e0 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -203,7 +203,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
  * We have to propagate the 4MB bit of the virtual address
  * because we are fabricating 8MB pages using 4MB hw pages.
  */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
 #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
 	brz,pn REG1, FAIL_LABEL; \
 	sethi %uhi(_PAGE_PMD_HUGE), REG2; \
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a3..ff3f9f9 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -111,8 +111,8 @@ static unsigned int get_user_insn(unsigned long tpc)
 	if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
 		goto out_irq_enable;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pmd_trans_huge(*pmdp)) {
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+	if (is_hugetlb_pmd(*pmdp)) {
 		pa = pmd_pfn(*pmdp) << PAGE_SHIFT;
 		pa += tpc & ~HPAGE_MASK;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ba52e64..cafb5ca 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -131,23 +131,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 {
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd;
 	pte_t *pte = NULL;
 
-	/* We must align the address, because our caller will run
-	 * set_huge_pte_at() on whatever we return, which writes out
-	 * all of the sub-ptes for the hugepage range.  So we have
-	 * to give it the first such sub-pte.
-	 */
-	addr &= HPAGE_MASK;
-
 	pgd = pgd_offset(mm, addr);
 	pud = pud_alloc(mm, pgd, addr);
-	if (pud) {
-		pmd = pmd_alloc(mm, pud, addr);
-		if (pmd)
-			pte = pte_alloc_map(mm, pmd, addr);
-	}
+	if (pud)
+		pte = (pte_t *)pmd_alloc(mm, pud, addr);
+
 	return pte;
 }
 
@@ -155,19 +145,13 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd;
 	pte_t *pte = NULL;
 
-	addr &= HPAGE_MASK;
-
 	pgd = pgd_offset(mm, addr);
 	if (!pgd_none(*pgd)) {
 		pud = pud_offset(pgd, addr);
-		if (!pud_none(*pud)) {
-			pmd = pmd_offset(pud, addr);
-			if (!pmd_none(*pmd))
-				pte = pte_offset_map(pmd, addr);
-		}
+		if (!pud_none(*pud))
+			pte = (pte_t *)pmd_offset(pud, addr);
 	}
 	return pte;
 }
 
@@ -175,67 +159,43 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, pte_t entry)
 {
-	int i;
-	pte_t orig[2];
-	unsigned long nptes;
+	pte_t orig;
 
 	if (!pte_present(*ptep) && pte_present(entry))
 		mm->context.huge_pte_count++;
 
 	addr &= HPAGE_MASK;
-
-	nptes = 1 << HUGETLB_PAGE_ORDER;
-	orig[0] = *ptep;
-	orig[1] = *(ptep + nptes / 2);
-	for (i = 0; i < nptes; i++) {
-		*ptep = entry;
-		ptep++;
-
[PATCH v4] sparc64: Reduce TLB flushes during hugepte changes
During hugepage map/unmap, TSB and TLB flushes are currently issued at
every PAGE_SIZE'd boundary which is unnecessary. We now issue the flush
at REAL_HPAGE_SIZE boundaries only.

Without this patch, workloads which unmap a large hugepage backed VMA
region get CPU lockups due to excessive TLB flush calls.

Orabug: 22365539, 22643230, 22995196

Signed-off-by: Nitin Gupta
---
Changelog v4 vs v3:
 - Fix build error when CONFIG_HUGETLB_PAGE is not defined
 - Tested build with randconfig, allyesconfig, allnoconfig
Changelog v3 vs v2:
 - Changed patch title to reflect that both map/unmap cases are affected.
 - Don't do TLB flush if original PTE wasn't valid (DaveM)
 - Use tlb_batch_add() instead of directly calling TLB flush function.
   This routine also flushes dcache (needed by older sparcs) (DaveM)
Changelog v1 vs v2:
 - Access PTEs in order (David Miller)
 - Issue TLB and TSB flush after clearing PTEs (David Miller)
---
 arch/sparc/include/asm/pgtable_64.h  | 43 +
 arch/sparc/include/asm/tlbflush_64.h |  3 +-
 arch/sparc/mm/hugetlbpage.c          | 33 ++
 arch/sparc/mm/init_64.c              | 12 -
 arch/sparc/mm/tlb.c                  | 25 ++-
 arch/sparc/mm/tsb.c                  | 32 +
 6 files changed, 97 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index f089cfa..5a189bf 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,7 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline pte_t pte_mkhuge(pte_t pte)
+static inline unsigned long __pte_huge_mask(void)
 {
 	unsigned long mask;
 
@@ -390,8 +390,19 @@ static inline pte_t pte_mkhuge(pte_t pte)
 	: "=r" (mask)
 	: "i" (_PAGE_SZHUGE_4U), "i" (_PAGE_SZHUGE_4V));
 
-	return __pte(pte_val(pte) | mask);
+	return mask;
+}
+
+static inline pte_t pte_mkhuge(pte_t pte)
+{
+	return __pte(pte_val(pte) | __pte_huge_mask());
+}
+
+static inline bool is_hugetlb_pte(pte_t pte)
+{
+	return !!(pte_val(pte) & __pte_huge_mask());
 }
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
@@ -403,6 +414,11 @@ static inline pmd_t pmd_mkhuge(pmd_t pmd)
 	return __pmd(pte_val(pte));
 }
 #endif
+#else
+static inline bool is_hugetlb_pte(pte_t pte)
+{
+	return false;
+}
 #endif
 
 static inline pte_t pte_mkdirty(pte_t pte)
@@ -858,6 +874,19 @@ static inline unsigned long pud_pfn(pud_t pud)
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
 		   pte_t *ptep, pte_t orig, int fullmm);
 
+static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+				pte_t *ptep, pte_t orig, int fullmm)
+{
+	/* It is more efficient to let flush_tlb_kernel_range()
+	 * handle init_mm tlb flushes.
+	 *
+	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
+	 * and SUN4V pte layout, so this inline test is fine.
+	 */
+	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
+		tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+}
+
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr,
@@ -874,15 +903,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	pte_t orig = *ptep;
 
 	*ptep = pte;
-
-	/* It is more efficient to let flush_tlb_kernel_range()
-	 * handle init_mm tlb flushes.
-	 *
-	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
-	 * and SUN4V pte layout, so this inline test is fine.
-	 */
-	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-		tlb_batch_add(mm, addr, ptep, orig, fullmm);
+	maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
 #define set_pte_at(mm,addr,ptep,pte)	\
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index dea1cfa..a8e192e 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -8,6 +8,7 @@
 #define TLB_BATCH_NR	192
 
 struct tlb_batch {
+	bool huge;
 	struct mm_struct *mm;
 	unsigned long tlb_nr;
 	unsigned long active;
@@ -16,7 +17,7 @@ struct tlb_batch {
 
 void flush_tsb_kernel_range(unsigned long start, unsigned long end);
 void flush_tsb_user(struct tlb_batch *tb);
-void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr);
+void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr, bool huge);
 
 /* TLB flu
[PATCH v3] sparc64: Reduce TLB flushes during hugepte changes
During hugepage map/unmap, TSB and TLB flushes are currently issued at
every PAGE_SIZE'd boundary which is unnecessary. We now issue the flush
at REAL_HPAGE_SIZE boundaries only.

Without this patch, workloads which unmap a large hugepage backed VMA
region get CPU lockups due to excessive TLB flush calls.

Orabug: 22365539, 22643230, 22995196

Signed-off-by: Nitin Gupta
---
Changelog v3 vs v2:
 - Changed patch title to reflect that both map/unmap cases are affected.
 - Don't do TLB flush if original PTE wasn't valid (DaveM)
 - Use tlb_batch_add() instead of directly calling TLB flush function.
   This routine also flushes dcache (needed by older sparcs) (DaveM)
Changelog v1 vs v2:
 - Access PTEs in order (David Miller)
 - Issue TLB and TSB flush after clearing PTEs (David Miller)
---
 arch/sparc/include/asm/pgtable_64.h  | 38 +---
 arch/sparc/include/asm/tlbflush_64.h |  3 ++-
 arch/sparc/mm/hugetlbpage.c          | 33 ++-
 arch/sparc/mm/init_64.c              | 12
 arch/sparc/mm/tlb.c                  | 18 +
 arch/sparc/mm/tsb.c                  | 32 --
 6 files changed, 88 insertions(+), 48 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 7a38d6a..0e706b8 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,7 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline pte_t pte_mkhuge(pte_t pte)
+static inline unsigned long __pte_huge_mask(void)
 {
 	unsigned long mask;
 
@@ -390,8 +390,19 @@ static inline pte_t pte_mkhuge(pte_t pte)
 	: "=r" (mask)
 	: "i" (_PAGE_SZHUGE_4U), "i" (_PAGE_SZHUGE_4V));
 
-	return __pte(pte_val(pte) | mask);
+	return mask;
+}
+
+static inline pte_t pte_mkhuge(pte_t pte)
+{
+	return __pte(pte_val(pte) | __pte_huge_mask());
+}
+
+static inline bool is_hugetlb_pte(pte_t pte)
+{
+	return !!(pte_val(pte) & __pte_huge_mask());
 }
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
@@ -858,6 +869,19 @@ static inline unsigned long pud_pfn(pud_t pud)
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
 		   pte_t *ptep, pte_t orig, int fullmm);
 
+static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+				pte_t *ptep, pte_t orig, int fullmm)
+{
+	/* It is more efficient to let flush_tlb_kernel_range()
+	 * handle init_mm tlb flushes.
+	 *
+	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
+	 * and SUN4V pte layout, so this inline test is fine.
+	 */
+	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
+		tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+}
+
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr,
@@ -874,15 +898,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	pte_t orig = *ptep;
 
 	*ptep = pte;
-
-	/* It is more efficient to let flush_tlb_kernel_range()
-	 * handle init_mm tlb flushes.
-	 *
-	 * SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
-	 * and SUN4V pte layout, so this inline test is fine.
-	 */
-	if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-		tlb_batch_add(mm, addr, ptep, orig, fullmm);
+	maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
 #define set_pte_at(mm,addr,ptep,pte)	\
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index dea1cfa..a8e192e 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -8,6 +8,7 @@
 #define TLB_BATCH_NR	192
 
 struct tlb_batch {
+	bool huge;
 	struct mm_struct *mm;
 	unsigned long tlb_nr;
 	unsigned long active;
@@ -16,7 +17,7 @@ struct tlb_batch {
 
 void flush_tsb_kernel_range(unsigned long start, unsigned long end);
 void flush_tsb_user(struct tlb_batch *tb);
-void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr);
+void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr, bool huge);
 
 /* TLB flush operations.
 */
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 4977800..ba52e64 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -176,17 +176,31 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, pte_t entry)
 {
 	int i;
+	pte_t orig[2];
+	un
Fwd: [PATCH] char:misc minor is overflowing
Hi,

Is there any modification / improvement needed in this patch ?

--- Original Message ---
Sender : Shivnandan Kumar
         Engineer/SRI-Noida-Advance Solutions - System 1 R&D Group/Samsung Electronics
Date : Nov 20, 2015 15:35 (GMT+05:30)
Title : [PATCH] char:misc minor is overflowing

When a driver registers as a misc driver and tries to allocate a minor
number dynamically, there is a chance of minor number overflow. The
problem is that 64 (DYNAMIC_MINORS) is not enough for dynamic minor
numbers, and if the kernel defines 0-63 for dynamic minors, that range
should be reserved. But 0-10 was used for other devices; for example,
1 is reserved for PSMOUSE. I hit an issue where misc_minors was 0x3FFF,
so the value of variable 'i' in misc_register became 62 and misc->minor
became 1 (which was already reserved for PSMOUSE). This patch helps to
avoid the above problem.

Signed-off-by: shivnandan kumar
---
 drivers/char/misc.c        | 5 ++---
 include/linux/miscdevice.h | 1 +
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/char/misc.c b/drivers/char/misc.c
index 8069b36..1a6a640 100644
--- a/drivers/char/misc.c
+++ b/drivers/char/misc.c
@@ -198,7 +198,7 @@ int misc_register(struct miscdevice * misc)
 			err = -EBUSY;
 			goto out;
 		}
-		misc->minor = DYNAMIC_MINORS - i - 1;
+		misc->minor = DYNAMIC_MINORS - i - 1 + DYNAMIC_MINOR_START;
 		set_bit(i, misc_minors);
 	} else {
 		struct miscdevice *c;
@@ -218,8 +218,7 @@ int misc_register(struct miscdevice * misc)
 		misc, misc->groups, "%s", misc->name);
 	if (IS_ERR(misc->this_device)) {
 		if (is_dynamic) {
-			int i = DYNAMIC_MINORS - misc->minor - 1;
+			int i = DYNAMIC_MINORS - misc->minor - 1 + DYNAMIC_MINOR_START;
 			if (i < DYNAMIC_MINORS && i >= 0)
 				clear_bit(i, misc_minors);
 			misc->minor = MISC_DYNAMIC_MINOR;
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 81f6e42..7aa931e 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -19,6 +19,7 @@
 #define APOLLO_MOUSE_MINOR	7	/* unused */
 #define PC110PAD_MINOR		9	/* unused */
 /*#define ADB_MOUSE_MINOR	10	FIXME OBSOLETE */
+#define DYNAMIC_MINOR_START	11
 #define WATCHDOG_MINOR		130	/* Watchdog timer */
 #define TEMP_MINOR		131	/* Temperature Sensor */
 #define RTC_MINOR		135
-- 
1.7.9.5

Nitin Gupta
Logix Cyber Park • Plot No. C 28-29, Tower D - Ground to 10th Floor,
Tower C - 8th to 10th Floor, Sector 62 • Noida (U.P.) 201301 • INDIA
[PATCH] sparc64: Fix numa distance values
Orabug: 21896119

Use the machine descriptor (MD) to get node latency values instead of
just using default values.

Testing:
On a T5-8 system with:
 - total nodes = 8
 - self latencies = 0x26d18
 - latency to other nodes = 0x3a598
   => latency ratio = ~1.5

output of numactl --hardware

 - before fix:
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

 - after fix:
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  15  15  15  15  15  15  15
  1:  15  10  15  15  15  15  15  15
  2:  15  15  10  15  15  15  15  15
  3:  15  15  15  10  15  15  15  15
  4:  15  15  15  15  10  15  15  15
  5:  15  15  15  15  15  10  15  15
  6:  15  15  15  15  15  15  10  15
  7:  15  15  15  15  15  15  15  10

Signed-off-by: Nitin Gupta
Reviewed-by: Chris Hyser
Reviewed-by: Santosh Shilimkar
---
Changelog v1 -> v2:
 - Drop extern keyword for function prototype (Sam Ravnborg)

 arch/sparc/include/asm/topology_64.h |  3 +
 arch/sparc/mm/init_64.c              | 70 +-
 2 files changed, 72 insertions(+), 1 deletions(-)

diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 01d1704..bec481a 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,6 +31,9 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 	cpu_all_mask : \
 	cpumask_of_node(pcibus_to_node(bus)))
 
+int __node_distance(int, int);
+#define node_distance(a, b) __node_distance(a, b)
+
 #else /* CONFIG_NUMA */
 
 #include
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 4ac88b7..3025bd5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -93,6 +93,8 @@ static unsigned long cpu_pgsz_mask;
 static struct linux_prom64_registers pavail[MAX_BANKS];
 static int pavail_ents;
 
+u64 numa_latency[MAX_NUMNODES][MAX_NUMNODES];
+
 static int cmp_p64(const void *a, const void *b)
 {
 	const struct linux_prom64_registers *x = a, *y = b;
@@ -1157,6 +1159,48 @@ static struct mdesc_mlgroup * __init find_mlgroup(u64 node)
 	return NULL;
 }
 
+int __node_distance(int from, int to)
+{
+	if ((from >= MAX_NUMNODES) || (to >= MAX_NUMNODES)) {
+		pr_warn("Returning default NUMA distance value for %d->%d\n",
+			from, to);
+		return (from == to) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+	}
+	return numa_latency[from][to];
+}
+
+static int find_best_numa_node_for_mlgroup(struct mdesc_mlgroup *grp)
+{
+	int i;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		struct node_mem_mask *n = &node_masks[i];
+
+		if ((grp->mask == n->mask) && (grp->match == n->val))
+			break;
+	}
+	return i;
+}
+
+static void find_numa_latencies_for_group(struct mdesc_handle *md, u64 grp,
+					  int index)
+{
+	u64 arc;
+
+	mdesc_for_each_arc(arc, md, grp, MDESC_ARC_TYPE_FWD) {
+		int tnode;
+		u64 target = mdesc_arc_target(md, arc);
+		struct mdesc_mlgroup *m = find_mlgroup(target);
+
+		if (!m)
+			continue;
+		tnode = find_best_numa_node_for_mlgroup(m);
+		if (tnode == MAX_NUMNODES)
+			continue;
+		numa_latency[index][tnode] = m->latency;
+	}
+}
+
 static int __init numa_attach_mlgroup(struct mdesc_handle *md, u64 grp,
 				      int index)
 {
@@ -1220,9 +1264,16 @@ static int __init numa_parse_mdesc_group(struct mdesc_handle *md, u64 grp,
 static int __init numa_parse_mdesc(void)
 {
 	struct mdesc_handle *md = mdesc_grab();
-	int i, err, count;
+	int i, j, err, count;
 	u64 node;
 
+	/* Some sane defaults for numa latency values */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		for (j = 0; j < MAX_NUMNODES; j++)
+			numa_latency[i][j] = (i == j) ?
+				LOCAL_DISTANCE : REMOTE_DISTANCE;
+	}
+
 	node = mdesc_node_by_name(md, MDESC_NODE_NULL, "latency-groups");
 	if (node == MDESC_NODE_NULL) {
 		mdesc_release(md);
@@ -1245,6 +1296,23 @@ static int __init numa_parse_mdesc(void)
 		count++;
 	}
 
+	count = 0;
+	mdesc_for_each_node_by_name(md, node, "group") {
+		find_numa_latencies_for_group(md, node, count);
+		count++;
+	}
+
+	/* Normalize numa latency matrix according to ACPI SLIT spec. */
+	for (i =
Re: [PATCH] sparc64: Fix numa distance values
On 10/29/2015 11:50 AM, Sam Ravnborg wrote:
> Small nit.
>
>> diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
>> index 01d1704..ed3dfdd 100644
>> --- a/arch/sparc/include/asm/topology_64.h
>> +++ b/arch/sparc/include/asm/topology_64.h
>> @@ -31,6 +31,9 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
>>  	cpu_all_mask : \
>>  	cpumask_of_node(pcibus_to_node(bus)))
>>
>> +extern int __node_distance(int, int);
>
> We have dropped using "extern" for function prototypes.

ok, dropped extern here.

>> +#define node_distance(a, b) __node_distance(a, b)
>
> And had this been written as:
>
>     #define node_distance node_distance
>     int node_distance(int, int);
>
> then there had been no need for the leading underscores.

underscores here to separate macro name from function name seems to be
clearer and would also avoid confusing cross-referencing tools.

> But as I said - only nits.
>
> 	Sam

Thanks for the review.
Nitin
[PATCH] sparc64: Fix numa distance values
Orabug: 21896119 Use machine descriptor (MD) to get node latency values instead of just using default values. Testing: On an T5-8 system with: - total nodes = 8 - self latencies = 0x26d18 - latency to other nodes = 0x3a598 => latency ratio = ~1.5 output of numactl --hardware - before fix: node distances: node 0 1 2 3 4 5 6 7 0: 10 20 20 20 20 20 20 20 1: 20 10 20 20 20 20 20 20 2: 20 20 10 20 20 20 20 20 3: 20 20 20 10 20 20 20 20 4: 20 20 20 20 10 20 20 20 5: 20 20 20 20 20 10 20 20 6: 20 20 20 20 20 20 10 20 7: 20 20 20 20 20 20 20 10 - after fix: node distances: node 0 1 2 3 4 5 6 7 0: 10 15 15 15 15 15 15 15 1: 15 10 15 15 15 15 15 15 2: 15 15 10 15 15 15 15 15 3: 15 15 15 10 15 15 15 15 4: 15 15 15 15 10 15 15 15 5: 15 15 15 15 15 10 15 15 6: 15 15 15 15 15 15 10 15 7: 15 15 15 15 15 15 15 10 Signed-off-by: Nitin Gupta Reviewed-by: Chris Hyser Reviewed-by: Santosh Shilimkar --- arch/sparc/include/asm/topology_64.h |3 + arch/sparc/mm/init_64.c | 70 +- 2 files changed, 72 insertions(+), 1 deletions(-) diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h index 01d1704..ed3dfdd 100644 --- a/arch/sparc/include/asm/topology_64.h +++ b/arch/sparc/include/asm/topology_64.h @@ -31,6 +31,9 @@ static inline int pcibus_to_node(struct pci_bus *pbus) cpu_all_mask : \ cpumask_of_node(pcibus_to_node(bus))) +extern int __node_distance(int, int); +#define node_distance(a, b) __node_distance(a, b) + #else /* CONFIG_NUMA */ #include diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 4ac88b7..3025bd5 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -93,6 +93,8 @@ static unsigned long cpu_pgsz_mask; static struct linux_prom64_registers pavail[MAX_BANKS]; static int pavail_ents; +u64 numa_latency[MAX_NUMNODES][MAX_NUMNODES]; + static int cmp_p64(const void *a, const void *b) { const struct linux_prom64_registers *x = a, *y = b; @@ -1157,6 +1159,48 @@ static struct mdesc_mlgroup * __init find_mlgroup(u64 node) 
 	return NULL;
 }
 
+int __node_distance(int from, int to)
+{
+	if ((from >= MAX_NUMNODES) || (to >= MAX_NUMNODES)) {
+		pr_warn("Returning default NUMA distance value for %d->%d\n",
+			from, to);
+		return (from == to) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+	}
+	return numa_latency[from][to];
+}
+
+static int find_best_numa_node_for_mlgroup(struct mdesc_mlgroup *grp)
+{
+	int i;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		struct node_mem_mask *n = &node_masks[i];
+
+		if ((grp->mask == n->mask) && (grp->match == n->val))
+			break;
+	}
+	return i;
+}
+
+static void find_numa_latencies_for_group(struct mdesc_handle *md, u64 grp,
+					  int index)
+{
+	u64 arc;
+
+	mdesc_for_each_arc(arc, md, grp, MDESC_ARC_TYPE_FWD) {
+		int tnode;
+		u64 target = mdesc_arc_target(md, arc);
+		struct mdesc_mlgroup *m = find_mlgroup(target);
+
+		if (!m)
+			continue;
+		tnode = find_best_numa_node_for_mlgroup(m);
+		if (tnode == MAX_NUMNODES)
+			continue;
+		numa_latency[index][tnode] = m->latency;
+	}
+}
+
 static int __init numa_attach_mlgroup(struct mdesc_handle *md, u64 grp,
 				      int index)
 {
@@ -1220,9 +1264,16 @@ static int __init numa_parse_mdesc_group(struct mdesc_handle *md, u64 grp,
 static int __init numa_parse_mdesc(void)
 {
 	struct mdesc_handle *md = mdesc_grab();
-	int i, err, count;
+	int i, j, err, count;
 	u64 node;
 
+	/* Some sane defaults for numa latency values */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		for (j = 0; j < MAX_NUMNODES; j++)
+			numa_latency[i][j] = (i == j) ?
+				LOCAL_DISTANCE : REMOTE_DISTANCE;
+	}
+
 	node = mdesc_node_by_name(md, MDESC_NODE_NULL, "latency-groups");
 	if (node == MDESC_NODE_NULL) {
 		mdesc_release(md);
@@ -1245,6 +1296,23 @@ static int __init numa_parse_mdesc(void)
 		count++;
 	}
 
+	count = 0;
+	mdesc_for_each_node_by_name(md, node, "group") {
+		find_numa_latencies_for_group(md, node, count);
+		count++;
+	}
+
+	/* Normalize numa latency matrix according to ACPI SLIT spec. */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		u64 sel
Re: [PATCH] staging: zsmalloc: Ensure handle is never 0 on success
On 11/12/13, 6:42 PM, Greg KH wrote:
> On Wed, Nov 13, 2013 at 12:41:38AM +0900, Minchan Kim wrote:
>> We spent much time with preventing zram enhance since it have been in
>> staging and Greg never want to improve without promotion.
>
> It's not "improve", it's "Greg does not want you adding new features and
> functionality while the code is in staging."
>
> I want you to spend your time on getting it out of staging first.  Now if
> something needs to be done based on review and comments to the code, then
> that's fine to do and I'll accept that, but I've been seeing new
> functionality be added to the code, which I will not accept because it
> seems that you all have given up on getting it merged, which isn't ok.

It's not that people have given up on getting it merged but every time
patches are posted, there is really no response from maintainers, perhaps
due to their lack of interest in embedded, or perhaps they believe embedded
folks are making a wrong choice by using zram. Either way, a final word,
instead of just silence, would be more helpful.

Thanks,
Nitin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH v2] staging: zsmalloc: Ensure handle is never 0 on success
On Thu, Nov 7, 2013 at 5:58 PM, Olav Haugan wrote:
> zsmalloc encodes a handle using the pfn and an object
> index. On hardware platforms with physical memory starting
> at 0x0 the pfn can be 0. This causes the encoded handle to be
> 0 and is incorrectly interpreted as an allocation failure.
>
> To prevent this false error we ensure that the encoded handle
> will not be 0 when allocation succeeds.
>
> Signed-off-by: Olav Haugan
> ---
>  drivers/staging/zsmalloc/zsmalloc-main.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c
> index 1a67537..3b950e5 100644
> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
> @@ -430,7 +430,12 @@ static struct page *get_next_page(struct page *page)
>  	return next;
>  }
>
> -/* Encode <page, obj_idx> as a single handle value */
> +/*
> + * Encode <page, obj_idx> as a single handle value.
> + * On hardware platforms with physical memory starting at 0x0 the pfn
> + * could be 0 so we ensure that the handle will never be 0 by adjusting the
> + * encoded obj_idx value before encoding.
> + */
>  static void *obj_location_to_handle(struct page *page, unsigned long obj_idx)
>  {
>  	unsigned long handle;
> @@ -441,17 +446,21 @@ static void *obj_location_to_handle(struct page *page, unsigned long obj_idx)
>  	}
>
>  	handle = page_to_pfn(page) << OBJ_INDEX_BITS;
> -	handle |= (obj_idx & OBJ_INDEX_MASK);
> +	handle |= ((obj_idx + 1) & OBJ_INDEX_MASK);
>
>  	return (void *)handle;
>  }
>
> -/* Decode <page, obj_idx> pair from the given object handle */
> +/*
> + * Decode <page, obj_idx> pair from the given object handle. We adjust the
> + * decoded obj_idx back to its original value since it was adjusted in
> + * obj_location_to_handle().
> + */
>  static void obj_handle_to_location(unsigned long handle, struct page **page,
>  				unsigned long *obj_idx)
>  {
>  	*page = pfn_to_page(handle >> OBJ_INDEX_BITS);
> -	*obj_idx = handle & OBJ_INDEX_MASK;
> +	*obj_idx = (handle & OBJ_INDEX_MASK) - 1;
>  }
>
>  static unsigned long obj_idx_to_offset(struct page *page,

Acked-by: Nitin Gupta

Thanks,
Nitin