RE: [PATCH] mm/compaction: remove unused variable sysctl_compact_memory

2021-03-03 Thread Nitin Gupta



> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of pi...@codeaurora.org
> Sent: Wednesday, March 3, 2021 6:34 AM
> To: Nitin Gupta 
> Cc: linux-kernel@vger.kernel.org; a...@linux-foundation.org; linux-
> m...@kvack.org; linux-fsde...@vger.kernel.org; iamjoonsoo@lge.com;
> sh_...@163.com; mateusznos...@gmail.com; b...@redhat.com;
> vba...@suse.cz; yzai...@google.com; keesc...@chromium.org;
> mcg...@kernel.org; mgor...@techsingularity.net; pintu.p...@gmail.com
> Subject: Re: [PATCH] mm/compaction: remove unused variable
> sysctl_compact_memory
> 
> 
> On 2021-03-03 01:48, Nitin Gupta wrote:
> >> -Original Message-
> >> From: pintu=codeaurora@mg.codeaurora.org
> >>  On Behalf Of Pintu Kumar
> >> Sent: Tuesday, March 2, 2021 9:56 AM
> >> To: linux-kernel@vger.kernel.org; a...@linux-foundation.org; linux-
> >> m...@kvack.org; linux-fsde...@vger.kernel.org; pi...@codeaurora.org;
> >> iamjoonsoo@lge.com; sh_...@163.com;
> mateusznos...@gmail.com;
> >> b...@redhat.com; Nitin Gupta ; vba...@suse.cz;
> >> yzai...@google.com; keesc...@chromium.org; mcg...@kernel.org;
> >> mgor...@techsingularity.net
> >> Cc: pintu.p...@gmail.com
> >> Subject: [PATCH] mm/compaction: remove unused variable
> >> sysctl_compact_memory
> >>
> >>
> >> The sysctl_compact_memory is mostly unused in mm/compaction.c. It just
> >> acts as a placeholder for the sysctl.
> >>
> >> Thus we can remove it from here and move the declaration directly into
> >> kernel/sysctl.c itself.
> >> This will also eliminate the extern declaration from the header file.
> >
> >
> > I prefer keeping the existing pattern of listing all compaction
> > related tunables together in compaction.h:
> >
> >   extern int sysctl_compact_memory;
> >   extern unsigned int sysctl_compaction_proactiveness;
> >   extern int sysctl_extfrag_threshold;
> >   extern int sysctl_compact_unevictable_allowed;
> >
> 
> Thanks, Nitin, for your review.
> You mean you just want to retain this extern declaration?
> Is there any real benefit to keeping this declaration if it is not used elsewhere?
> 

I see that sysctl_compaction_handler() doesn't use the sysctl value at all.
So, we can get rid of it completely, as Vlastimil suggested.
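
For reference, the handler wired to /proc/sys/vm/compact_memory is write-only
and never looks at the written value; it is roughly the following (a paraphrased
sketch of mm/compaction.c at the time, not an exact copy):

int sysctl_compaction_handler(struct ctl_table *table, int write,
			void *buffer, size_t *length, loff_t *ppos)
{
	if (write)
		compact_nodes();	/* compact every online node */

	return 0;
}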

> >
> >> No functionality is broken or changed this way.
> >>
> >> Signed-off-by: Pintu Kumar 
> >> Signed-off-by: Pintu Agarwal 
> >> ---
> >>  include/linux/compaction.h | 1 -
> >>  kernel/sysctl.c            | 1 +
> >>  mm/compaction.c            | 3 ---
> >>  3 files changed, 1 insertion(+), 4 deletions(-)
> >>
> >> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> >> index
> >> ed4070e..4221888 100644
> >> --- a/include/linux/compaction.h
> >> +++ b/include/linux/compaction.h
> >> @@ -81,7 +81,6 @@ static inline unsigned long compact_gap(unsigned int order)
> >>  }
> >>
> >>  #ifdef CONFIG_COMPACTION
> >> -extern int sysctl_compact_memory;
> >>  extern unsigned int sysctl_compaction_proactiveness;
> >>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >> 		void *buffer, size_t *length, loff_t *ppos);
> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >> index c9fbdd8..66aff21 100644
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -198,6 +198,7 @@ static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
> >>  #ifdef CONFIG_COMPACTION
> >>  static int min_extfrag_threshold;
> >>  static int max_extfrag_threshold = 1000;
> >> +static int sysctl_compact_memory;
> >>  #endif
> >>
> >>  #endif /* CONFIG_SYSCTL */
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index 190ccda..ede2886 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -2650,9 +2650,6 @@ static void compact_nodes(void)
> >> compact_node(nid);
> >>  }
> >>
> >> -/* The written value is actually unused, all memory is compacted */
> >> -int sysctl_compact_memory;
> >> -
> >
> >
> > Please retain this comment for the tunable.
> 
> Sorry, I could not understand.
> You mean to say just retain this last comment and only remove the variable?
> Again, is there any real benefit you see in retaining this even if it's not used?
> 
> 

You are just moving the declaration of sysctl_compact_memory from compaction.c
to sysctl.c, so I wanted the comment "... all memory is compacted" to be retained
with the sysctl variable. Since you are now getting rid of this variable
completely, this comment goes away too.

Thanks,
Nitin



RE: [PATCH] mm/compaction: remove unused variable sysctl_compact_memory

2021-03-02 Thread Nitin Gupta



> -Original Message-
> From: pintu=codeaurora@mg.codeaurora.org
>  On Behalf Of Pintu Kumar
> Sent: Tuesday, March 2, 2021 9:56 AM
> To: linux-kernel@vger.kernel.org; a...@linux-foundation.org; linux-
> m...@kvack.org; linux-fsde...@vger.kernel.org; pi...@codeaurora.org;
> iamjoonsoo@lge.com; sh_...@163.com; mateusznos...@gmail.com;
> b...@redhat.com; Nitin Gupta ; vba...@suse.cz;
> yzai...@google.com; keesc...@chromium.org; mcg...@kernel.org;
> mgor...@techsingularity.net
> Cc: pintu.p...@gmail.com
> Subject: [PATCH] mm/compaction: remove unused variable
> sysctl_compact_memory
> 
> 
> The sysctl_compact_memory is mostly unused in mm/compaction.c. It just acts
> as a placeholder for the sysctl.
> 
> Thus we can remove it from here and move the declaration directly into
> kernel/sysctl.c itself.
> This will also eliminate the extern declaration from the header file.
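
For context, the "placeholder" is the .data target of the compact_memory entry
in the vm_table of kernel/sysctl.c, which looks roughly like this (recalled
sketch; exact fields may differ in the tree being patched):

	{
		.procname	= "compact_memory",
		.data		= &sysctl_compact_memory,
		.maxlen		= sizeof(int),
		.mode		= 0200,
		.proc_handler	= sysctl_compaction_handler,
	},

Since sysctl_compaction_handler() never reads or stores through .data, the
variable is only there to give the table entry something to point at.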


I prefer keeping the existing pattern of listing all compaction related tunables
together in compaction.h:

extern int sysctl_compact_memory;
extern unsigned int sysctl_compaction_proactiveness;
extern int sysctl_extfrag_threshold;
extern int sysctl_compact_unevictable_allowed;


> No functionality is broken or changed this way.
> 
> Signed-off-by: Pintu Kumar 
> Signed-off-by: Pintu Agarwal 
> ---
>  include/linux/compaction.h | 1 -
>  kernel/sysctl.c            | 1 +
>  mm/compaction.c            | 3 ---
>  3 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index ed4070e..4221888 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -81,7 +81,6 @@ static inline unsigned long compact_gap(unsigned int order)
>  }
> 
>  #ifdef CONFIG_COMPACTION
> -extern int sysctl_compact_memory;
>  extern unsigned int sysctl_compaction_proactiveness;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> 		void *buffer, size_t *length, loff_t *ppos);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index c9fbdd8..66aff21 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -198,6 +198,7 @@ static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
>  #ifdef CONFIG_COMPACTION
>  static int min_extfrag_threshold;
>  static int max_extfrag_threshold = 1000;
> +static int sysctl_compact_memory;
>  #endif
> 
>  #endif /* CONFIG_SYSCTL */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 190ccda..ede2886 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2650,9 +2650,6 @@ static void compact_nodes(void)
> compact_node(nid);
>  }
> 
> -/* The written value is actually unused, all memory is compacted */
> -int sysctl_compact_memory;
> -


Please retain this comment for the tunable.

-Nitin


[PATCH] mm: Fix compile error due to COMPACTION_HPAGE_ORDER

2020-06-22 Thread Nitin Gupta
Fix a compile error when COMPACTION_HPAGE_ORDER is assigned
to HUGETLB_PAGE_ORDER. The correct way to check whether this
constant is defined is to check for CONFIG_HUGETLBFS.

Signed-off-by: Nitin Gupta 
To: Andrew Morton 
Reported-by: Nathan Chancellor 
Tested-by: Nathan Chancellor 
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 45fd24a0ea0b..02963ffb9e70 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -62,7 +62,7 @@ static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
  */
 #if defined CONFIG_TRANSPARENT_HUGEPAGE
 #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
-#elif defined HUGETLB_PAGE_ORDER
+#elif defined CONFIG_HUGETLBFS
 #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
 #else
 #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
-- 
2.27.0



Re: [PATCH v8] mm: Proactive compaction

2020-06-22 Thread Nitin Gupta
On 6/22/20 9:57 PM, Nathan Chancellor wrote:
> On Mon, Jun 22, 2020 at 09:32:12PM -0700, Nitin Gupta wrote:
>> On 6/22/20 7:26 PM, Nathan Chancellor wrote:
>>> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
>>>> For some applications, we need to allocate almost all memory as
>>>> hugepages. However, on a running system, higher-order allocations can
>>>> fail if the memory is fragmented. Linux kernel currently does on-demand
>>>> compaction as we request more hugepages, but this style of compaction
>>>> incurs very high latency. Experiments with one-time full memory
>>>> compaction (followed by hugepage allocations) show that kernel is able
>>>> to restore a highly fragmented memory state to a fairly compacted memory
>>>> state within <1 sec for a 32G system. Such data suggests that a more
>>>> proactive compaction can help us allocate a large fraction of memory as
>>>> hugepages keeping allocation latencies low.
>>>>
>>>> For a more proactive compaction, the approach taken here is to define a
>>>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>>>> for external fragmentation which kcompactd tries to maintain.
>>>>
>>>> The tunable takes a value in range [0, 100], with a default of 20.
>>>>
>>>> Note that a previous version of this patch [1] was found to introduce
>>>> too many tunables (per-order extfrag{low, high}), but this one reduces
>>>> them to just one sysctl. Also, the new tunable is an opaque value
>>>> instead of asking for specific bounds of "external fragmentation", which
>>>> would have been difficult to estimate. The internal interpretation of
>>>> this opaque value allows for future fine-tuning.
>>>>
>>>> Currently, we use a simple translation from this tunable to [low, high]
>>>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>>>> The score for a node is defined as weighted mean of per-zone external
>>>> fragmentation. A zone's present_pages determines its weight.
>>>>
>>>> To periodically check per-node score, we reuse per-node kcompactd
>>>> threads, which are woken up every 500 milliseconds to check the same. If
>>>> a node's score exceeds its high threshold (as derived from user-provided
>>>> proactiveness value), proactive compaction is started until its score
>>>> reaches its low threshold value. By default, proactiveness is set to 20,
>>>> which implies threshold values of low=80 and high=90.
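
The translation just described can be sanity-checked with a small standalone
sketch; it mirrors fragmentation_score_wmark() from the patch quoted later in
this archive and is illustrative only, not the kernel code itself:

#include <stdio.h>

/* proactiveness -> [low, high] fragmentation-score thresholds */
static unsigned int frag_score_wmark(unsigned int proactiveness, int low)
{
	unsigned int wmark_low, wmark_high;

	/* low = 100 - proactiveness, floored at 5 (the cap added in v6) */
	wmark_low = 100U - proactiveness;
	if (wmark_low < 5U)
		wmark_low = 5U;
	/* high = low + 10, capped at 100 */
	wmark_high = wmark_low + 10U > 100U ? 100U : wmark_low + 10U;

	return low ? wmark_low : wmark_high;
}

int main(void)
{
	/* Default proactiveness=20  =>  prints low=80 high=90, as stated above. */
	printf("low=%u high=%u\n", frag_score_wmark(20, 1), frag_score_wmark(20, 0));
	return 0;
}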
>>>>
>>>> This patch is largely based on ideas from Michal Hocko [2]. See also the
>>>> LWN article [3].
>>>>
>>>> Performance data
>>>> 
>>>>
>>>> System: x64_64, 1T RAM, 80 CPU threads.
>>>> Kernel: 5.6.0-rc3 + this patch
>>>>
>>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>>>
>>>> Before starting the driver, the system was fragmented from a userspace
>>>> program that allocates all memory and then for each 2M aligned section,
>>>> frees 3/4 of base pages using munmap. The workload is mainly anonymous
>>>> userspace pages, which are easy to move around. I intentionally avoided
>>>> unmovable pages in this test to see how much latency we incur when
>>>> hugepage allocations hit direct compaction.
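
A minimal sketch of such a fragmenter is below. The region size and the choice
of which quarter of each 2M section to keep are assumptions made for
illustration; the actual program is not included in the thread:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SECTION	(2UL << 20)	/* 2M sections */
#define REGION	(1UL << 30)	/* 1G here; the original filled all free memory */

int main(void)
{
	/* Over-allocate by one section so we can round up to a 2M boundary. */
	char *raw = mmap(NULL, REGION + SECTION, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;
	unsigned long off;

	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p = (char *)(((uintptr_t)raw + SECTION - 1) & ~(SECTION - 1));

	memset(p, 1, REGION);		/* fault in every base page */

	/* Keep the first 1/4 of each 2M section, munmap the remaining 3/4. */
	for (off = 0; off < REGION; off += SECTION)
		munmap(p + off + SECTION / 4, SECTION - SECTION / 4);

	pause();			/* hold the fragmented layout in place */
	return 0;
}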
>>>>
>>>> 1. Kernel hugepage allocation latencies
>>>>
>>>> With the system in such a fragmented state, a kernel driver then
>>>> allocates as many hugepages as possible and measures allocation
>>>> latency:
>>>>
>>>> (all latency values are in microseconds)
>>>>
>>>> - With vanilla 5.6.0-rc3
>>>>
>>>>   percentile latency
>>>>   –– –––
>>>>   5    7894
>>>>  10    9496
>>>>  25   12561
>>>>  30   15295
>>>>  40   18244
>>>>  50   21229
>>>>  60   27556
>>>>  75   30147
>>>>  80   31047
>>>>  90   32859
>>>>  95   33799
>>>>
>>>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>>>> 762G total free => 98% of free memory could be alloc

Re: [PATCH v8] mm: Proactive compaction

2020-06-22 Thread Nitin Gupta
On 6/22/20 7:26 PM, Nathan Chancellor wrote:
> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is able
>> to restore a highly fragmented memory state to a fairly compacted memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory as
>> hugepages keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation", which
>> would have been difficult to estimate. The internal interpretation of
>> this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as weighted mean of per-zone external
>> fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same. If
>> a node's score exceeds its high threshold (as derived from user-provided
>> proactiveness value), proactive compaction is started until its score
>> reaches its low threshold value. By default, proactiveness is set to 20,
>> which implies threshold values of low=80 and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also the
>> LWN article [3].
>>
>> Performance data
>> 
>>
>> System: x64_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then for each 2M aligned section,
>> frees 3/4 of base pages using munmap. The workload is mainly anonymous
>> userspace pages, which are easy to move around. I intentionally avoided
>> unmovable pages in this test to see how much latency we incur when
>> hugepage allocations hit direct compaction.
>>
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   –– –––
>> 5    7894
>>10    9496
>>25   12561
>>30   15295
>>40   18244
>>50   21229
>>60   27556
>>75   30147
>>80   31047
>>90   32859
>>95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   –– –––
>> 5   2
>>10   2
>>25   3
>>30   3
>>40   3
>>50   4
>>60   4
>>75   4
>>80   4
>>90   5
>>95 429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. JAVA heap allocation
>>
>> I

Re: [PATCH] mm: Use unsigned types for fragmentation score

2020-06-18 Thread Nitin Gupta
On 6/18/20 6:41 AM, Baoquan He wrote:
> On 06/17/20 at 06:03pm, Nitin Gupta wrote:
>> Proactive compaction uses per-node/zone "fragmentation score" which
>> is always in the range [0, 100], so use unsigned types for these scores
>> as well as for related constants.
>>
>> Signed-off-by: Nitin Gupta 
>> ---
>>  include/linux/compaction.h |  4 ++--
>>  kernel/sysctl.c            |  2 +-
>>  mm/compaction.c            | 18 +-
>>  mm/vmstat.c                |  2 +-
>>  4 files changed, 13 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 7a242d46454e..25a521d299c1 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -85,13 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
>>  
>>  #ifdef CONFIG_COMPACTION
>>  extern int sysctl_compact_memory;
>> -extern int sysctl_compaction_proactiveness;
>> +extern unsigned int sysctl_compaction_proactiveness;
>>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>>  void *buffer, size_t *length, loff_t *ppos);
>>  extern int sysctl_extfrag_threshold;
>>  extern int sysctl_compact_unevictable_allowed;
>>  
>> -extern int extfrag_for_order(struct zone *zone, unsigned int order);
>> +extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order);
>>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>>  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>>  unsigned int order, unsigned int alloc_flags,
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 58b0a59c9769..40180cdde486 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2833,7 +2833,7 @@ static struct ctl_table vm_table[] = {
>>  {
>>  .procname   = "compaction_proactiveness",
>>  .data   = &sysctl_compaction_proactiveness,
>> -.maxlen = sizeof(int),
>> +.maxlen = sizeof(sysctl_compaction_proactiveness),
> 
> Patch looks good to me. Wondering why not using 'unsigned int' here,
> just curious.
> 


It's just a coding-style preference. I see the same style used for many
other sysctls too (min_free_kbytes etc.).
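
For reference, that existing pattern looks roughly like this in kernel/sysctl.c
(from memory, so treat it as illustrative); using sizeof(the variable) keeps
.maxlen correct even if the variable's type changes later:

	{
		.procname	= "min_free_kbytes",
		.data		= &min_free_kbytes,
		.maxlen		= sizeof(min_free_kbytes),
		.mode		= 0644,
		.proc_handler	= min_free_kbytes_sysctl_handler,
		.extra1		= SYSCTL_ZERO,
	},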

Thanks,
Nitin



[PATCH] mm: Use unsigned types for fragmentation score

2020-06-17 Thread Nitin Gupta
Proactive compaction uses per-node/zone "fragmentation score" which
is always in the range [0, 100], so use unsigned types for these scores
as well as for related constants.

Signed-off-by: Nitin Gupta 
---
 include/linux/compaction.h |  4 ++--
 kernel/sysctl.c            |  2 +-
 mm/compaction.c            | 18 +-
 mm/vmstat.c                |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7a242d46454e..25a521d299c1 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -85,13 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
-extern int sysctl_compaction_proactiveness;
+extern unsigned int sysctl_compaction_proactiveness;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void *buffer, size_t *length, loff_t *ppos);
 extern int sysctl_extfrag_threshold;
 extern int sysctl_compact_unevictable_allowed;
 
-extern int extfrag_for_order(struct zone *zone, unsigned int order);
+extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order);
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58b0a59c9769..40180cdde486 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2833,7 +2833,7 @@ static struct ctl_table vm_table[] = {
{
.procname   = "compaction_proactiveness",
.data   = &sysctl_compaction_proactiveness,
-   .maxlen = sizeof(int),
+   .maxlen = sizeof(sysctl_compaction_proactiveness),
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
diff --git a/mm/compaction.c b/mm/compaction.c
index ac2030814edb..45fd24a0ea0b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -53,7 +53,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
 /*
  * Fragmentation score check interval for proactive compaction purposes.
  */
-static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
+static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
 
 /*
  * Page order with-respect-to which proactive compaction
@@ -1890,7 +1890,7 @@ static bool kswapd_is_running(pg_data_t *pgdat)
  * ZONE_DMA32. For smaller zones, the score value remains close to zero,
  * and thus never exceeds the high threshold for proactive compaction.
  */
-static int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone(struct zone *zone)
 {
unsigned long score;
 
@@ -1906,9 +1906,9 @@ static int fragmentation_score_zone(struct zone *zone)
  * the node's score falls below the low threshold, or one of the back-off
  * conditions is met.
  */
-static int fragmentation_score_node(pg_data_t *pgdat)
+static unsigned int fragmentation_score_node(pg_data_t *pgdat)
 {
-   unsigned long score = 0;
+   unsigned int score = 0;
int zoneid;
 
for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
@@ -1921,17 +1921,17 @@ static int fragmentation_score_node(pg_data_t *pgdat)
return score;
 }
 
-static int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
+static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
 {
-   int wmark_low;
+   unsigned int wmark_low;
 
/*
 * Cap the low watermark to avoid excessive compaction
 * activity in case a user sets the proactiveness tunable
 * close to 100 (maximum).
 */
-   wmark_low = max(100 - sysctl_compaction_proactiveness, 5);
-   return low ? wmark_low : min(wmark_low + 10, 100);
+   wmark_low = max(100U - sysctl_compaction_proactiveness, 5U);
+   return low ? wmark_low : min(wmark_low + 10, 100U);
 }
 
 static bool should_proactive_compact_node(pg_data_t *pgdat)
@@ -2604,7 +2604,7 @@ int sysctl_compact_memory;
  * aggressively the kernel should compact memory in the
  * background. It takes values in the range [0, 100].
  */
-int __read_mostly sysctl_compaction_proactiveness = 20;
+unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
 
 /*
  * This is the entry point for compacting all nodes via
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3e7ba8bce2ba..b1de695b826d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1079,7 +1079,7 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
  * It is defined as the percentage of pages found in blocks of size
  * less than 1 << order. It returns values in range [0, 100].
  */
-int extfrag_for_order(struct zone *zone, unsigned int order)
+unsigned int extfrag_for_order(struct zone *zone,

Re: [PATCH v8] mm: Proactive compaction

2020-06-17 Thread Nitin Gupta




On 6/17/20 1:53 PM, Andrew Morton wrote:

On Tue, 16 Jun 2020 13:45:27 -0700 Nitin Gupta  wrote:


For some applications, we need to allocate almost all memory as
hugepages. However, on a running system, higher-order allocations can
fail if the memory is fragmented. Linux kernel currently does on-demand
compaction as we request more hugepages, but this style of compaction
incurs very high latency. Experiments with one-time full memory
compaction (followed by hugepage allocations) show that kernel is able
to restore a highly fragmented memory state to a fairly compacted memory
state within <1 sec for a 32G system. Such data suggests that a more
proactive compaction can help us allocate a large fraction of memory as
hugepages keeping allocation latencies low.

...



All looks straightforward to me and easy to disable if it goes wrong.

All the hard-coded magic numbers are a worry, but such is life.

One teeny complaint:



...

@@ -2650,12 +2801,34 @@ static int kcompactd(void *p)
unsigned long pflags;
  
  		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);

-   wait_event_freezable(pgdat->kcompactd_wait,
-   kcompactd_work_requested(pgdat));
+   if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
+   kcompactd_work_requested(pgdat),
+   msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
+
+   psi_memstall_enter(&pflags);
+   kcompactd_do_work(pgdat);
+   psi_memstall_leave(&pflags);
+   continue;
+   }
  
-		psi_memstall_enter(&pflags);

-   kcompactd_do_work(pgdat);
-   psi_memstall_leave(&pflags);
+   /* kcompactd wait timeout */
+   if (should_proactive_compact_node(pgdat)) {
+   unsigned int prev_score, score;


Everywhere else, scores have type `int'.  Here they are unsigned.  How come?

Would it be better to make these unsigned throughout?  I don't think a
score can ever be negative?



The score is always in [0, 100], so yes, it should be unsigned.
I will send another patch which fixes this.

Thanks,
Nitin



[PATCH v8] mm: Proactive compaction

2020-06-16 Thread Nitin Gupta
e's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries; see the arithmetic sketch below).
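
The ~30 second figure works out as follows, assuming the defer limit is
1 << COMPACT_MAX_DEFER_SHIFT with COMPACT_MAX_DEFER_SHIFT = 6 (an assumption
for illustration; only the 500 ms interval is stated above):

/* Illustrative arithmetic only; COMPACT_MAX_DEFER_SHIFT = 6 is assumed. */
enum {
	HPAGE_FRAG_CHECK_INTERVAL_MSEC	= 500,
	COMPACT_MAX_DEFER_SHIFT		= 6,	/* assumed */
	MAX_PROACTIVE_DEFER_MSEC	=
		(1 << COMPACT_MAX_DEFER_SHIFT) * HPAGE_FRAG_CHECK_INTERVAL_MSEC,
	/* 64 * 500 ms = 32000 ms, i.e. the "~30 seconds" above. */
};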

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Khalid Aziz 
Reviewed-by: Oleksandr Natalenko 
Tested-by: Oleksandr Natalenko 
To: Andrew Morton 
CC: Vlastimil Babka 
CC: Khalid Aziz 
CC: Michal Hocko 
CC: Mel Gorman 
CC: Matthew Wilcox 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: Oleksandr Natalenko 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v8 vs v7:
 - Rebase to 5.8-rc1

Changelog v7 vs v6:
 - Fix compile error while THP is disabled (Oleksandr)

Changelog v6 vs v5:
 - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and
   some cleanups (Vlastimil)
 - Cap min threshold to avoid excess compaction load in case user sets
   extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid)
 - Add some more explanation about the effect of tunable on compaction
   behavior in user guide (Khalid)

Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  15 ++
 include/linux/compaction.h              |   2 +
 kernel/sysctl.c                         |   9 ++
 mm/compaction.c                         | 183 +++-
 mm/internal.h                           |   1 +
 mm/vmstat.c                             |  18 +++
 6 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index d46d5b7013c6..4b7c496199ca 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it

Re: [PATCH v7] mm: Proactive compaction

2020-06-16 Thread Nitin Gupta
On 6/16/20 2:46 AM, Oleksandr Natalenko wrote:
> Hello.
> 
> Please see the notes inline.
> 
> On Mon, Jun 15, 2020 at 07:36:14AM -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is able
>> to restore a highly fragmented memory state to a fairly compacted memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory as
>> hugepages keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation", which
>> would have been difficult to estimate. The internal interpretation of
>> this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as weighted mean of per-zone external
>> fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same. If
>> a node's score exceeds its high threshold (as derived from user-provided
>> proactiveness value), proactive compaction is started until its score
>> reaches its low threshold value. By default, proactiveness is set to 20,
>> which implies threshold values of low=80 and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also the
>> LWN article [3].
>>
>> Performance data
>> 
>>
>> System: x64_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then for each 2M aligned section,
>> frees 3/4 of base pages using munmap. The workload is mainly anonymous
>> userspace pages, which are easy to move around. I intentionally avoided
>> unmovable pages in this test to see how much latency we incur when
>> hugepage allocations hit direct compaction.
>>
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   –– –––
>> 5    7894
>>10    9496
>>25   12561
>>30   15295
>>40   18244
>>50   21229
>>60   27556
>>75   30147
>>80   31047
>>90   32859
>>95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   –– –––
>> 5   2
>>10   2
>>25   3
>>30   3
>>40   3
>>50   4
>>60   4
>>75   4
>>80   4
>>90   5
>>95 429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
&

[PATCH v7] mm: Proactive compaction

2020-06-15 Thread Nitin Gupta
e's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Khalid Aziz 
To: Andrew Morton 
CC: Vlastimil Babka 
CC: Khalid Aziz 
CC: Michal Hocko 
CC: Mel Gorman 
CC: Matthew Wilcox 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: Oleksandr Natalenko 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v7 vs v6:
 - Fix compile error while THP is disabled (Oleksandr)

Changelog v6 vs v5:
 - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and
   some cleanups (Vlastimil)
 - Cap min threshold to avoid excess compaction load in case user sets
   extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid)
 - Add some more explanation about the effect of tunable on compaction
   behavior in user guide (Khalid)

Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  15 ++
 include/linux/compaction.h              |   2 +
 kernel/sysctl.c                         |   9 ++
 mm/compaction.c                         | 183 +++-
 mm/internal.h                           |   1 +
 mm/vmstat.c                             |  18 +++
 6 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..360914b4f346 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it to extreme values like 100, as that may
+cause excessive background compaction activity.
 
 compact_unevictable_allowed
 

Re: [PATCH v6] mm: Proactive compaction

2020-06-15 Thread Nitin Gupta
On 6/15/20 7:25 AM, Oleksandr Natalenko wrote:
> On Mon, Jun 15, 2020 at 10:29:01AM +0200, Oleksandr Natalenko wrote:
>> Just to let you know, this fails to compile for me with THP disabled on
>> v5.8-rc1:
>>
>>   CC  mm/compaction.o
>> In file included from ./include/linux/dev_printk.h:14,
>>  from ./include/linux/device.h:15,
>>  from ./include/linux/node.h:18,
>>  from ./include/linux/cpu.h:17,
>>  from mm/compaction.c:11:
>> In function ‘fragmentation_score_zone’,
>> inlined from ‘__compact_finished’ at mm/compaction.c:1982:11,
>> inlined from ‘compact_zone’ at mm/compaction.c:2062:8:
>> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ 
>> declared with attribute error: BUILD_BUG failed
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^
>> ./include/linux/compiler.h:373:4: note: in definition of macro 
>> ‘__compiletime_assert’
>>   373 |prefix ## suffix();\
>>   |^~
>> ./include/linux/compiler.h:392:2: note: in expansion of macro 
>> ‘_compiletime_assert’
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^~~
>> ./include/linux/build_bug.h:39:37: note: in expansion of macro 
>> ‘compiletime_assert’
>>39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>   | ^~
>> ./include/linux/build_bug.h:59:21: note: in expansion of macro 
>> ‘BUILD_BUG_ON_MSG’
>>59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>>   | ^~~~
>> ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’
>>   319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>   |^
>> ./include/linux/huge_mm.h:115:26: note: in expansion of macro 
>> ‘HPAGE_PMD_SHIFT’
>>   115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>>   |  ^~~
>> mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’
>>64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
>>   |^~~
>> mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’
>>  1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>>   |^~
>> In function ‘fragmentation_score_zone’,
>> inlined from ‘kcompactd’ at mm/compaction.c:1918:12:
>> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ 
>> declared with attribute error: BUILD_BUG failed
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^
>> ./include/linux/compiler.h:373:4: note: in definition of macro 
>> ‘__compiletime_assert’
>>   373 |prefix ## suffix();\
>>   |^~
>> ./include/linux/compiler.h:392:2: note: in expansion of macro 
>> ‘_compiletime_assert’
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^~~
>> ./include/linux/build_bug.h:39:37: note: in expansion of macro 
>> ‘compiletime_assert’
>>39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>   | ^~
>> ./include/linux/build_bug.h:59:21: note: in expansion of macro 
>> ‘BUILD_BUG_ON_MSG’
>>59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>>   | ^~~~
>> ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’
>>   319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>   |^
>> ./include/linux/huge_mm.h:115:26: note: in expansion of macro 
>> ‘HPAGE_PMD_SHIFT’
>>   115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>>   |  ^~~
>> mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’
>>64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
>>   |^~~
>> mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’
>>  1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>>   |^~
>> In function ‘fragmentation_score_zone’,
>> inlined from ‘kcompactd’ at mm/compaction.c:1918:12:
>> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ 
>> declared with attribute error: BUILD_BUG failed
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^
>> ./include/linux/compiler.h:373:4: note: in definition of macro 
>> ‘__compiletime_assert’
>>   373 |prefix ## suffix();\
>>   

Re: [PATCH v6] mm: Proactive compaction

2020-06-11 Thread Nitin Gupta
On 6/9/20 12:23 PM, Khalid Aziz wrote:
> On Mon, 2020-06-01 at 12:48 -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-
>> demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is
>> able
>> to restore a highly fragmented memory state to a fairly compacted
>> memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory
>> as
>> hugepages keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define
>> a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one
>> reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation",
>> which
>> would have been difficult to estimate. The internal interpretation of
>> this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low,
>> high]
>> "fragmentation score" thresholds (low=100-proactiveness,
>> high=low+10%).
>> The score for a node is defined as weighted mean of per-zone external
>> fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same.
>> If
>> a node's score exceeds its high threshold (as derived from user-
>> provided
>> proactiveness value), proactive compaction is started until its score
>> reaches its low threshold value. By default, proactiveness is set to
>> 20,
>> which implies threshold values of low=80 and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also
>> the
>> LWN article [3].
>>
>> Performance data
>> 
>>
>> System: x64_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a
>> userspace
>> program that allocates all memory and then for each 2M aligned
>> section,
>> frees 3/4 of base pages using munmap. The workload is mainly
>> anonymous
>> userspace pages, which are easy to move around. I intentionally
>> avoided
>> unmovable pages in this test to see how much latency we incur when
>> hugepage allocations hit direct compaction.
>>
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   –– –––
>> 5    7894
>>10    9496
>>25   12561
>>30   15295
>>40   18244
>>50   21229
>>60   27556
>>75   30147
>>80   31047
>>90   32859
>>95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as
>> hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   –– –––
>> 5   2
>>10   2
>>25   3
>>30   3
>>40   3
>>50   4
>>60   4
>>75   4
>>80   4
>>90   5
>>95 429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of

Re: [PATCH v6] mm: Proactive compaction

2020-06-09 Thread Nitin Gupta
On Mon, Jun 1, 2020 at 12:48 PM Nitin Gupta  wrote:
>
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
>

> Signed-off-by: Nitin Gupta 
> Reviewed-by: Vlastimil Babka 

(+CC Khalid)

Can this be pipelined for upstream inclusion now? Sorry, I'm a bit
rusty on upstream flow these days.

Thanks,
Nitin


[PATCH v6] mm: Proactive compaction

2020-06-01 Thread Nitin Gupta
e's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta 
Reviewed-by: Vlastimil Babka 
To: Mel Gorman 
To: Michal Hocko 
To: Vlastimil Babka 
CC: Matthew Wilcox 
CC: Andrew Morton 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v6 vs v5:
 - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and
   some cleanups (Vlastimil)
 - Cap min threshold to avoid excess compaction load in case user sets
   extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid)
 - Add some more explanation about the effect of tunable on compaction
   behavior in user guide (Khalid)

Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  15 ++
 include/linux/compaction.h              |   2 +
 kernel/sysctl.c                         |   9 ++
 mm/compaction.c                         | 183 +++-
 mm/internal.h                           |   1 +
 mm/vmstat.c                             |  18 +++
 6 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..360914b4f346 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it to extreme values like 100, as that may
+cause excessive background compaction activity.
 
 compact_unevictable_allowed
 ===
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4b898cdbdf05..ccd28978b296 100644
--- a/include/linux/

Re: [PATCH v5] mm: Proactive compaction

2020-05-28 Thread Nitin Gupta
On Thu, May 28, 2020 at 4:32 PM Khalid Aziz  wrote:
>
> This looks good to me. I like the idea overall of controlling
> aggressiveness of compaction with a single tunable for the whole
> system. I wonder how an end user could arrive at what a reasonable
> value would be for this based upon their workload. More comments below.
>

Tunables like the one this patch introduces, and similar ones like 'swappiness',
will always require some experimentation from the user.


> On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. Linux kernel currently does on-
> > demand
> > compaction as we request more hugepages, but this style of compaction
> > incurs very high latency. Experiments with one-time full memory
> > compaction (followed by hugepage allocations) show that kernel is
> > able
> > to restore a highly fragmented memory state to a fairly compacted
> > memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory
> > as
> > hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness' which dictates bounds for
> > external
> > fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to
> > maintain.
> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes value in range [0, 100], with a default of 20.
>
> Looking at the code, setting this to 100 would mean system would
> continuously strive to drive level of fragmentation down to 0 which can
> not be reasonable and would bog the system down. A cap lower than 100
> might be a good idea to keep kcompactd from dragging system down.
>

Yes, I understand that a value of 100 would mean a continuous compaction
storm, but I still don't want to artificially cap the tunable. The interpretation
of this tunable can change in the future, and a range of [0, 100] seems
more intuitive than, say, [0, 90]. Still, I think a word of caution should
be added to its documentation (admin-guide/sysctl/vm.rst).


> >

> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as
> > hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness"
>

Oops... I forgot to update the patch description. This is from the v4 patch,
which used sysfs; v5 switched to using sysctl.


> >

> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..e5d88cabe980 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -119,6 +119,19 @@ all zones are compacted such that free memory is available in contiguous
> >  blocks where possible. This can be important for example in the allocation of
> >  huge pages although processes will also directly compact memory as required.
> >
> > +compaction_proactiveness
> > +========================
> > +
> > +This tunable takes a value in the range [0, 100] with a default value of
> > +20. This tunable determines how aggressively compaction is done in the
> > +background. Setting it to 0 disables proactive compaction.
> > +
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> > +
>
> Value of 100 would cause kcompactd to try to bring fragmentation down
> to 0. If hugepages are being consumed and released continuously by the
> workload, it is possible that kcompactd keeps making progress (and
> hence passes the test "proactive_defer = score < prev_score ?")
> continuously but can not reach a fragmentation score of 0 and hence
> gets stuck in compact_zone() for a long time. Page migration for
> compaction is not inexpensive. Maybe either cap the value to something
> less than 100 or set a floor fo

Re: [PATCH v5] mm: Proactive compaction

2020-05-28 Thread Nitin Gupta
On Wed, May 27, 2020 at 3:18 AM Vlastimil Babka  wrote:
>
> On 5/18/20 8:14 PM, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. Linux kernel currently does on-demand
> > compaction as we request more hugepages, but this style of compaction
> > incurs very high latency. Experiments with one-time full memory
> > compaction (followed by hugepage allocations) show that kernel is able
> > to restore a highly fragmented memory state to a fairly compacted memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory as
> > hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness' which dictates bounds for external
> > fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to
>
> HPAGE_PMD_ORDER
>

Since HPAGE_PMD_ORDER is not always defined, and we may thus have
to fall back to HUGETLB_PAGE_ORDER or even PMD_ORDER, I think
I should remove references to the specific order from the patch description entirely.

I also need to change the tunable name from 'proactiveness' to the
'vm.compaction_proactiveness' sysctl.

modified description:
===
For a more proactive compaction, the approach taken here is to define
a new sysctl called 'vm.compaction_proactiveness' which dictates
bounds for external fragmentation which kcompactd tries to ...
===


> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes value in range [0, 100], with a default of 20.
> >


> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
>
> Make this link a [2] reference? I would also add: "See also the LWN article
> [3]." where [3] is https://lwn.net/Articles/817905/
>
>

Sounds good. I will turn these into [2] and [3] references.



>
> Reviewed-by: Vlastimil Babka 
>

> With some smaller nitpicks below.
>
> But as we are adding a new API, I would really appreciate others comment about
> the approach at least.
>



> > +/*
> > + * A zone's fragmentation score is the external fragmentation wrt to the
> > + * HUGETLB_PAGE_ORDER scaled by the zone's size. It returns a value in the
>
> HPAGE_PMD_ORDER
>

Maybe just remove the reference to the order, as I mentioned above?



> > +/*
> > + * Tunable for proactive compaction. It determines how
> > + * aggressively the kernel should compact memory in the
> > + * background. It takes values in the range [0, 100].
> > + */
> > +int sysctl_compaction_proactiveness = 20;
>
> These are usually __read_mostly
>

Ok.


> > +
> >  /*
> >   * This is the entry point for compacting all nodes via
> >   * /proc/sys/vm/compact_memory
> > @@ -2637,6 +2769,7 @@ static int kcompactd(void *p)
> >  {
> >   pg_data_t *pgdat = (pg_data_t*)p;
> >   struct task_struct *tsk = current;
> > + unsigned int proactive_defer = 0;
> >
> >   const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> >
> > @@ -2652,12 +2785,34 @@ static int kcompactd(void *p)
> >   unsigned long pflags;
> >
> >   trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> > - wait_event_freezable(pgdat->kcompactd_wait,
> > - kcompactd_work_requested(pgdat));
> > + if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
> > + kcompactd_work_requested(pgdat),
> > + msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
>
> Hmm perhaps the wakeups should also backoff if there's nothing to do?


Perhaps. I just wanted to keep it simple, and waking a thread to do a
quick calculation didn't seem expensive to me, so I prefer this simplistic
approach for now.
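
For reference, the patch already bounds how much work these periodic wakeups
can trigger through the proactive_defer counter mentioned elsewhere in this
discussion. A rough sketch of that bookkeeping; helper names such as
fragmentation_score_node() are assumptions rather than verbatim patch code:

static void proactive_compact_tick(pg_data_t *pgdat, unsigned int *proactive_defer)
{
	unsigned int prev_score, score;

	if (*proactive_defer) {
		(*proactive_defer)--;	/* skip this wakeup entirely */
		return;
	}
	prev_score = fragmentation_score_node(pgdat);
	proactive_compact_node(pgdat);
	score = fragmentation_score_node(pgdat);
	/* Back off for a number of wakeups if no progress was made. */
	*proactive_defer = score < prev_score ?
			0 : 1 << COMPACT_MAX_DEFER_SHIFT;
}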


> > +/*
> > + * Calculates external fragmentation within a zone wrt the given order.
> > + * It is defined as the percentage of pages found in blocks of size
> > + * less than 1 << order. It returns values in range [0, 100].
> > + */
> > +int extfrag_for_order(struct zone *zone, unsigned int order)
> > +{
> > + struct contig_page_info info;
> > +
> > + fill_contig_page_info(zone, order, &info);
> > + if (info.free_pages == 0)
> > + return 0;
> > +
> > + return (info.free_pages - (info.free_blocks_suitable << order)) * 100
> > + / info.free_pages;
>
> I guess this should also use div_u64() like __fragmentation_index() does.
>

Ok.
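
A sketch of the function with that change applied (this just wraps the
division from the hunk quoted above in div_u64(); otherwise unchanged):

int extfrag_for_order(struct zone *zone, unsigned int order)
{
	struct contig_page_info info;

	fill_contig_page_info(zone, order, &info);
	if (info.free_pages == 0)
		return 0;

	return div_u64((info.free_pages -
			(info.free_blocks_suitable << order)) * 100,
		       info.free_pages);
}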


> > +}
> > +
> >  /* Same as __fragmentation index but allocs contig_page_info on stack */
> >  int fragmentation_index(struct zone *zone, unsigned int order)
> >  {
> >
>


Thanks,
Nitin


Re: [PATCH v5] mm: Proactive compaction

2020-05-28 Thread Nitin Gupta
On Thu, May 28, 2020 at 2:50 AM Vlastimil Babka  wrote:
>
> On 5/28/20 11:15 AM, Holger Hoffstätte wrote:
> >
> > On 5/18/20 8:14 PM, Nitin Gupta wrote:
> > [patch v5 :)]
> >
> > I've been successfully using this in my tree and it works great, but a 
> > friend
> > who also uses my tree just found a bug (actually an improvement ;) due to 
> > the
> > change from HUGETLB_PAGE_ORDER to HPAGE_PMD_ORDER in v5.
> >
> > When building with CONFIG_TRANSPARENT_HUGEPAGE=n (for some reason it was 
> > off)
> > HPAGE_PMD_SHIFT expands to BUILD_BUG() and compilation fails like this:
>
> Oops, I forgot about this. Still I believe HPAGE_PMD_ORDER is the best choice 
> as
> long as THP's are enabled. I guess fallback to HUGETLB_PAGE_ORDER would be
> possible if THPS are not enabled, but AFAICS some architectures don't define
> that. Such architectures perhaps won't benefit from proactive compaction 
> anyway?
>

I am not sure about such architectures, but in such cases we would end up
calculating the "fragmentation score" based on a page size which does not
match the architecture's view of the "default hugepage size". That is not a
terrible thing in itself, as compaction can still be done in the background,
after all.

Since we always need a target order to calculate the fragmentation score, how
about this fallback scheme:

HPAGE_PMD_ORDER -> HUGETLB_PAGE_ORDER -> PMD_ORDER
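
In code, such a compile-time fallback might look roughly like this (a sketch;
the macro name and the PMD-based last resort are assumptions, not part of the
posted patch):

#if defined(CONFIG_TRANSPARENT_HUGEPAGE)
#define COMPACTION_HPAGE_ORDER	HPAGE_PMD_ORDER
#elif defined(HUGETLB_PAGE_ORDER)
#define COMPACTION_HPAGE_ORDER	HUGETLB_PAGE_ORDER
#else
/* PMD order via PMD_SHIFT, since PMD_ORDER itself is not defined everywhere */
#define COMPACTION_HPAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
#endif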

Thanks,
Nitin


[PATCH v5] mm: Proactive compaction

2020-05-18 Thread Nitin Gupta
e
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.
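
For reference, the relationship between the proactiveness value and these
score thresholds can be sketched as below, consistent with the low=80/high=90
numbers above for proactiveness=20; treat the exact formula as an assumption:

static unsigned int fragmentation_score_wmark(bool low)
{
	unsigned int wmark_low = 100U - sysctl_compaction_proactiveness;

	return low ? wmark_low : min(wmark_low + 10U, 100U);
}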

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while they try to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times,
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/

Signed-off-by: Nitin Gupta 
To: Mel Gorman 
To: Michal Hocko 
To: Vlastimil Babka 
CC: Matthew Wilcox 
CC: Andrew Morton 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - HUGETLB_PAGE_ORDER -> HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  13 ++
 include/linux/compaction.h  |   2 +
 kernel/sysctl.c |   9 ++
 mm/compaction.c | 165 +++-
 mm/internal.h   |   1 +
 mm/vmstat.c |  17 +++
 6 files changed, 202 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..e5d88cabe980 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,19 @@ all zones are compacted such that free memory is available 
in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
 
 compact_unevictable_allowed
 ===
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4b898cdbdf05..ccd28978b296 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
+extern int sysctl_compaction_proactiveness;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *

[PATCH v4] mm: Proactive compaction

2020-04-28 Thread Nitin Gupta
e
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while they try to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times,
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/

Signed-off-by: Nitin Gupta 
To: Mel Gorman 
To: Michal Hocko 
To: Vlastimil Babka 
CC: Matthew Wilcox 
CC: Andrew Morton 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 .../admin-guide/mm/proactive-compaction.rst   |  26 ++
 MAINTAINERS   |   6 +
 include/linux/compaction.h|   1 +
 mm/compaction.c   | 236 +-
 mm/internal.h |   1 +
 mm/page_alloc.c   |   1 +
 mm/vmstat.c   |  17 ++
 7 files changed, 282 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/proactive-compaction.rst

diff --git a/Documentation/admin-guide/mm/proactive-compaction.rst 
b/Documentation/admin-guide/mm/proactive-compaction.rst
new file mode 100644
index ..510f47e38238
--- /dev/null
+++ b/Documentation/admin-guide/mm/proactive-compaction.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _proactive_compaction:
+
+
+Proactive Compaction
+
+
+Many applications benefit significantly from the use of huge pages.
+However, huge-page allocations often incur a high latency or even fail
+under fragmented memory conditions. Proactive compaction provides an
+effective solution to these problems by doing memory compaction in the
+background.
+
+The process of proactive compaction is controlled by a single tunable:
+
+/sys/kernel/mm/compaction/proactiveness
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
diff --git a/MAINTAINERS b/MAINTAINERS
index 26f281d9f32a..e448c0b35ecb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18737,6 +18737,12 @@ L: linux...@kvack.org
 S: Maintained
 F: mm/zswap.c
 
+PROACTIVE COMPACTION
+M: Nitin Gupta 
+L: linux...@kvack.org
+S: Maintained
+F: Docu

Re: [RFC] mm: Proactive compaction

2019-09-19 Thread Nitin Gupta
On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> > 
> > Testing done (on x86):
> >   - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >   respectively.
> >   - Use a test program to fragment memory: the program allocates all
> > memory
> >   and then for each 2M aligned section, frees 3/4 of base pages using
> >   munmap.
> >   - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >   compaction till extfrag < extfrag_low for order-9.
> > 
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to
> their
> needs?


Yes, it's difficult for an admin to get so many tunables right unless
targeting a very specific workload.

How about a simpler solution where we expose just one tunable per node:
   /sys/.../node-x/compaction_effort
which accepts values in [0, 100]

This parallels /proc/sys/vm/swappiness but for compaction. With this
single number, we can estimate per-order [low, high] watermarks for external
fragmentation like this:
 - For now, map this range to [low, medium, high], which corresponds to specific
low, high thresholds for extfrag.
 - Apply more relaxed thresholds for higher orders than for lower orders.

With this single tunable we remove the burden of setting explicit per-order
[low, high] thresholds, and it should be easier to experiment with.
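
A purely illustrative mapping from a single 0-100 "compaction_effort" value to
per-order [low, high] extfrag thresholds, relaxing them for higher orders; the
formula is an assumption, not from any posted patch:

static void extfrag_thresholds(unsigned int effort, unsigned int order,
			       unsigned int *low, unsigned int *high)
{
	/* More effort => tighter thresholds; higher order => more relaxed. */
	*low  = min(100u, (100u - effort) + 2 * order);
	*high = min(100u, *low + 10u);
}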

-Nitin





Re: [RFC] mm: Proactive compaction

2019-09-19 Thread Nitin Gupta
On Thu, 2019-08-22 at 09:51 +0100, Mel Gorman wrote:
> As unappealing as it sounds, I think it is better to try improve the
> allocation latency itself instead of trying to hide the cost in a kernel
> thread. It's far harder to implement as compaction is not easy but it
> would be more obvious what the savings are by looking at a histogram of
> allocation latencies -- there are other metrics that could be considered
> but that's the obvious one.
> 

Do you mean reducing allocation latency, especially when it hits the direct
compaction path? Do you have any ideas in mind for this? I'm open to
working on them and reporting back latency numbers, while I think more about
less tunable-heavy background (proactive) compaction approaches.

-Nitin



Re: [RFC] mm: Proactive compaction

2019-09-16 Thread Nitin Gupta
On Mon, 2019-09-16 at 13:16 -0700, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
> 
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> > 
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> > 
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> > 
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> > 
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> > 
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays inactive
> > even if extfrag thresholds are not met.
> > 
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> > 
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >  respectively.
> >  - Use a test program to fragment memory: the program allocates all memory
> >  and then for each 2M aligned section, frees 3/4 of base pages using
> >  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >  compaction till extfrag < extfrag_low for order-9.
> > 
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> > 
> 
> Is there an update to this proposal or non-RFC patch that has been posted 
> for proactive compaction?
> 
> We've had good success with periodically compacting memory on a regular 
> cadence on systems with hugepages enabled.  The cadence itself is defined 
> by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> compaction in an attempt to keep zones as defragmented as possible 
> (perhaps more "proactive" than what is proposed here in an attempt to keep 
> all memory as unfragmented as possible regardless of extfrag thresholds).  
> It also avoids corner-cases where kcompactd could become more expensive 
> than what is anticipated because it is unsuccessful at compacting memory 
> yet the extfrag threshold is still exceeded.
> 
>  [*] Khugepaged instead of kcompactd only because this is only enabled
>  for systems where transparent hugepages are enabled, probably better
>  off in kcompactd to avoid duplicating work between two kthreads if
>  there is already a need for background compaction.
> 


Discussion on this RFC patch revolved around the issue of exposing too
many tunables (per-node, per-order, [low-high] extfrag thresholds). It
was sort-of concluded that no admin will get these tunables right for
a variety of workloads.

To eliminate the need for tunables, I proposed another patch:

https://patchwork.kernel.org/patch/11140067/

which does not add any tunables but extends and exports an existing
function (compact_zone_order). In summary, this new patch adds a
callback function which allows any driver to implement ad-hoc
compaction policies. There is also a sample driver which makes use
of this interface to keep hugepage external fragmentation within
specified range (exposed through debugfs):

https://gitlab.com/nigupta/linux/snippets/1894161

-Nitin



Re: [PATCH] mm: Add callback for defining compaction completion

2019-09-12 Thread Nitin Gupta
On Thu, 2019-09-12 at 17:11 +0530, Bharath Vedartham wrote:
> Hi Nitin,
> On Wed, Sep 11, 2019 at 10:33:39PM +, Nitin Gupta wrote:
> > On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> > > On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> > > [...]
> > > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > > > For some applications we need to allocate almost all memory as
> > > > > > hugepages.
> > > > > > However, on a running system, higher order allocations can fail if
> > > > > > the
> > > > > > memory is fragmented. Linux kernel currently does on-demand
> > > > > > compaction
> > > > > > as we request more hugepages but this style of compaction incurs
> > > > > > very
> > > > > > high latency. Experiments with one-time full memory compaction
> > > > > > (followed by hugepage allocations) shows that kernel is able to
> > > > > > restore a highly fragmented memory state to a fairly compacted
> > > > > > memory
> > > > > > state within <1 sec for a 32G system. Such data suggests that a
> > > > > > more
> > > > > > proactive compaction can help us allocate a large fraction of
> > > > > > memory
> > > > > > as hugepages keeping allocation latencies low.
> > > > > > 
> > > > > > In general, compaction can introduce unexpected latencies for
> > > > > > applications that don't even have strong requirements for
> > > > > > contiguous
> > > > > > allocations.
> > > 
> > > Could you expand on this a bit please? Gfp flags allow to express how
> > > much the allocator try and compact for a high order allocations. Hugetlb
> > > allocations tend to require retrying and heavy compaction to succeed and
> > > the success rate tends to be pretty high from my experience.  Why that
> > > is not case in your case?
> > > 
> The link to the driver you send on gitlab is not working :(

Sorry about that, here's the correct link:
https://gitlab.com/nigupta/linux/snippets/1894161

> > Yes, I have the same observation: with `GFP_TRANSHUGE |
> > __GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM
> > allocated as hugepages). However, what I'm trying to point out is that
> > this
> > high success rate comes with high allocation latencies (90th percentile
> > latency of 2206us). On the same system, the same high-order allocations
> > which hit the fast path have latency <5us.
> > 
> > > > > > It is also hard to efficiently determine if the current
> > > > > > system state can be easily compacted due to mixing of unmovable
> > > > > > memory. Due to these reasons, automatic background compaction by
> > > > > > the
> > > > > > kernel itself is hard to get right in a way which does not hurt
> > > > > > unsuspecting
> > > > > applications or waste CPU cycles.
> > > > > 
> > > > > We do trigger background compaction on a high order pressure from
> > > > > the
> > > > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > > > 
> > > > 
> > > > Whenever kcompactd is woken up, it does just enough work to create
> > > > one free page of the given order (compaction_control.order) or higher.
> > > 
> > > This is an implementation detail IMHO. I am pretty sure we can do a
> > > better auto tuning when there is an indication of a constant flow of
> > > high order requests. This is no different from the memory reclaim in
> > > principle. Just because the kswapd autotuning not fitting with your
> > > particular workload you wouldn't want to export direct reclaim
> > > functionality and call it from a random module. That is just doomed to
> > > fail because different subsystems in control just leads to decisions
> > > going against each other.
> > > 
> > 
> > I don't want to go the route of adding any auto-tuning/perdiction code to
> > control compaction in the kernel. I'm more inclined towards extending
> > existing interfaces to allow compaction behavior to be controlled either
> > from userspace or a kernel driver. Letting a random module control
> > compaction or a root process pumping new tunables from sysfs is the same
> > in
> > principle.

Re: [PATCH] mm: Add callback for defining compaction completion

2019-09-11 Thread Nitin Gupta
On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> [...]
> > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > For some applications we need to allocate almost all memory as
> > > > hugepages.
> > > > However, on a running system, higher order allocations can fail if the
> > > > memory is fragmented. Linux kernel currently does on-demand
> > > > compaction
> > > > as we request more hugepages but this style of compaction incurs very
> > > > high latency. Experiments with one-time full memory compaction
> > > > (followed by hugepage allocations) shows that kernel is able to
> > > > restore a highly fragmented memory state to a fairly compacted memory
> > > > state within <1 sec for a 32G system. Such data suggests that a more
> > > > proactive compaction can help us allocate a large fraction of memory
> > > > as hugepages keeping allocation latencies low.
> > > > 
> > > > In general, compaction can introduce unexpected latencies for
> > > > applications that don't even have strong requirements for contiguous
> > > > allocations.
> 
> Could you expand on this a bit please? Gfp flags allow to express how
> much the allocator try and compact for a high order allocations. Hugetlb
> allocations tend to require retrying and heavy compaction to succeed and
> the success rate tends to be pretty high from my experience.  Why that
> is not case in your case?
> 

Yes, I have the same observation: with `GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL` I get a very good success rate (~90% of free RAM
allocated as hugepages). However, what I'm trying to point out is that this
high success rate comes with high allocation latencies (90th percentile
latency of 2206us). On the same system, the same high-order allocations
which hit the fast path have latency <5us.

> > > > It is also hard to efficiently determine if the current
> > > > system state can be easily compacted due to mixing of unmovable
> > > > memory. Due to these reasons, automatic background compaction by the
> > > > kernel itself is hard to get right in a way which does not hurt
> > > > unsuspecting
> > > applications or waste CPU cycles.
> > > 
> > > We do trigger background compaction on a high order pressure from the
> > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > 
> > 
> > Whenever kcompactd is woken up, it does just enough work to create
> > one free page of the given order (compaction_control.order) or higher.
> 
> This is an implementation detail IMHO. I am pretty sure we can do a
> better auto tuning when there is an indication of a constant flow of
> high order requests. This is no different from the memory reclaim in
> principle. Just because the kswapd autotuning not fitting with your
> particular workload you wouldn't want to export direct reclaim
> functionality and call it from a random module. That is just doomed to
> fail because different subsystems in control just leads to decisions
> going against each other.
> 

I don't want to go the route of adding any auto-tuning/prediction code to
control compaction in the kernel. I'm more inclined towards extending
existing interfaces to allow compaction behavior to be controlled either
from userspace or a kernel driver. Letting a random module control
compaction or a root process pumping new tunables from sysfs is the same in
principle.

This patch is in the spirit of a simple extension to the existing
compact_zone_order(), which allows either a kernel driver or userspace
(through sysfs) to control compaction.

Also, we should avoid driving hard parallels between reclaim and
compaction: the former is often necessary for forward progress while the
latter is often an optimization. Since contiguous allocations are mostly
an optimization, it's good to expose hooks from the kernel that let the user
(through a driver or userspace) control compaction using their own heuristics.


I thought hard about what's lacking in the current userspace interface (sysfs):
 - /proc/sys/vm/compact_memory: full-system compaction is not viable as
   a proactive compaction strategy.
 - possibly expose [low, high] threshold values for each node and let
   kcompactd act on them. This was the approach of my original patch linked
   earlier. The problem here is that it introduces too many tunables.

Considering the above, I came up with this callback approach, which makes it
trivial to introduce user-specific policies for compaction. It puts the
onus of system stability and responsiveness in the hands of the user without
burdening admins with more tunables or adding cry

RE: [PATCH] mm: Add callback for defining compaction completion

2019-09-10 Thread Nitin Gupta
> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of Michal Hocko
> Sent: Tuesday, September 10, 2019 1:19 PM
> To: Nitin Gupta 
> Cc: a...@linux-foundation.org; vba...@suse.cz;
> mgor...@techsingularity.net; dan.j.willi...@intel.com;
> khalid.a...@oracle.com; Matthew Wilcox ; Yu Zhao
> ; Qian Cai ; Andrey Ryabinin
> ; Allison Randal ; Mike
> Rapoport ; Thomas Gleixner
> ; Arun KS ; Wei Yang
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org
> Subject: Re: [PATCH] mm: Add callback for defining compaction completion
> 
> On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> hugepages.
> > However, on a running system, higher order allocations can fail if the
> > memory is fragmented. Linux kernel currently does on-demand
> compaction
> > as we request more hugepages but this style of compaction incurs very
> > high latency. Experiments with one-time full memory compaction
> > (followed by hugepage allocations) shows that kernel is able to
> > restore a highly fragmented memory state to a fairly compacted memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory
> > as hugepages keeping allocation latencies low.
> >
> > In general, compaction can introduce unexpected latencies for
> > applications that don't even have strong requirements for contiguous
> > allocations. It is also hard to efficiently determine if the current
> > system state can be easily compacted due to mixing of unmovable
> > memory. Due to these reasons, automatic background compaction by the
> > kernel itself is hard to get right in a way which does not hurt unsuspecting
> applications or waste CPU cycles.
> 
> We do trigger background compaction on a high order pressure from the
> page allocator by waking up kcompactd. Why is that not sufficient?
> 

Whenever kcompactd is woken up, it does just enough work to create
one free page of the given order (compact_control.order) or higher.

Such a design causes very high latency for workloads where we want
to allocate lots of hugepages in a short period of time. With proactive
compaction we can hide much of this latency. For some more background
discussion and data, please see this thread:

https://patchwork.kernel.org/patch/11098289/

> > Even with these caveats, pro-active compaction can still be very
> > useful in certain scenarios to reduce hugepage allocation latencies.
> > This callback interface allows drivers to drive compaction based on
> > their own policies like the current level of external fragmentation
> > for a particular order, system load etc.
> 
> So we do not trust the core MM to make a reasonable decision while we give
> a free ticket to modules. How does this make any sense at all? How is a
> random module going to make a more informed decision when it has less
> visibility on the overal MM situation.
>

Embedding any specific policy (like: keep external fragmentation for order-9
between 30-40%) within the MM core looks like a bad idea. As a driver, we
can easily measure parameters like system load and the current fragmentation
level for any order in any zone to make an informed decision.
See the thread I referred to above for more background discussion.

> If you need to control compaction from the userspace you have an interface
> for that.  It is also completely unexplained why you need a completion
> callback.
> 

/proc/sys/vm/compact_memory does whole-system compaction, which is
often too much as a proactive compaction strategy. To get more control
over how much compaction work is done, I have added a compaction callback
which controls how much work is done in one compaction cycle.
 
For example, as a test for this patch, I have a small test driver which defines
[low, high] external fragmentation thresholds for HPAGE_ORDER. Whenever
extfrag is within this range, I run compact_zone_order with a callback which
returns COMPACT_CONTINUE while extfrag is above the low threshold and returns
COMPACT_PARTIAL_SKIPPED once extfrag <= low.
Here's the code for this sample driver:
https://gitlab.com/nigupta/memstress/snippets/1893847

Maybe this code can be added to Documentation/...
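
A sketch of what such a completion callback could look like (zone_extfrag()
and extfrag_low stand in for the sample driver's own bookkeeping and are
assumptions, not upstream APIs):

static enum compact_result hpage_compact_finished(struct zone *zone, int order)
{
	if (zone_extfrag(zone, order) <= extfrag_low)
		return COMPACT_PARTIAL_SKIPPED;	/* low threshold reached, stop */

	return COMPACT_CONTINUE;		/* keep compacting */
}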

Thanks,
Nitin

> 
> > Signed-off-by: Nitin Gupta 
> > ---
> >  include/linux/compaction.h | 10 ++
> >  mm/compaction.c| 20 ++--
> >  mm/internal.h  |  2 ++
> >  3 files changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..1ea828450fa2 100644
> > --- a/include/linux/compaction.

[PATCH] mm: Add callback for defining compaction completion

2019-09-10 Thread Nitin Gupta
For some applications we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. The Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that the kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages while keeping
allocation latencies low.

In general, compaction can introduce unexpected latencies for applications
that don't even have strong requirements for contiguous allocations. It is
also hard to efficiently determine if the current system state can be
easily compacted due to mixing of unmovable memory. Due to these reasons,
automatic background compaction by the kernel itself is hard to get right
in a way which does not hurt unsuspecting applications or waste CPU cycles.

Even with these caveats, pro-active compaction can still be very useful in
certain scenarios to reduce hugepage allocation latencies. This callback
interface allows drivers to drive compaction based on their own policies
like the current level of external fragmentation for a particular order,
system load etc.

Signed-off-by: Nitin Gupta 
---
 include/linux/compaction.h | 10 ++
 mm/compaction.c| 20 ++--
 mm/internal.h  |  2 ++
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..1ea828450fa2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,16 @@ enum compact_result {
COMPACT_SUCCESS,
 };
 
+/* Callback function to determine if compaction is finished. */
+typedef enum compact_result (*compact_finished_cb)(
+   struct zone *zone, int order);
+
+enum compact_result compact_zone_order(struct zone *zone, int order,
+   gfp_t gfp_mask, enum compact_priority prio,
+   unsigned int alloc_flags, int classzone_idx,
+   struct page **capture,
+   compact_finished_cb compact_finished_cb);
+
 struct alloc_context; /* in mm/internal.h */
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..73e2e9246bc4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1872,6 +1872,9 @@ static enum compact_result __compact_finished(struct 
compact_control *cc)
return COMPACT_PARTIAL_SKIPPED;
}
 
+   if (cc->compact_finished_cb)
+   return cc->compact_finished_cb(cc->zone, cc->order);
+
if (is_via_compact_memory(cc->order))
return COMPACT_CONTINUE;
 
@@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc, struct 
capture_control *capc)
return ret;
 }
 
-static enum compact_result compact_zone_order(struct zone *zone, int order,
+enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int classzone_idx,
-   struct page **capture)
+   struct page **capture,
+   compact_finished_cb compact_finished_cb)
 {
enum compact_result ret;
struct compact_control cc = {
@@ -2293,10 +2297,11 @@ static enum compact_result compact_zone_order(struct 
zone *zone, int order,
MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
.alloc_flags = alloc_flags,
.classzone_idx = classzone_idx,
-   .direct_compaction = true,
+   .direct_compaction = !compact_finished_cb,
.whole_zone = (prio == MIN_COMPACT_PRIORITY),
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
-   .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
+   .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY),
+   .compact_finished_cb = compact_finished_cb
};
struct capture_control capc = {
.cc = &cc,
@@ -2313,11 +2318,13 @@ static enum compact_result compact_zone_order(struct 
zone *zone, int order,
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
 
-   *capture = capc.page;
+   if (capture)
+   *capture = capc.page;
current->capture_control = NULL;
 
return ret;
 }
+EXPORT_SYMBOL(compact_zone_order);
 
 int sysctl_extfrag_threshold = 500;
 
@@ -2361,7 +2368,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, 
unsigned int order,
}
 
status = compact_zone_order(zone, order, gfp_mask, prio,
-   allo

Re: [RFC] mm: Proactive compaction

2019-08-27 Thread Nitin Gupta
On Mon, 2019-08-26 at 12:47 +0100, Mel Gorman wrote:
> On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote:
> > > Note that proactive compaction may reduce allocation latency but
> > > it is not
> > > free either. Even though the scanning and migration may happen in
> > > a kernel
> > > thread, tasks can incur faults while waiting for compaction to
> > > complete if the
> > > task accesses data being migrated. This means that costs are
> > > incurred by
> > > applications on a system that may never care about high-order
> > > allocation
> > > latency -- particularly if the allocations typically happen at
> > > application
> > > initialisation time.  I recognise that kcompactd makes a bit of
> > > effort to
> > > compact memory out-of-band but it also is typically triggered in
> > > response to
> > > reclaim that was triggered by a high-order allocation request.
> > > i.e. the work
> > > done by the thread is triggered by an allocation request that hit
> > > the slow
> > > paths and not a preemptive measure.
> > > 
> > 
> > Hitting the slow path for every higher-order allocation is a
> > signification
> > performance/latency issue for applications that requires a large
> > number of
> > these allocations to succeed in bursts. To get some concrete
> > numbers, I
> > made a small driver that allocates as many hugepages as possible
> > and
> > measures allocation latency:
> > 
> 
> Every higher-order allocation does not necessarily hit the slow path
> nor
> does it incur equal latency.

I did not mean *every* hugepage allocation in a literal sense.
I meant to say that higher-order allocations *tend* to hit the slow path
with high probability under a reasonably fragmented memory state,
and when they do, they incur high latency.


> 
> > The driver first tries to allocate hugepage using
> > GFP_TRANSHUGE_LIGHT
> > (referred to as "Light" in the table below) and if that fails,
> > tries to
> > allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
> > "Fallback" in table below). We stop the allocation loop if both
> > methods
> > fail.
> > 
> > Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All
> > latencies
> > are in microsec.
> > 
> > > GFP/Stat |Any |   Light |   Fallback |
> > > : | -: | --: | -: |
> > >count |   9908 | 788 |   9120 |
> > >  min |0.0 | 0.0 | 1726.0 |
> > >  max |   135387.0 |   142.0 |   135387.0 |
> > > mean |5494.66 |1.83 |5969.26 |
> > >   stddev |   21624.04 |7.58 |   22476.06 |
> 
> Given that it is expected that there would be significant tail
> latencies,
> it would be better to analyse this in terms of percentiles. A very
> small
> number of high latency allocations would skew the mean significantly
> which is hinted by the stddev.
> 

Here is the same data in terms of percentiles:

- with vanilla kernel 5.3.0-rc5:

percentile   latency (us)
––––––––––   ––––––––––––
         5              1
        10           1790
        25           1829
        30           1838
        40           1854
        50           1871
        60           1890
        75           1924
        80           1945
        90           2206
        95           2302


- Now with kernel 5.3.0-rc5 + this patch:

percentile   latency (us)
––––––––––   ––––––––––––
         5              3
        10              4
        25              4
        30              4
        40              4
        50              4
        60              4
        75              5
        80              5
        90              9
        95           1154


> > As you can see, the mean and stddev of allocation is extremely high
> > with
> > the current approach of on-demand compaction.
> > 
> > The system was fragmented from a userspace program as I described
> > in this
> > patch description. The workload is mainly anonymous userspace pages
> > which
> > as easy to move around. I intentionally avoided unmovable pages in
> > this
> > test to see how much latency do we incur just by hitting the slow
> > path for
> > a majority of allocations.
> > 
> 
> Even though, the penalty for proactive compaction is that
> applications
> that may have no interest in higher-order pages may still stall while
> their data is migrated if the data is hot. This is why I think the
> focus
> should be on reducing the latency of compaction -- it benefits
> applications that require higher-order latencies without increasing
> the
> overhead for unrelated applications.
> 

Sure, reducing compaction latency would help b

Re: [RFC] mm: Proactive compaction

2019-08-22 Thread Nitin Gupta
> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of Mel Gorman
> Sent: Thursday, August 22, 2019 1:52 AM
> To: Nitin Gupta 
> Cc: a...@linux-foundation.org; vba...@suse.cz; mho...@suse.com;
> dan.j.willi...@intel.com; Yu Zhao ; Matthew Wilcox
> ; Qian Cai ; Andrey Ryabinin
> ; Roman Gushchin ; Greg Kroah-
> Hartman ; Kees Cook
> ; Jann Horn ; Johannes
> Weiner ; Arun KS ; Janne
> Huttunen ; Konstantin Khlebnikov
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org
> Subject: Re: [RFC] mm: Proactive compaction
> 
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> 
> Note that proactive compaction may reduce allocation latency but it is not
> free either. Even though the scanning and migration may happen in a kernel
> thread, tasks can incur faults while waiting for compaction to complete if the
> task accesses data being migrated. This means that costs are incurred by
> applications on a system that may never care about high-order allocation
> latency -- particularly if the allocations typically happen at application
> initialisation time.  I recognise that kcompactd makes a bit of effort to
> compact memory out-of-band but it also is typically triggered in response to
> reclaim that was triggered by a high-order allocation request. i.e. the work
> done by the thread is triggered by an allocation request that hit the slow
> paths and not a preemptive measure.
> 

Hitting the slow path for every higher-order allocation is a significant
performance/latency issue for applications that require a large number of
these allocations to succeed in bursts. To get some concrete numbers, I
made a small driver that allocates as many hugepages as possible and
measures allocation latency:

The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
(referred to as "Light" in the table below) and if that fails, tries to
allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
"Fallback" in table below). We stop the allocation loop if both methods
fail.
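
A rough sketch of that measurement loop (kernel-module context;
record_latency_us() is an assumed bookkeeping helper, not a kernel API):

static void measure_hugepage_alloc_latency(void)
{
	struct page *page;
	u64 t0;

	for (;;) {
		t0 = ktime_get_ns();
		/* "Light": give up quickly, no heavy reclaim/compaction. */
		page = alloc_pages(GFP_TRANSHUGE_LIGHT, HPAGE_PMD_ORDER);
		if (!page)
			/* "Fallback": retry with full direct compaction. */
			page = alloc_pages(GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL,
					   HPAGE_PMD_ORDER);
		if (!page)
			break;	/* both methods failed: stop the loop */
		record_latency_us((ktime_get_ns() - t0) / NSEC_PER_USEC);
	}
}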

Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies
are in microsec.

| GFP/Stat |Any |   Light |   Fallback |
|: | -: | --: | -: |
|count |   9908 | 788 |   9120 |
|  min |0.0 | 0.0 | 1726.0 |
|  max |   135387.0 |   142.0 |   135387.0 |
| mean |5494.66 |1.83 |5969.26 |
|   stddev |   21624.04 |7.58 |   22476.06 |

As you can see, the mean and stddev of allocation latency are extremely high
with the current approach of on-demand compaction.

The system was fragmented from a userspace program as I described in this
patch description. The workload is mainly anonymous userspace pages, which
are easy to move around. I intentionally avoided unmovable pages in this
test to see how much latency we incur just by hitting the slow path for
a majority of allocations.


> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> 
> These will be difficult for an admin to tune that is not extremely familiar 
> with
> how external fragmentation is defined. If an admin asked "how much will
> stalls be reduced by setting this to a different value?", the answer will 
> always
> be "I don't know, maybe some, maybe not".
>

Yes, this is my main worry. These values can be set to empirically
determined values on highly specialized systems like database appliances.
However, on a generic system, there is no real reasonable value.


Still, at the very least, I would like an interface that allows compacting
the system to a reasonable state. Something like:

compact_extfrag(node, zone, order, high, low)

which starts compaction if extfrag > hi

RE: [RFC] mm: Proactive compaction

2019-08-21 Thread Nitin Gupta



> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of Matthew Wilcox
> Sent: Tuesday, August 20, 2019 3:21 PM
> To: Nitin Gupta 
> Cc: a...@linux-foundation.org; vba...@suse.cz;
> mgor...@techsingularity.net; mho...@suse.com;
> dan.j.willi...@intel.com; Yu Zhao ; Qian Cai
> ; Andrey Ryabinin ; Roman
> Gushchin ; Greg Kroah-Hartman
> ; Kees Cook ; Jann
> Horn ; Johannes Weiner ; Arun
> KS ; Janne Huttunen
> ; Konstantin Khlebnikov
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org
> Subject: Re: [RFC] mm: Proactive compaction
> 
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> >  - Use a test program to fragment memory: the program allocates all
> > memory  and then for each 2M aligned section, frees 3/4 of base pages
> > using  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > starts  compaction till extfrag < extfrag_low for order-9.
> 
> Your test program is a good idea, but I worry it may produce unrealistically
> optimistic outcomes.  Page cache is readily reclaimable, so you're setting up
> a situation where 2MB pages can once again be produced.
> 
> How about this:
> 
> One program which creates a file several times the size of memory (or
> several files which total the same amount).  Then read the file(s).  Maybe by
> mmap(), and just do nice easy sequential accesses.
> 
> A second program which causes slab allocations.  eg
> 
> for (;;) {
>   for (i = 0; i < n * 1000 * 1000; i++) {
>   char fname[64];
> 
>   sprintf(fname, "/tmp/missing.%d", i);
>   open(fname, O_RDWR);
>   }
> }
> 
> The first program should thrash the pagecache, causing pages to
> continuously be allocated, reclaimed and freed.  The second will create
> millions of dentries, causing the slab allocator to allocate a lot of
> order-0 pages which are harder to free.  If you really want to make it work
> hard, mix in opening some files whihc actually exist, preventing the pages
> which contain those dentries from being evicted.
> 
> This feels like it's simulating a more normal workload than your test.
> What do you think?

This combination of workloads for mixing movable and unmovable
pages sounds good.   I coded up these two and here's what I observed:

- kernel: 5.3.0-rc5 + this patch, x86_64, 32G RAM.
- Set extfrag_{low,high} = {25,30} for order-9
- Run pagecache and dentry thrash test programs as you described
- for pagecache test: mmap and sequentially read a 128G file on a 32G system.
- for dentry test: set n=100. I created /tmp/missing.[0-1] so these
dentries stay allocated.
- Start linux kernel compile for further pagecache thrashing.

With the above workload, fragmentation for order-9 stayed at 80-90%, which kept
kcompactd0 working, but it couldn't make progress due to unmovable pages
from dentries. As expected, we keep hitting compaction_deferred() as
compaction attempts fail.

After a manual `echo 3 > /proc/sys/vm/drop_caches` and stopping the dentry
thrasher, kcompactd succeeded in bringing extfrag below the set thresholds.


With unmovable pages spread across memory, there is little that compaction
can do. Maybe we should have a knob like 'compactness' (like swappiness) which
defines how aggressive compaction can be. For high values, maybe allow
freeing dentries too? This way hugepage-sensitive applications can trade
higher I/O latencies for better hugepage availability.

Thanks,
Nitin








RE: [RFC] mm: Proactive compaction

2019-08-20 Thread Nitin Gupta
> -Original Message-
> From: Vlastimil Babka 
> Sent: Tuesday, August 20, 2019 1:46 AM
> To: Nitin Gupta ; a...@linux-foundation.org;
> mgor...@techsingularity.net; mho...@suse.com;
> dan.j.willi...@intel.com
> Cc: Yu Zhao ; Matthew Wilcox ;
> Qian Cai ; Andrey Ryabinin ; Roman
> Gushchin ; Greg Kroah-Hartman
> ; Kees Cook ; Jann
> Horn ; Johannes Weiner ; Arun
> KS ; Janne Huttunen
> ; Konstantin Khlebnikov
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org; Khalid Aziz 
> Subject: Re: [RFC] mm: Proactive compaction
> 
> +CC Khalid Aziz who proposed a different approach:
> https://lore.kernel.org/linux-mm/20190813014012.30232-1-
> khalid.a...@oracle.com/T/#u
> 
> On 8/16/19 11:43 PM, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> 
> Could you define what exactly extfrag is, in the changelog?
> 

extfrag for order-n = ((total free pages) - (free pages for order >= n)) / (total free pages) * 100

I will add this to v2 changelog.
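
A worked example of this definition, with made-up numbers (1000 free pages in
the zone, 600 of them sitting in blocks of order >= n):

/* extfrag = (1000 - 600) * 100 / 1000 = 40% */
static unsigned int extfrag_pct(unsigned long total_free,
				unsigned long free_order_n_or_higher)
{
	if (!total_free)
		return 0;
	return (total_free - free_order_n_or_higher) * 100 / total_free;
}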


> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays
> > inactive even if extfrag thresholds are not met.
> 
> How does it translate to e.g. the number of free pages of order?
> 

Watermarks are checked as follows (see: __compaction_suitable)

watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);

If a zone does not satisfy this watermark, we don't start compaction.

> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-
> mm/20161230131412.gi13...@dhcp22.suse.cz
> > /
> >
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> >  - Use a test program to fragment memory: the program allocates all
> > memory  and then for each 2M aligned section, frees 3/4 of base pages
> > using  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > starts  compaction till extfrag < extfrag_low for order-9.
> >
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to
> their needs?


I expect that a workload would typically care about just one particular page
order (say, order-9 on x86 for the default hugepage size). An admin can set
extfrag_{low,high} for just that order (say, low=25, high=30) and leave the
thresholds at their default values (low=100, high=100) for all other orders.
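
For example, with the sysfs layout proposed by this patch (these files do not
exist in mainline), the order-9 thresholds above could be set from a small
helper like the following illustrative sketch:

#include <stdio.h>

/* Hypothetical helper; the sysfs paths below are the interface proposed
 * by this RFC, not an existing kernel ABI. */
static int write_sysfs(const char *path, int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", val);
	return fclose(f);
}

int main(void)
{
	/* order-9 (2M on x86): start background compaction above 30%
	 * extfrag, stop once it drops below 25%. */
	write_sysfs("/sys/kernel/mm/compaction/order-9/extfrag_low", 25);
	write_sysfs("/sys/kernel/mm/compaction/order-9/extfrag_high", 30);
	return 0;
}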

Thanks,
Nitin


> 
> (keeping the rest for reference)
> 
> > Signed-off-by: Nitin Gupta 
> > ---
> >  include/linux/compaction.h |  12 ++
> >  mm/compaction.c| 250 ++---
> >  mm/vmstat.c|  12 ++
> >  3 files changed, 228 insertions(+), 46 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..2

[RFC] mm: Proactive compaction

2019-08-16 Thread Nitin Gupta
For some applications we need to allocate almost all memory as
hugepages. However, on a running system, higher order allocations can
fail if the memory is fragmented. Linux kernel currently does
on-demand compaction as we request more hugepages but this style of
compaction incurs very high latency. Experiments with one-time full
memory compaction (followed by hugepage allocations) show that the
kernel is able to restore a highly fragmented memory state to a fairly
compacted memory state within <1 sec for a 32G system. Such data
suggests that a more proactive compaction can help us allocate a large
fraction of memory as hugepages while keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define
per page-order external fragmentation thresholds and let kcompactd
threads act on these thresholds.

The low and high thresholds are defined per page-order and exposed
through sysfs:

  /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}

Per-node kcompactd thread is woken up every few seconds to check if
any zone on its node has extfrag above the extfrag_high threshold for
any order, in which case the thread starts compaction in the background
till all zones are below the extfrag_low level for all orders. By default
both these thresholds are set to 100 for all orders which essentially
disables kcompactd.

To avoid wasting CPU cycles when compaction cannot help, such as when
memory is full, we check both, extfrag > extfrag_high and
compaction_suitable(zone). This allows the kcompactd thread to stay inactive
even if extfrag thresholds are not met.

This patch is largely based on ideas from Michal Hocko posted here:
https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/

Testing done (on x86):
 - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
 respectively.
 - Use a test program to fragment memory: the program allocates all memory
 and then, for each 2M-aligned section, frees 3/4 of the base pages using
 munmap (sketched below).
 - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
 compaction till extfrag < extfrag_low for order-9.
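
A minimal sketch of such a fragmenter (an illustration, not the actual test
program used):

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t total = 16UL << 30;	/* size to roughly fill RAM; adjust */
	size_t sect = 2UL << 20;	/* 2M sections */
	size_t psize = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* In every 2M section, keep one base page out of each group of
	 * four and munmap the other three. */
	for (size_t s = 0; s < total; s += sect)
		for (size_t off = 0; off + 4 * psize <= sect; off += 4 * psize)
			munmap(p + s + off, 3 * psize);

	pause();	/* hold the remaining pages while compaction runs */
	return 0;
}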

The patch has plenty of rough edges but posting it early to see if I'm
going in the right direction and to get some early feedback.

Signed-off-by: Nitin Gupta 
---
 include/linux/compaction.h |  12 ++
 mm/compaction.c| 250 ++---
 mm/vmstat.c|  12 ++
 3 files changed, 228 insertions(+), 46 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..26bfedbbc64b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -60,6 +60,17 @@ enum compact_result {
 
 struct alloc_context; /* in mm/internal.h */
 
+// "order-%d"
+#define COMPACTION_ORDER_STATE_NAME_LEN 16
+// Per-order compaction state
+struct compaction_order_state {
+   unsigned int order;
+   unsigned int extfrag_low;
+   unsigned int extfrag_high;
+   unsigned int extfrag_curr;
+   char name[COMPACTION_ORDER_STATE_NAME_LEN];
+};
+
 /*
  * Number of free order-0 pages that should be available above given watermark
  * to make sure compaction has reasonable chance of not running out of free
@@ -90,6 +101,7 @@ extern int sysctl_compaction_handler(struct ctl_table 
*table, int write,
 extern int sysctl_extfrag_threshold;
 extern int sysctl_compact_unevictable_allowed;
 
+extern int extfrag_for_order(struct zone *zone, unsigned int order);
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..21866b1ad249 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -25,6 +25,10 @@
 #include 
 #include "internal.h"
 
+#ifdef CONFIG_COMPACTION
+struct compaction_order_state compaction_order_states[MAX_ORDER+1];
+#endif
+
 #ifdef CONFIG_COMPACTION
 static inline void count_compact_event(enum vm_event_item item)
 {
@@ -1846,6 +1850,49 @@ static inline bool is_via_compact_memory(int order)
return order == -1;
 }
 
+static int extfrag_wmark_high(struct zone *zone)
+{
+   int order;
+
+   for (order = 1; order <= MAX_ORDER; order++) {
+   int extfrag = extfrag_for_order(zone, order);
+   int threshold = compaction_order_states[order].extfrag_high;
+
+   if (extfrag > threshold)
+   return order;
+   }
+   return 0;
+}
+
+static bool node_should_compact(pg_data_t *pgdat)
+{
+   struct zone *zone;
+
+   for_each_populated_zone(zone) {
+   int order = extfrag_wmark_high(zone);
+
+   if (order && compaction_suitable(zone, order,
+   0, zone_idx(zone)) == COMPACT_CONTINUE) {
+   return true;
+  

Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-31 Thread Nitin Gupta


On 01/25/2018 01:13 PM, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
>>>> It's not really about memory scarcity but a more efficient use of it.
>>>> Applications may want hugepage benefits without requiring any changes to
>>>> app code which is what THP is supposed to provide, while still avoiding
>>>> memory bloat.
>>>>
>>> I read these links and find that there are mainly two complaints:
>>> 1. THP causes latency spikes, because direct compaction slows down THP
>>> allocation,
>>> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return 
>>> memory ranges smaller than
>>>THP size and fails because of THP.
>>>
>>> The first complain is not related to this patch.
>>
>> I'm trying to address many different THP issues and memory bloat is
>> first among them.
> 
> Expecting userspace to get this right is probably going to go sideways.
> It'll be screwed up and be sub-optimal or have odd semantics for existing
> madvise flags. The fact is that an application may not even know if it's
> going to be sparsely using memory in advance if it's a computation load
> modelling from unknown input data.
> 
> I suggest you read the old Talluri paper "Surpassing the TLB Performance
> of Superpages with Less Operating System Support" and pay attention to
> Section 4. There it discusses a page reservation scheme whereby on fault
> a naturally aligned set of base pages are reserved and only one correctly
> placed base page is inserted into the faulting address. It was tied into
> a hypothetical piece of hardware that doesn't exist to give best-effort
> support for superpages so it does not directly help you but the initial
> idea is sound. There are holes in the paper from todays perspective but
> it was written in the 90's.
> 
> From there, read "Transparent operating system support for superpages"
> by Navarro, particularly chapter 4 paying attention to the parts where
> it talks about opportunism and promotion threshold.
> 
> Superficially, it goes like this
> 
> 1. On fault, reserve a THP in the allocator and use one base page that
>is correctly-aligned for the faulting addresses. By correctly-aligned,
>I mean that you use base page whose offset would be naturally contiguous
>if it ever was part of a huge page.
> 2. On subsequent faults, attempt to use a base page that is naturally
>aligned to be a THP
> 3. When a "threshold" of base pages are inserted, allocate the remaining
>pages and promote it to a THP
> 4. If there is memory pressure, spill "reserved" pages into the main
>allocation pool and lose the opportunity to promote (which will need
>khugepaged to recover)
> 
> By definition, a promotion threshold of 1 would be the existing scheme
> of allocating a THP on the first fault and some users will want that. It
> also should be the default to avoid unexpected overhead.  For workloads
> where memory is being sparsely addressed and the increased overhead of
> THP is unwelcome then the threshold should be tuned higher with a maximum
> possible value of HPAGE_PMD_NR.
> 
> It's non-trivial to do this because at minimum a page fault has to check
> if there is a potential promotion candidate by checking the PTEs around
> the faulting address searching for a correctly-aligned base page that is
> already inserted. If there is, then check if the correctly aligned base
> page for the current faulting address is free and if so use it. It'll
> also then need to check the remaining PTEs to see if both the promotion
> threshold has been reached and if so, promote it to a THP (or else teach
> khugepaged to do an in-place promotion if possible). In other words,
> implementing the promotion threshold is both hard and it's not free.
> 
> However, if it did exist then the only tunable would be the "promotion
> threshold" and applications would not need any special awareness of their
> address space.
> 

I went through both references you mentioned and I really like the
idea of reservation-based hugepage allocation.  Navarro also extends
the idea to allow multiple hugepage sizes to be used (as supported by
the underlying hardware), which was next on my list of things to do for
THP.

So, please ignore this patch; I will work towards implementing the
ideas in these papers.

Thanks for the feedback.

Nitin


Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-25 Thread Nitin Gupta


On 01/24/2018 04:47 PM, Zi Yan wrote:
 With this change, whenever an application issues MADV_DONTNEED on a
 memory region, the region is marked as "space-efficient". For such
 regions, a hugepage is not immediately allocated on first write.
>>> Kirill didn't like it in the previous version and I do not like this
>>> either. You are adding a very subtle side effect which might completely
>>> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
>>> to free up unused memory. Now you have put it out of THP usage
>>> basically.
>>>
>> Userspace may want a region to be considered by khugepaged while opting
>> out of hugepage allocation on first touch. Asking userspace memory
>> allocators to have to track and reclaim unused parts of a THP allocated
>> hugepage does not seem right, as the kernel can use simple userspace
>> hints to avoid allocating extra memory in the first place.
>>
>> I agree that this patch is adding a subtle side-effect which may take
>> some applications by surprise. However, I often see the opposite too:
>> for many workloads, disabling THP is the first advice as this aggressive
>> allocation of hugepages on first touch is unexpected and is too
>> wasteful. For example:
>>
>> 1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
>> http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/
>>
>> 2) Disable THP on MongoDB
>> https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
>>
>> 3) Disable THP for Couchbase Server
>> https://blog.couchbase.com/often-overlooked-linux-os-tweaks/
>>
>> 4) Redis
>> http://antirez.com/news/84
>>
>>
>>> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
>>>
>> It's not really about memory scarcity but a more efficient use of it.
>> Applications may want hugepage benefits without requiring any changes to
>> app code which is what THP is supposed to provide, while still avoiding
>> memory bloat.
>>
> I read these links and find that there are mainly two complaints:
> 1. THP causes latency spikes, because direct compaction slows down THP
> allocation,
> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return 
> memory ranges smaller than
>THP size and fails because of THP.
>
> The first complain is not related to this patch.

I'm trying to address many different THP issues and memory bloat is
first among them.
> For second one, at least with recent kernels, MADV_DONTNEED splits THPs and 
> returns the memory range you
> specified in madvise(). Am I missing anything?
>

Yes, MADV_DONTNEED splits THPs and releases the requested range, but
this does not solve the issue of the aggressive
alloc-hugepage-on-first-touch policy of THP=madvise on MADV_HUGEPAGE
regions. Sure, some workloads may prefer that policy, but for
applications that don't, this patch gives them an option to hint to the
kernel to rely on gradual hugepage promotion via khugepaged only (and
not on first touch).

It's not good if an application has to track which parts of its
(implicitly allocated) hugepages are in use and which sub-parts are
free just so it can issue MADV_DONTNEED calls on them. That approach
really does not make THP "transparent" and requires a lot of mm
tracking code in userspace.

Nitin



Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-24 Thread Nitin Gupta
On 1/19/18 4:49 AM, Michal Hocko wrote:
> On Thu 18-01-18 15:33:16, Nitin Gupta wrote:
>> From: Nitin Gupta 
>>
>> Currently, if the THP enabled policy is "always", or the mode
>> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
>> is allocated on a page fault if the pud or pmd is empty.  This
>> yields the best VA translation performance, but increases memory
>> consumption if some small page ranges within the huge page are
>> never accessed.
> 
> Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always
> users.
>  

Yes, allocating a hugepage on first touch is the current behavior for
the above two cases. However, I see issues with this behavior.
Firstly, THP=always mode is often too aggressive/wasteful to be useful
for any realistic workload. For THP=madvise, users may want to back the
active parts of a memory region with hugepages while avoiding aggressive
hugepage allocation on first touch. Or, they may really want the current
behavior.

With this patch, users would have the option to pick what behavior they
want by passing hints to the kernel in the form of MADV_HUGEPAGE and
MADV_DONTNEED madvise calls.
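
As a concrete illustration of that usage under the proposed semantics (this is
a sketch of what the patch would allow, not of mainline behavior):

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64M arena */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Ask for THP on this range; with the patch this also clears any
	 * earlier "space-efficient" marking, i.e. hugepage on first touch. */
	madvise(p, len, MADV_HUGEPAGE);
	p[0] = 1;

	/* Free a sub-range; with the patch this additionally marks the VMA
	 * "space-efficient", so later first-touch faults get base pages and
	 * hugepages come back only via khugepaged promotion. */
	madvise(p, 2UL << 20, MADV_DONTNEED);

	return 0;
}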


>> An alternate behavior for such page faults is to install a
>> hugepage only when a region is actually found to be (almost)
>> fully mapped and active.  This is a compromise between
>> translation performance and memory consumption.  Currently there
>> is no way for an application to choose this compromise for the
>> page fault conditions above.
> 
> Is that really true? We have 
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> This is not reflected during the PF of course but you can control the
> behavior there as well. Either by the global setting or a per proces
> prctl.
> 

I think this part of the patch description needs some rewording. This patch
changes *only* the page fault behavior.

Once pages are installed, khugepaged does its job as usual, using
max_ptes_none and other config values. I'm not trying to change any
khugepaged behavior here.


>> With this change, whenever an application issues MADV_DONTNEED on a
>> memory region, the region is marked as "space-efficient". For such
>> regions, a hugepage is not immediately allocated on first write.
> 
> Kirill didn't like it in the previous version and I do not like this
> either. You are adding a very subtle side effect which might completely
> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
> to free up unused memory. Now you have put it out of THP usage
> basically.
>

Userspace may want a region to be considered by khugepaged while opting
out of hugepage allocation on first touch. Asking userspace memory
allocators to have to track and reclaim unused parts of a THP allocated
hugepage does not seem right, as the kernel can use simple userspace
hints to avoid allocating extra memory in the first place.

I agree that this patch is adding a subtle side-effect which may take
some applications by surprise. However, I often see the opposite too:
for many workloads, disabling THP is the first advice as this aggressive
allocation of hugepages on first touch is unexpected and is too
wasteful. For example:

1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/

2) Disable THP on MongoDB
https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

3) Disable THP for Couchbase Server
https://blog.couchbase.com/often-overlooked-linux-os-tweaks/

4) Redis
http://antirez.com/news/84


> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
> 

It's not really about memory scarcity but a more efficient use of it.
Applications may want hugepage benefits without requiring any changes to
app code which is what THP is supposed to provide, while still avoiding
memory bloat.

-Nitin


Re: [PATCH] mm: Reduce memory bloat with THP

2017-12-15 Thread Nitin Gupta
On 12/15/17 2:01 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 14, 2017 at 05:28:52PM -0800, Nitin Gupta wrote:
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 751e97a..b2ec07b 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -508,6 +508,7 @@ static long madvise_dontneed_single_vma(struct 
>> vm_area_struct *vma,
>>  unsigned long start, unsigned long end)
>>  {
>>  zap_page_range(vma, start, end - start);
>> +vma->space_efficient = true;
>>  return 0;
>>  }
>>  
> 
> And this modifies vma without down_write(mmap_sem).
> 

I thought this function was always called with mmap_sem held for write.
I will check again.

- Nitin




Re: [PATCH] mm: Reduce memory bloat with THP

2017-12-15 Thread Nitin Gupta
On 12/15/17 2:00 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 14, 2017 at 05:28:52PM -0800, Nitin Gupta wrote:
>> Currently, if the THP enabled policy is "always", or the mode
>> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
>> is allocated on a page fault if the pud or pmd is empty.  This
>> yields the best VA translation performance, but increases memory
>> consumption if some small page ranges within the huge page are
>> never accessed.
>>
>> An alternate behavior for such page faults is to install a
>> hugepage only when a region is actually found to be (almost)
>> fully mapped and active.  This is a compromise between
>> translation performance and memory consumption.  Currently there
>> is no way for an application to choose this compromise for the
>> page fault conditions above.
>>
>> With this change, when an application issues MADV_DONTNEED on a
>> memory region, the region is marked as "space-efficient". For
>> such regions, a hugepage is not immediately allocated on first
>> write.  Instead, it is left to the khugepaged thread to do
>> delayed hugepage promotion depending on whether the region is
>> actually mapped and active. When application issues
>> MADV_HUGEPAGE, the region is marked again as non-space-efficient
>> wherein hugepage is allocated on first touch.
> 
> I think this would be NAK. At least in this form.
> 
> What performance testing have you done? Any numbers?
> 

I wrote a throw-away program which mmaps a 128G area and writes to a random
address in a loop. Interleaved with the writes, madvise(MADV_DONTNEED) is
issued at other random addresses. Writes are issued with 70%
probability and DONTNEED with 30%. With this test, I'm trying to emulate
the workload of a large in-memory hash table.

With the patch, I see that memory bloat is much less severe.
I've uploaded the test program with the memory usage plot here:

https://gist.github.com/nitingupta910/42ddf969e17556d74a14fbd84640ddb3

THP was set to 'always' mode in both cases but the result would be the
same if madvise mode was used instead.
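
For reference, a minimal sketch of the kind of test described above (the
actual program and memory-usage plots are in the gist linked above):

#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 128UL << 30;	/* 128G of virtual space */
	size_t huge = 2UL << 20;	/* 2M chunks for MADV_DONTNEED */
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	madvise(p, size, MADV_HUGEPAGE);	/* only needed for THP=madvise */

	for (long i = 0; i < 100000000L; i++) {
		size_t off = (rand() % (size / huge)) * huge;

		if (rand() % 100 < 70)
			p[off + (rand() % huge)] = 1;		/* write */
		else
			madvise(p + off, huge, MADV_DONTNEED);	/* free */
	}
	return 0;
}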

> Making whole vma "space_efficient" just because somebody freed one page
> from it is just wrong. And there's no way back after this.
>

I'm using MADV_DONTNEED as a hint that the user wants to use hugepages
transparently but, at the same time, wants to be more conservative with
respect to memory usage. If MADV_HUGEPAGE is issued for a VMA range
after any DONTNEEDs, the space_efficient bit is cleared again, so we
revert back to allocating a hugepage on fault on an empty pud/pmd.

>>
>> Orabug: 26910556
> 
> Wat?
> 

It's an Oracle-internal identifier used to track this work.

Thanks,
Nitin



[PATCH] sparc64: Fix page table walk for PUD hugepages

2017-11-03 Thread Nitin Gupta
For a PUD hugepage entry, we need to propagate bits [32:22]
from the virtual address to resolve at 4M granularity. However,
the current code was incorrectly propagating bits [29:19].
This bug can cause incorrect data to be returned for pages
backed by 16G hugepages.

Signed-off-by: Nitin Gupta 
Reported-by: Al Viro 
Cc: Al Viro 

diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index acf55063aa3d..ca0de1646f1e 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -216,7 +216,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
sllxREG2, 32, REG2; \
andcc   REG1, REG2, %g0;\
be,pt   %xcc, 700f; \
-sethi  %hi(0x1ffc), REG2;  \
+sethi  %hi(0xffe0), REG2;  \
sllxREG2, 1, REG2;  \
brgez,pnREG1, FAIL_LABEL;   \
 andn   REG1, REG2, REG1;   \
-- 
2.13.1



Re: [PATCH 1/4] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-22 Thread Nitin Gupta
On Sun, Oct 22, 2017 at 8:10 PM, Minchan Kim  wrote:
> On Fri, Oct 20, 2017 at 10:59:31PM +0300, Kirill A. Shutemov wrote:
>> With boot-time switching between paging mode we will have variable
>> MAX_PHYSMEM_BITS.
>>
>> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
>> configuration to define zsmalloc data structures.
>>
>> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
>> It also suits well to handle PAE special case.
>>
>> Signed-off-by: Kirill A. Shutemov 
>> Cc: Minchan Kim 
>> Cc: Nitin Gupta 
>> Cc: Sergey Senozhatsky 
> Acked-by: Minchan Kim 
>
> Nitin:
>
> I think this patch works and it would be best for Kirill to be able to do.
> So if you have better idea to clean it up, let's make it as another patch
> regardless of this patch series.
>


I was looking into dynamically allocating the size_class array to avoid that
compile error, but yes, that can be done in a future patch. So, for this patch:

Reviewed-by: Nitin Gupta 


Re: [PATCH 1/4] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-20 Thread Nitin Gupta
On Fri, Oct 20, 2017 at 12:59 PM, Kirill A. Shutemov
 wrote:
> With boot-time switching between paging mode we will have variable
> MAX_PHYSMEM_BITS.
>
> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
> configuration to define zsmalloc data structures.
>
> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
> It also suits well to handle PAE special case.
>


I see that with your upcoming patch, MAX_PHYSMEM_BITS is turned into a
variable for the x86_64 case: (pgtable_l5_enabled ? 52 : 46).

Even with this change, I don't see a need for this new
MAX_POSSIBLE_PHYSMEM_BITS constant.


> -#ifndef MAX_PHYSMEM_BITS
> -#ifdef CONFIG_HIGHMEM64G
> -#define MAX_PHYSMEM_BITS 36
> -#else /* !CONFIG_HIGHMEM64G */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> +#ifdef MAX_PHYSMEM_BITS
> +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> +#else


This ifdef on HIGHMEM64G is redundant, as x86 already defines
MAX_PHYSMEM_BITS = 36 in the PAE case. So, all that zsmalloc should do is:

#ifndef MAX_PHYSMEM_BITS
#define MAX_PHYSMEM_BITS BITS_PER_LONG
#endif

... and then no change is needed for the rest of the derived constants, like _PFN_BITS.

It is up to every arch to define the correct MAX_PHYSMEM_BITS (variable or
constant) based on whatever configurations the arch supports. If it is not
defined, zsmalloc picks a reasonable default of BITS_PER_LONG.

I will send a patch which removes the ifdef on CONFIG_HIGHMEM64G.

Thanks,
Nitin


Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-18 Thread Nitin Gupta
On Mon, Oct 16, 2017 at 7:44 AM, Kirill A. Shutemov
 wrote:
> On Fri, Oct 13, 2017 at 05:00:12PM -0700, Nitin Gupta wrote:
>> On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov
>>  wrote:
>> > With boot-time switching between paging mode we will have variable
>> > MAX_PHYSMEM_BITS.
>> >
>> > Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
>> > configuration to define zsmalloc data structures.
>> >
>> > The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
>> > It also suits well to handle PAE special case.
>> >
>> > Signed-off-by: Kirill A. Shutemov 
>> > Cc: Minchan Kim 
>> > Cc: Nitin Gupta 
>> > Cc: Sergey Senozhatsky 
>> > ---
>> >  arch/x86/include/asm/pgtable-3level_types.h |  1 +
>> >  arch/x86/include/asm/pgtable_64_types.h |  2 ++
>> >  mm/zsmalloc.c   | 13 +++--
>> >  3 files changed, 10 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/arch/x86/include/asm/pgtable-3level_types.h 
>> > b/arch/x86/include/asm/pgtable-3level_types.h
>> > index b8a4341faafa..3fe1d107a875 100644
>> > --- a/arch/x86/include/asm/pgtable-3level_types.h
>> > +++ b/arch/x86/include/asm/pgtable-3level_types.h
>> > @@ -43,5 +43,6 @@ typedef union {
>> >   */
>> >  #define PTRS_PER_PTE   512
>> >
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS  36
>> >
>> >  #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
>> > diff --git a/arch/x86/include/asm/pgtable_64_types.h 
>> > b/arch/x86/include/asm/pgtable_64_types.h
>> > index 06470da156ba..39075df30b8a 100644
>> > --- a/arch/x86/include/asm/pgtable_64_types.h
>> > +++ b/arch/x86/include/asm/pgtable_64_types.h
>> > @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
>> >  #define P4D_SIZE   (_AC(1, UL) << P4D_SHIFT)
>> >  #define P4D_MASK   (~(P4D_SIZE - 1))
>> >
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS  52
>> > +
>> >  #else /* CONFIG_X86_5LEVEL */
>> >
>> >  /*
>> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> > index 7c38e850a8fc..7bde01c55c90 100644
>> > --- a/mm/zsmalloc.c
>> > +++ b/mm/zsmalloc.c
>> > @@ -82,18 +82,19 @@
>> >   * This is made more complicated by various memory models and PAE.
>> >   */
>> >
>> > -#ifndef MAX_PHYSMEM_BITS
>> > -#ifdef CONFIG_HIGHMEM64G
>> > -#define MAX_PHYSMEM_BITS 36
>> > -#else /* !CONFIG_HIGHMEM64G */
>> > +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
>> > +#ifdef MAX_PHYSMEM_BITS
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
>> > +#else
>> >  /*
>> >   * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will 
>> > just
>> >   * be PAGE_SHIFT
>> >   */
>> > -#define MAX_PHYSMEM_BITS BITS_PER_LONG
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>> >  #endif
>> >  #endif
>> > -#define _PFN_BITS  (MAX_PHYSMEM_BITS - PAGE_SHIFT)
>> > +
>> > +#define _PFN_BITS  (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>> >
>>
>>
>> I think we can avoid using this new constant in zsmalloc.
>>
>> The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
>> bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
>> for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
>> would remain 32 bytes.
>>
>> So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
>> thus OBJ_INDEX_BITS = PAGE_SHIFT.
>
> As you understand the topic better than me, could you prepare the patch?
>


Actually no changes are necessary.

As long as physical address bits <= BITS_PER_LONG, deriving
_PFN_BITS from the most conservative value, BITS_PER_LONG, is
fine. AFAIK, this condition does not hold on x86 PAE where PA
bits (36) > BITS_PER_LONG (32), so only that case needs special
handling to make sure PFN bits are not lost when encoding an
allocated object's location in an unsigned long.
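
To make the arithmetic explicit (an illustration of the argument above,
assuming 4K pages):

/*
 * x86 PAE worked example: physical addresses are 36 bits and PAGE_SHIFT
 * is 12, so a PFN needs 36 - 12 = 24 bits.  If zsmalloc instead assumed
 * MAX_PHYSMEM_BITS == BITS_PER_LONG == 32, it would reserve only
 * 32 - 12 = 20 bits for the PFN and could not encode pages above 4G.
 * That is the one case where the wider, PA-based constant matters.
 */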

Thanks,
Nitin


Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-13 Thread Nitin Gupta
On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov
 wrote:
> With boot-time switching between paging mode we will have variable
> MAX_PHYSMEM_BITS.
>
> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
> configuration to define zsmalloc data structures.
>
> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
> It also suits well to handle PAE special case.
>
> Signed-off-by: Kirill A. Shutemov 
> Cc: Minchan Kim 
> Cc: Nitin Gupta 
> Cc: Sergey Senozhatsky 
> ---
>  arch/x86/include/asm/pgtable-3level_types.h |  1 +
>  arch/x86/include/asm/pgtable_64_types.h |  2 ++
>  mm/zsmalloc.c   | 13 +++--
>  3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable-3level_types.h 
> b/arch/x86/include/asm/pgtable-3level_types.h
> index b8a4341faafa..3fe1d107a875 100644
> --- a/arch/x86/include/asm/pgtable-3level_types.h
> +++ b/arch/x86/include/asm/pgtable-3level_types.h
> @@ -43,5 +43,6 @@ typedef union {
>   */
>  #define PTRS_PER_PTE   512
>
> +#define MAX_POSSIBLE_PHYSMEM_BITS  36
>
>  #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> b/arch/x86/include/asm/pgtable_64_types.h
> index 06470da156ba..39075df30b8a 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
>  #define P4D_SIZE   (_AC(1, UL) << P4D_SHIFT)
>  #define P4D_MASK   (~(P4D_SIZE - 1))
>
> +#define MAX_POSSIBLE_PHYSMEM_BITS  52
> +
>  #else /* CONFIG_X86_5LEVEL */
>
>  /*
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 7c38e850a8fc..7bde01c55c90 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -82,18 +82,19 @@
>   * This is made more complicated by various memory models and PAE.
>   */
>
> -#ifndef MAX_PHYSMEM_BITS
> -#ifdef CONFIG_HIGHMEM64G
> -#define MAX_PHYSMEM_BITS 36
> -#else /* !CONFIG_HIGHMEM64G */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> +#ifdef MAX_PHYSMEM_BITS
> +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> +#else
>  /*
>   * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
>   * be PAGE_SHIFT
>   */
> -#define MAX_PHYSMEM_BITS BITS_PER_LONG
> +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>  #endif
>  #endif
> -#define _PFN_BITS  (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> +
> +#define _PFN_BITS  (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>


I think we can avoid using this new constant in zsmalloc.

The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
would remain 32 bytes.

So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
thus OBJ_INDEX_BITS = PAGE_SHIFT.

- Nitin


[PATCH v6 3/3] sparc64: Cleanup hugepage table walk functions

2017-08-11 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7acb84d..bcd8cdb 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH v6 1/3] sparc64: Support huge PUD case in get_user_pages

2017-08-11 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h | 15 +++--
 arch/sparc/mm/gup.c | 45 -
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2579f5a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index f80cfc6..d809099 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,45 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   page = pud_page(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   head = compound_head(page);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +180,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v6 2/3] sparc64: Add 16GB hugepage support

2017-08-11 Thread Nitin Gupta
Adds support for the 16GB hugepage size. To use this page size,
pass kernel parameters such as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Cc: Anthony Yznaga 
Reviewed-by: Bob Picco 
Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/hugetlb.h|  7 
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 36 ++
 arch/sparc/kernel/head_64.S |  2 +-
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/kernel/vmlinux.lds.S |  5 +++
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 54 +++
 9 files changed, 157 insertions(+), 31 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index d1f837d..0ca7caa 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct pud_huge_patch_entry {
+   unsigned int addr;
+   unsigned int insn;
+};
+extern struct pud_huge_patch_entry __pud_huge_patch, __pud_huge_patch_end;
+#endif
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t pte);
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2579f5a..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..acf5506 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,41 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+700:   ba 700f;\
+nop;   \
+   .section.pud_huge_patch, "ax";  \
+   .word   700b;   \
+   nop;\
+   .previous;  \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@

[PATCH v5 1/3] sparc64: Support huge PUD case in get_user_pages

2017-07-29 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h | 15 +++--
 arch/sparc/mm/gup.c | 45 -
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2579f5a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index f80cfc6..d809099 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,45 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   page = pud_page(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   head = compound_head(page);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +180,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v5 2/3] sparc64: Add 16GB hugepage support

2017-07-29 Thread Nitin Gupta
Adds support for the 16GB hugepage size. To use this page size,
pass kernel parameters such as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/hugetlb.h|  7 
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 36 ++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/kernel/vmlinux.lds.S |  5 +++
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 54 +++
 8 files changed, 156 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index d1f837d..0ca7caa 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct pud_huge_patch_entry {
+   unsigned int addr;
+   unsigned int insn;
+};
+extern struct pud_huge_patch_entry __pud_huge_patch, __pud_huge_patch_end;
+#endif
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t pte);
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2579f5a..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..acf5506 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,41 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+700:   ba 700f;\
+nop;   \
+   .section.pud_huge_patch, "ax";  \
+   .word   700b;   \
+   nop;\
+   .previous;  \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +277,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys

[PATCH v5 3/3] sparc64: Cleanup hugepage table walk functions

2017-07-29 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7acb84d..bcd8cdb 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



Re: [PATCH 2/3] sparc64: Add 16GB hugepage support

2017-07-26 Thread Nitin Gupta


On 07/20/2017 01:04 PM, David Miller wrote:
> From: Nitin Gupta 
> Date: Thu, 13 Jul 2017 14:53:24 -0700
> 
>> Testing:
>>
>> Tested with the stream benchmark which allocates 48G of
>> arrays backed by 16G hugepages and does RW operations on
>> them in parallel.
> 
> It would be great if we started adding tests under
> tools/testing/selftests so that other people can recreate
> your tests/benchmarks.
> 

Yes, I would like to add the stream benchmark to selftests too.
I will check if our internal version of stream can be released.


>> diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
>> index 32258e0..7b240a3 100644
>> --- a/arch/sparc/include/asm/tsb.h
>> +++ b/arch/sparc/include/asm/tsb.h
>> @@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
>> __tsb_phys_patch_end;
>>   nop; \
>>  699:
>>  
>> +/* PUD has been loaded into REG1, interpret the value, seeing
>> + * if it is a HUGE PUD or a normal one.  If it is not valid
>> + * then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
>> + * translates to a valid PTE, branch to PTE_LABEL.
>> + *
>> + * We have to propagate bits [32:22] from the virtual address
>> + * to resolve at 4M granularity.
>> + */
>> +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 
>> PTE_LABEL) \
>> +brz,pn  REG1, FAIL_LABEL;   \
>> + sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
>> +sllxREG2, 32, REG2; \
>> +andcc   REG1, REG2, %g0;\
>> +be,pt   %xcc, 700f; \
>> + sethi  %hi(0x1ffc), REG2;  \
>> +sllxREG2, 1, REG2;  \
>> +brgez,pnREG1, FAIL_LABEL;   \
>> + andn   REG1, REG2, REG1;   \
>> +and VADDR, REG2, REG2;  \
>> +brlz,pt REG1, PTE_LABEL;\
>> + or REG1, REG2, REG1;   \
>> +700:
>> +#else
>> +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 
>> PTE_LABEL) \
>> +brz,pn  REG1, FAIL_LABEL; \
>> + nop;
>> +#endif
>> +
>>  /* PMD has been loaded into REG1, interpret the value, seeing
>>   * if it is a HUGE PMD or a normal one.  If it is not valid
>>   * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
>> @@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
>> __tsb_phys_patch_end;
>>  srlxREG2, 64 - PAGE_SHIFT, REG2; \
>>  andnREG2, 0x7, REG2; \
>>  ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
>> +USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
>>  brz,pn  REG1, FAIL_LABEL; \
>>   sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
>>  srlxREG2, 64 - PAGE_SHIFT, REG2; \
> 
> This macro is getting way out of control, every TLB/TSB miss is
> going to invoke this sequence of code.
> 
> Yes, it's just a two cycle constant load, a test modifying the
> condition codes, and an easy to predict branch.
> 
> But every machine will eat this overhead, even if they don't use
> hugepages or don't set the 16GB knob.
> 
> I think we can do better, using code patching or similar.
> 
> Once the knob is set, you can know for sure that this code path
> will never actually be taken.

The simplest way I can think of is to add CONFIG_SPARC_16GB_HUGEPAGE
and exclude the PUD check if it is not enabled.  Would this be okay?

Thanks,
Nitin



[PATCH] sparc64: Register hugepages during arch init

2017-07-19 Thread Nitin Gupta
Add an hstate for each supported hugepage size using
an arch initcall. This change fixes some hugepage
parameter parsing inconsistencies:

case 1: no hugepage parameters

 Without hugepage parameters, only a hugepages-8192kB entry is visible
 in sysfs.  It's different from x86_64 where both 2M and 1G hugepage
 sizes are available.

case 2: default_hugepagesz=[64K|256M|2G]

 When specifying only a default_hugepagesz parameter, the default
 hugepage size isn't really changed and it stays at 8M. This is again
 different from x86_64.

Orabug: 25869946

Reviewed-by: Bob Picco 
Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/init_64.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 3c40ebd..fed73f1 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -325,6 +325,29 @@ static void __update_mmu_tsb_insert(struct mm_struct *mm, 
unsigned long tsb_inde
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
+static void __init add_huge_page_size(unsigned long size)
+{
+   unsigned int order;
+
+   if (size_to_hstate(size))
+   return;
+
+   order = ilog2(size) - PAGE_SHIFT;
+   hugetlb_add_hstate(order);
+}
+
+static int __init hugetlbpage_init(void)
+{
+   add_huge_page_size(1UL << HPAGE_64K_SHIFT);
+   add_huge_page_size(1UL << HPAGE_SHIFT);
+   add_huge_page_size(1UL << HPAGE_256MB_SHIFT);
+   add_huge_page_size(1UL << HPAGE_2GB_SHIFT);
+
+   return 0;
+}
+
+arch_initcall(hugetlbpage_init);
+
 static int __init setup_hugepagesz(char *string)
 {
unsigned long long hugepage_size;
@@ -364,7 +387,7 @@ static int __init setup_hugepagesz(char *string)
goto out;
}
 
-   hugetlb_add_hstate(hugepage_shift - PAGE_SHIFT);
+   add_huge_page_size(hugepage_size);
rc = 1;
 
 out:
-- 
2.9.2



[PATCH 2/3] sparc64: Add 16GB hugepage support

2017-07-13 Thread Nitin Gupta
Adds support for 16GB hugepage size. To use this page size
use kernel parameters as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2579f5a..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-   sethi   %uhi(_PAGE_PMD_HUGE), %g7
+   sethi   %uhi(_PAGE_PMD_HUGE | _PAGE_PUD_HUGE), %g7
sllx%g7, 32, %g7
 
andcc   %g5, %g7, %g0
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c

[PATCH 3/3] sparc64: Cleanup hugepage table walk functions

2017-07-13 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7acb84d..bcd8cdb 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH 1/3] sparc64: Support huge PUD case in get_user_pages

2017-07-13 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2579f5a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index f80cfc6..d777594 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v2] sparc64: Fix gup_huge_pmd

2017-06-22 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: with a, say, 256M hugepage,
a PMD can point to the start of any 8M region within that
page. The fix ensures that the correct head page is used
for any PMD huge page.

Cc: Julian Calaby 
Signed-off-by: Nitin Gupta 
---
 Changes since v1
 - Clarify use of 'head' variable (Julian Calaby)

 arch/sparc/mm/gup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..f80cfc6 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -78,8 +78,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
return 0;
 
refs = 0;
-   head = pmd_page(pmd);
-   page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   head = compound_head(page);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



Re: [PATCH] sparc64: Fix gup_huge_pmd

2017-06-22 Thread Nitin Gupta

Hi Julian,


On 6/22/17 3:53 AM, Julian Calaby wrote:

On Thu, Jun 22, 2017 at 7:50 AM, Nitin Gupta  wrote:

The function assumes that each PMD points to the head of a
huge page. This is not correct: with a, say, 256M hugepage,
a PMD can point to the start of any 8M region within that
page. The fix ensures that the correct head page is used
for any PMD huge page.

Signed-off-by: Nitin Gupta 
---
  arch/sparc/mm/gup.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..9116a6f 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
 refs = 0;
 head = pmd_page(pmd);
 page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);

Stupid question: shouldn't this go before the page calculation?


No, it should be after the page calculation: first, 'head' points to the
base of the PMD page, then 'page' points to an offset within that page.
Finally, we make sure that the 'head' variable points to the head of the
compound page which contains addr.

I think the confusion comes from using 'head' to point to a
non-head page. So maybe it would be clearer to write that part
of the function this way:

page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
head = compound_head(page);

Thanks,
Nitin



[PATCH] sparc64: Fix gup_huge_pmd

2017-06-21 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: with a, say, 256M hugepage,
a PMD can point to the start of any 8M region within that
page. The fix ensures that the correct head page is used
for any PMD huge page.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..9116a6f 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



[PATCH 3/4] sparc64: Fix gup_huge_pmd

2017-06-20 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: with a, say, 256M hugepage,
a PMD can point to the start of any 8M region within that
page. The fix ensures that the correct head page is used
for any PMD huge page.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



[PATCH 2/4] sparc64: Support huge PUD case in get_user_pages

2017-06-20 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2444b02..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..7cfa9c5 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH 1/4] sparc64: Add 16GB hugepage support

2017-06-20 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-   sethi   %uhi(_PAGE_PMD_HUGE), %g7
+   sethi   %uhi(_PAGE_PMD_HUGE | _PAGE_PUD_HUGE), %g7
sllx%g7, 32, %g7
 
andcc   %g5, %g7, %g0
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c

[PATCH 4/4] sparc64: Cleanup hugepage table walk functions

2017-06-20 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH v3 3/4] sparc64: Fix gup_huge_pmd

2017-06-19 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: with a, say, 256M hugepage,
a PMD can point to the start of any 8M region within that
page. The fix ensures that the correct head page is used
for any PMD huge page.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



[PATCH v3 4/4] sparc64: Cleanup hugepage table walk functions

2017-06-19 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH v3 2/4] sparc64: Support huge PUD case in get_user_pages

2017-06-19 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2444b02..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..7cfa9c5 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v3 1/4] sparc64: Add 16GB hugepage support

2017-06-19 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta 
---
Changelog v3 vs v2:
 - Fixed email headers so the subject shows up correctly

Changelog v2 vs v1:
 - Remove redundant brgez,pn (Bob Picco)
 - Remove unnecessary label rename from 700 to 701 (Rob Gardner)
 - Add patch description (Paul)
 - Add 16G case to get_user_pages()

arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)

Re: From: Nitin Gupta

2017-06-19 Thread Nitin Gupta
Please ignore this patch series. I will resend it with correct email
headers.


Nitin


On 6/19/17 2:48 PM, Nitin Gupta wrote:

Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta 
---

Changelog v2 vs v1:
  - Remove redundant brgez,pn (Bob Picco)
  - Remove unnecessary label rename from 700 to 701 (Rob Gardner)
  - Add patch description (Paul)
  - Add 16G case to get_user_pages()

  arch/sparc/include/asm/page_64.h|  3 +-
  arch/sparc/include/asm/pgtable_64.h |  5 +++
  arch/sparc/include/asm/tsb.h| 30 +++
  arch/sparc/kernel/tsb.S |  2 +-
  arch/sparc/mm/hugetlbpage.c | 74 ++---
  arch/sparc/mm/init_64.c | 41 
  6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
  
  #define HPAGE_SHIFT		23

  #define REAL_HPAGE_SHIFT  22
+#define HPAGE_16GB_SHIFT   34
  #define HPAGE_2GB_SHIFT   31
  #define HPAGE_256MB_SHIFT 28
  #define HPAGE_64K_SHIFT   16
@@ -28,7 +29,7 @@
  #define HUGETLB_PAGE_ORDER(HPAGE_SHIFT - PAGE_SHIFT)
  #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
  #define REAL_HPAGE_PER_HPAGE  (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
  #endif
  
  #ifndef __ASSEMBLY__

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
  }
  
+static inline bool is_hugetlb_pud(pud_t pud)

+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  static inline pmd_t pmd_mkhuge(pmd_t pmd)
  {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
  699:
  
+	/* PUD has been loaded into REG1, interpret the value, seeing

+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
  

[PATCH v2 3/4] sparc64: Fix gup_huge_pmd

2017-06-19 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: with a, say, 256M hugepage,
a PMD can point to the start of any 8M region within that
page. The fix ensures that the correct head page is used
for any PMD huge page.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



From: Nitin Gupta

2017-06-19 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta 
---

Changelog v2 vs v1:
 - Remove redundant brgez,pn (Bob Picco)
 - Remove unnecessary label rename from 700 to 701 (Rob Gardner)
 - Add patch description (Paul)
 - Add 16G case to get_user_pages()

 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-   sethi   %uhi(_PAGE_PMD_HUGE), %g7
+   sethi 

[PATCH v2 4/4] sparc64: Cleanup hugepage table walk functions

2017-06-19 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH v2 2/4] sparc64: Support huge PUD case in get_user_pages

2017-06-19 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2444b02..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..7cfa9c5 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



Re: [PATCH] sparc64: Add 16GB hugepage support

2017-05-24 Thread Nitin Gupta
On 5/24/17 8:45 PM, David Miller wrote:
> From: Paul Gortmaker 
> Date: Wed, 24 May 2017 23:34:42 -0400
> 
>> [[PATCH] sparc64: Add 16GB hugepage support] On 24/05/2017 (Wed 17:29) Nitin 
>> Gupta wrote:
>>
>>> Orabug: 25362942
>>>
>>> Signed-off-by: Nitin Gupta 
>>
>> If this wasn't an accidental git send-email misfire, then there should
>> be a long log indicating the use case, the performance increase, the
>> testing that was done, etc. etc. 
>>
>> Normally I'd not notice but since I was Cc'd I figured it was worth a
>> mention -- for example the vendor ID above doesn't mean a thing to
>> all the rest of us, hence why I suspect it was a git send-email misfire;
>> sadly, I think we've all accidentally done that at least once
> 
> Agreed.
> 
> No commit message whatsoever is basically unacceptable for something
> like this.
>

Ok, I will include usage, testing notes, performance numbers, etc., in
the v2 patch. Still, I do try to include "Orabug" for better tracking of
bugs internally; I hope that's okay.

Thanks,
Nitin



[PATCH] sparc64: Add 16GB hugepage support

2017-05-24 Thread Nitin Gupta
Orabug: 25362942

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 35 +-
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 128 insertions(+), 32 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..fbd8da7 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,36 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+sllx   REG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -209,14 +239,14 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 sethi  %uhi(_PAGE_PMD_HUGE), REG2; \
sllxREG2, 32, REG2; \
andcc   REG1, REG2, %g0;\
-   be,pt   %xcc, 700f; \
+   be,pt   %xcc, 701f; \
 sethi  %hi(4 * 1024 * 1024), REG2; \
brgez,pnREG1, FAIL_LABEL;   \
 andn   REG1, REG2, REG1;   \
and VADDR, REG2, REG2;  \
brlz,pt REG1, PTE_LABEL;\
 or REG1, REG2, REG1;   \
-700:
+701:
 #else
 #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
brz,pn  REG1, FAIL_LABEL; \
@@ -242,6 +272,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + P

[PATCH] sparc64: Fix mapping of 64k pages with MAP_FIXED

2017-05-15 Thread Nitin Gupta
An incorrect huge page alignment check caused
mmap failures for 64K pages when MAP_FIXED is used
with an address not aligned to HPAGE_SIZE.

Orabug: 25885991

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/hugetlb.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index dcbf985..d1f837d 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -24,9 +24,11 @@ static inline int is_hugepage_only_range(struct mm_struct 
*mm,
 static inline int prepare_hugepage_range(struct file *file,
unsigned long addr, unsigned long len)
 {
-   if (len & ~HPAGE_MASK)
+   struct hstate *h = hstate_file(file);
+
+   if (len & ~huge_page_mask(h))
return -EINVAL;
-   if (addr & ~HPAGE_MASK)
+   if (addr & ~huge_page_mask(h))
return -EINVAL;
return 0;
 }
-- 
2.9.2



[PATCH] sparc64: Fix hugepage page table free

2017-04-17 Thread Nitin Gupta
Make sure the start address is aligned to a PMD_SIZE
boundary when freeing the page tables backing a hugepage
region. The issue was causing segfaults when a region
backed by 64K pages was unmapped, since such a region
is in general not PMD_SIZE aligned.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ee5273a..7c29d38 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -461,6 +461,22 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
pgd_t *pgd;
unsigned long next;
 
+   addr &= PMD_MASK;
+   if (addr < floor) {
+   addr += PMD_SIZE;
+   if (!addr)
+   return;
+   }
+   if (ceiling) {
+   ceiling &= PMD_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1 > ceiling - 1)
+   end -= PMD_SIZE;
+   if (addr > end - 1)
+   return;
+
pgd = pgd_offset(tlb->mm, addr);
do {
next = pgd_addr_end(addr, end);
-- 
2.9.2



[PATCH] sparc64: Fix memory corruption when THP is enabled

2017-03-31 Thread Nitin Gupta
The memory corruption was happening due to incorrect
TLB/TSB flushing of hugepages: the TLB batch calls passed a
boolean where the actual page-size shift was expected, and
the TSB flush compared that shift against HPAGE_SHIFT instead
of REAL_HPAGE_SHIFT when deciding which TSB to flush.

Reported-by: David S. Miller 
Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/tlb.c | 6 +++---
 arch/sparc/mm/tsb.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index afda3bb..ee8066c 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -154,7 +154,7 @@ static void tlb_batch_pmd_scan(struct mm_struct *mm, 
unsigned long vaddr,
if (pte_val(*pte) & _PAGE_VALID) {
bool exec = pte_exec(*pte);
 
-   tlb_batch_add_one(mm, vaddr, exec, false);
+   tlb_batch_add_one(mm, vaddr, exec, PAGE_SHIFT);
}
pte++;
vaddr += PAGE_SIZE;
@@ -209,9 +209,9 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pte_t orig_pte = __pte(pmd_val(orig));
bool exec = pte_exec(orig_pte);
 
-   tlb_batch_add_one(mm, addr, exec, true);
+   tlb_batch_add_one(mm, addr, exec, REAL_HPAGE_SHIFT);
tlb_batch_add_one(mm, addr + REAL_HPAGE_SIZE, exec,
-   true);
+ REAL_HPAGE_SHIFT);
} else {
tlb_batch_pmd_scan(mm, addr, orig);
}
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index 0a04811..bedf08b 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -122,7 +122,7 @@ void flush_tsb_user(struct tlb_batch *tb)
 
spin_lock_irqsave(&mm->context.lock, flags);
 
-   if (tb->hugepage_shift < HPAGE_SHIFT) {
+   if (tb->hugepage_shift < REAL_HPAGE_SHIFT) {
base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb;
nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
@@ -155,7 +155,7 @@ void flush_tsb_user_page(struct mm_struct *mm, unsigned 
long vaddr,
 
spin_lock_irqsave(&mm->context.lock, flags);
 
-   if (hugepage_shift < HPAGE_SHIFT) {
+   if (hugepage_shift < REAL_HPAGE_SHIFT) {
base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb;
nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
-- 
2.9.2



[PATCH] sparc64: Add support for 2G hugepages

2017-03-09 Thread Nitin Gupta
Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/page_64.h | 3 ++-
 arch/sparc/mm/hugetlbpage.c  | 7 +++
 arch/sparc/mm/init_64.c  | 4 
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index f294dd4..5961b2d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
@@ -27,7 +28,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE3
+#define HUGE_MAX_HSTATE4
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 3016850..ee5273a 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -143,6 +143,10 @@ static pte_t sun4v_hugepage_shift_to_tte(pte_t entry, 
unsigned int shift)
pte_val(entry) = pte_val(entry) & ~_PAGE_SZALL_4V;
 
switch (shift) {
+   case HPAGE_2GB_SHIFT:
+   hugepage_size = _PAGE_SZ2GB_4V;
+   pte_val(entry) |= _PAGE_PMD_HUGE;
+   break;
case HPAGE_256MB_SHIFT:
hugepage_size = _PAGE_SZ256MB_4V;
pte_val(entry) |= _PAGE_PMD_HUGE;
@@ -183,6 +187,9 @@ static unsigned int sun4v_huge_tte_to_shift(pte_t entry)
unsigned int shift;
 
switch (tte_szbits) {
+   case _PAGE_SZ2GB_4V:
+   shift = HPAGE_2GB_SHIFT;
+   break;
case _PAGE_SZ256MB_4V:
shift = HPAGE_256MB_SHIFT;
break;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index ccd4553..3328043 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -337,6 +337,10 @@ static int __init setup_hugepagesz(char *string)
hugepage_shift = ilog2(hugepage_size);
 
switch (hugepage_shift) {
+   case HPAGE_2GB_SHIFT:
+   hv_pgsz_mask = HV_PGSZ_MASK_2GB;
+   hv_pgsz_idx = HV_PGSZ_IDX_2GB;
+   break;
case HPAGE_256MB_SHIFT:
hv_pgsz_mask = HV_PGSZ_MASK_256MB;
hv_pgsz_idx = HV_PGSZ_IDX_256MB;
-- 
2.9.2



[PATCH] sparc64: Fix size check in huge_pte_alloc

2017-03-03 Thread Nitin Gupta
Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 323bc6b..3016850 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -261,7 +261,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
if (!pmd)
return NULL;
 
-   if (sz == PMD_SHIFT)
+   if (sz >= PMD_SIZE)
pte = (pte_t *)pmd;
else
pte = pte_alloc_map(mm, pmd, addr);
-- 
2.9.2



[PATCH] sparc64: Fix build error in flush_tsb_user_page

2017-02-24 Thread Nitin Gupta
Patch "sparc64: Add 64K page size support"
unconditionally used __flush_huge_tsb_one_entry()
which is available only when hugetlb support is
enabled.

Another issue was incorrect TSB flushing for 64K
pages in flush_tsb_user().

Signed-off-by: Nitin Gupta 
---
 arch/sparc/mm/hugetlbpage.c |  5 +++--
 arch/sparc/mm/tsb.c | 20 
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 605bfce..e98a3f2 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -309,7 +309,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr,
 
addr &= ~(size - 1);
orig = *ptep;
-   orig_shift = pte_none(orig) ? PAGE_SIZE : huge_tte_to_shift(orig);
+   orig_shift = pte_none(orig) ? PAGE_SHIFT : huge_tte_to_shift(orig);
 
for (i = 0; i < nptes; i++)
ptep[i] = __pte(pte_val(entry) + (i << shift));
@@ -335,7 +335,8 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, 
unsigned long addr,
else
nptes = size >> PAGE_SHIFT;
 
-   hugepage_shift = pte_none(entry) ? PAGE_SIZE : huge_tte_to_shift(entry);
+   hugepage_shift = pte_none(entry) ? PAGE_SHIFT :
+   huge_tte_to_shift(entry);
 
if (pte_present(entry))
mm->context.hugetlb_pte_count -= nptes;
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index e39fc57..23479c3 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -120,12 +120,18 @@ void flush_tsb_user(struct tlb_batch *tb)
 
spin_lock_irqsave(&mm->context.lock, flags);
 
-   if (tb->hugepage_shift == PAGE_SHIFT) {
+   if (tb->hugepage_shift < HPAGE_SHIFT) {
base = (unsigned long) mm->context.tsb_block[MM_TSB_BASE].tsb;
nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
base = __pa(base);
-   __flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
+   if (tb->hugepage_shift == PAGE_SHIFT)
+   __flush_tsb_one(tb, PAGE_SHIFT, base, nentries);
+#if defined(CONFIG_HUGETLB_PAGE)
+   else
+   __flush_huge_tsb_one(tb, PAGE_SHIFT, base, nentries,
+tb->hugepage_shift);
+#endif
}
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
else if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
@@ -152,8 +158,14 @@ void flush_tsb_user_page(struct mm_struct *mm, unsigned 
long vaddr,
nentries = mm->context.tsb_block[MM_TSB_BASE].tsb_nentries;
if (tlb_type == cheetah_plus || tlb_type == hypervisor)
base = __pa(base);
-   __flush_huge_tsb_one_entry(base, vaddr, PAGE_SHIFT, nentries,
-  hugepage_shift);
+   if (hugepage_shift == PAGE_SHIFT)
+   __flush_tsb_one_entry(base, vaddr, PAGE_SHIFT,
+ nentries);
+#if defined(CONFIG_HUGETLB_PAGE)
+   else
+   __flush_huge_tsb_one_entry(base, vaddr, PAGE_SHIFT,
+  nentries, hugepage_shift);
+#endif
}
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
else if (mm->context.tsb_block[MM_TSB_HUGE].tsb) {
-- 
2.9.2



[PATCH] sparc64: Add 64K page size support

2017-02-06 Thread Nitin Gupta
This patch depends on:
[v6] sparc64: Multi-page size support

- Testing

Tested on Sonoma by running a stream benchmark instance which allocated
48G worth of 64K pages.

boot params: default_hugepagesz=64K hugepagesz=64K hugepages=1310720

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/page_64.h |  3 ++-
 arch/sparc/mm/hugetlbpage.c  | 54 
 arch/sparc/mm/init_64.c  |  4 +++
 arch/sparc/mm/tsb.c  |  5 ++--
 4 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index d76f38d..f294dd4 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -18,6 +18,7 @@
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
 #define HPAGE_256MB_SHIFT  28
+#define HPAGE_64K_SHIFT16
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,7 +27,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE2
+#define HUGE_MAX_HSTATE3
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 618a568..605bfce 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -149,6 +149,9 @@ static pte_t sun4v_hugepage_shift_to_tte(pte_t entry, 
unsigned int shift)
case HPAGE_SHIFT:
pte_val(entry) |= _PAGE_PMD_HUGE;
break;
+   case HPAGE_64K_SHIFT:
+   hugepage_size = _PAGE_SZ64K_4V;
+   break;
default:
WARN_ONCE(1, "unsupported hugepage shift=%u\n", shift);
}
@@ -185,6 +188,9 @@ static unsigned int sun4v_huge_tte_to_shift(pte_t entry)
case _PAGE_SZ4MB_4V:
shift = REAL_HPAGE_SHIFT;
break;
+   case _PAGE_SZ64K_4V:
+   shift = HPAGE_64K_SHIFT;
+   break;
default:
shift = PAGE_SHIFT;
break;
@@ -204,6 +210,9 @@ static unsigned int sun4u_huge_tte_to_shift(pte_t entry)
case _PAGE_SZ4MB_4U:
shift = REAL_HPAGE_SHIFT;
break;
+   case _PAGE_SZ64K_4U:
+   shift = HPAGE_64K_SHIFT;
+   break;
default:
shift = PAGE_SHIFT;
break;
@@ -241,12 +250,21 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 {
pgd_t *pgd;
pud_t *pud;
+   pmd_t *pmd;
pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
-   if (pud)
-   pte = (pte_t *)pmd_alloc(mm, pud, addr);
+   if (pud) {
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+
+   if (sz == PMD_SHIFT)
+   pte = (pte_t *)pmd;
+   else
+   pte = pte_alloc_map(mm, pmd, addr);
+   }
 
return pte;
 }
@@ -255,42 +273,52 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
 {
pgd_t *pgd;
pud_t *pud;
+   pmd_t *pmd;
pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
if (!pgd_none(*pgd)) {
pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud))
-   pte = (pte_t *)pmd_offset(pud, addr);
+   if (!pud_none(*pud)) {
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_none(*pmd)) {
+   if (is_hugetlb_pmd(*pmd))
+   pte = (pte_t *)pmd;
+   else
+   pte = pte_offset_map(pmd, addr);
+   }
+   }
}
+
return pte;
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t entry)
 {
-   unsigned int i, nptes, hugepage_shift;
+   unsigned int i, nptes, orig_shift, shift;
unsigned long size;
pte_t orig;
 
size = huge_tte_to_size(entry);
-   nptes = size >> PMD_SHIFT;
+   shift = size >= HPAGE_SIZE ? PMD_SHIFT : PAGE_SHIFT;
+   nptes = size >> shift;
 
if (!pte_present(*ptep) && pte_present(entry))
mm->context.hugetlb_pte_count += nptes;
 
addr &= ~(size - 1);
orig = *ptep;
-   hugepage_shift = pte_none(orig) ? PAGE_SIZE : huge_tte_to_shift(orig);
+   orig_shift = pte_none(orig) ? PAGE_SIZE : huge_tte_to_shift(orig);
 
for (i = 0; i < nptes; i++)
-   ptep[i] = __pte(pte_val(entry) + (i << PMD_SHIFT));

[PATCH v6] sparc64: Multi-page size support

2017-02-01 Thread Nitin Gupta
Add support for using multiple hugepage sizes simultaneously
on mainline. Currently, support for 256M has been added which
can be used along with 8M pages.

Page tables are set like this (e.g. for 256M page):
VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]
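
Worked out with the shifts this patch relies on (an illustration only,
not part of the change):

    256M / 8M = 1 << (HPAGE_256MB_SHIFT - HPAGE_SHIFT)      = 32 page table entries (x in [0, 31])
    256M / 4M = 1 << (HPAGE_256MB_SHIFT - REAL_HPAGE_SHIFT) = 64 TSB entries         (x in [0, 63])

each entry carrying the 256M size bit in its TTE.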

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta 
---
Changelog v6 vs v5:
 - Fix _flush_huge_tsb_one_entry: add correct offset to base vaddr
Changelog v4 vs v5:
 - Enable hugepage initialization on sun4u
Changelog v3 vs v4:
 - Remove incorrect WARN_ON in __flush_huge_tsb_one_entry()

Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)

Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when
   CONFIG_HUGETLB is not defined.
---
 arch/sparc/include/asm/page_64.h |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 +++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S  |  21 +
 arch/sparc/mm/hugetlbpage.c  | 160 +++
 arch/sparc/mm/init_64.c  |  42 -
 arch/sparc/mm/tlb.c  |  17 ++--
 arch/sparc/mm/tsb.c  |  44 --
 8 files changed, 253 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE2
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 314b668..7932a4a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+   struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
unsigned long mask;
 
@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+   return __pte(pte_val(pte) | __pte_default_huge_mask());
 }
 
-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-   return !!(pte_val(pte) & __pte_huge_mask());
+   unsigned long mask = __pte_default_huge_mask();
+
+   return (pte_val(pte) & mask) == mask;
 }
 
 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)
 
 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-  pte_t *ptep, pte_t orig, int fullmm);
+  pte_t *ptep, pte_t orig, int fullmm,
+  unsigned int hugepage_shift);
 
 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-   pte_t *ptep, pte_t orig, int fullmm)
+   pte_t *ptep, pte_t orig, int fullmm,
+   unsigned int hugepage_shift)
 {
/* It is more efficient to let flush_tlb_kernel_range()
 * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, 
unsigned long vaddr,
 * and SUN4V pte layout, so this inline test is fine.
 */
if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte

[PATCH v5] sparc64: Multi-page size support

2017-01-04 Thread Nitin Gupta
Add support for using multiple hugepage sizes simultaneously
on mainline. Currently, support for 256M has been added which
can be used along with 8M pages.

Page tables are set like this (e.g. for 256M page):
VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta 
---
Changelog v4 vs v5:
 - Enable hugepage initialization on sun4u (this patch has been
   tested only on sun4v).
Changelog v3 vs v4:
 - Remove incorrect WARN_ON in __flush_huge_tsb_one_entry()

Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)

Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when
   CONFIG_HUGETLB is not defined.
---
 arch/sparc/include/asm/page_64.h |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 +++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S  |  21 +
 arch/sparc/mm/hugetlbpage.c  | 160 +++
 arch/sparc/mm/init_64.c  |  42 -
 arch/sparc/mm/tlb.c  |  17 ++--
 arch/sparc/mm/tsb.c  |  44 --
 8 files changed, 253 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE2
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 314b668..7932a4a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+   struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
unsigned long mask;
 
@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+   return __pte(pte_val(pte) | __pte_default_huge_mask());
 }
 
-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-   return !!(pte_val(pte) & __pte_huge_mask());
+   unsigned long mask = __pte_default_huge_mask();
+
+   return (pte_val(pte) & mask) == mask;
 }
 
 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)
 
 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-  pte_t *ptep, pte_t orig, int fullmm);
+  pte_t *ptep, pte_t orig, int fullmm,
+  unsigned int hugepage_shift);
 
 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-   pte_t *ptep, pte_t orig, int fullmm)
+   pte_t *ptep, pte_t orig, int fullmm,
+   unsigned int hugepage_shift)
 {
/* It is more efficient to let flush_tlb_kernel_range()
 * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, 
unsigned long vaddr,
 * and SUN4V pte layout, so this inline test is fine.
 */
if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte_t orig = *ptep;
 
*ptep = pte;
-   ma

Re: [PATCH v4] sparc64: Multi-page size support

2017-01-03 Thread Nitin Gupta


On 12/27/2016 09:34 AM, David Miller wrote:
> From: Nitin Gupta 
> Date: Tue, 13 Dec 2016 10:03:18 -0800
> 
>> +static unsigned int sun4u_huge_tte_to_shift(pte_t entry)
>> +{
>> +unsigned long tte_szbits = pte_val(entry) & _PAGE_SZALL_4V;
>> +unsigned int shift;
>> +
>> +switch (tte_szbits) {
>> +case _PAGE_SZ256MB_4U:
>> +shift = HPAGE_256MB_SHIFT;
>> +break;
> 
> You added all the code necessary to do this on the sun4u chips that support
> 256MB TTEs, so you might as well enable it in the initialization code.
> 
> I'm pretty sure this is an UltraSPARC-IV and later feature.
> 

I added the sun4u-related changes just for completeness' sake. I don't
have access to a sun4u machine, so I can't be sure sun4u would work.
That's why that _PAGE_SZALL_4V typo escaped my notice.

I will enable setup_hugepagesz() for the non-hypervisor case and send a v5.

Thanks,
Nitin


[PATCH v4] sparc64: Multi-page size support

2016-12-13 Thread Nitin Gupta
Add support for using multiple hugepage sizes simultaneously
on mainline. Currently, support for 256M has been added which
can be used along with 8M pages.

Page tables are set like this (e.g. for 256M page):
VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta 
---
Changelog v3 vs v4:
 - Remove incorrect WARN_ON in __flush_huge_tsb_one_entry()

Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)

Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when
   CONFIG_HUGETLB is not defined.
---
 arch/sparc/include/asm/page_64.h |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 +++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S  |  21 +
 arch/sparc/mm/hugetlbpage.c  | 160 +++
 arch/sparc/mm/init_64.c  |  45 +-
 arch/sparc/mm/tlb.c  |  17 ++--
 arch/sparc/mm/tsb.c  |  44 --
 8 files changed, 256 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE2
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 314b668..7932a4a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+   struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
unsigned long mask;
 
@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+   return __pte(pte_val(pte) | __pte_default_huge_mask());
 }
 
-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-   return !!(pte_val(pte) & __pte_huge_mask());
+   unsigned long mask = __pte_default_huge_mask();
+
+   return (pte_val(pte) & mask) == mask;
 }
 
 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)
 
 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-  pte_t *ptep, pte_t orig, int fullmm);
+  pte_t *ptep, pte_t orig, int fullmm,
+  unsigned int hugepage_shift);
 
 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-   pte_t *ptep, pte_t orig, int fullmm)
+   pte_t *ptep, pte_t orig, int fullmm,
+   unsigned int hugepage_shift)
 {
/* It is more efficient to let flush_tlb_kernel_range()
 * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, 
unsigned long vaddr,
 * and SUN4V pte layout, so this inline test is fine.
 */
if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte_t orig = *ptep;
 
*ptep = pte;
-   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PA

Re: [PATCH v3] sparc64: Multi-page size support

2016-12-12 Thread Nitin Gupta


On 12/11/2016 06:14 PM, David Miller wrote:
> From: David Miller 
> Date: Sun, 11 Dec 2016 21:06:30 -0500 (EST)
> 
>> Applied.
> 
> Actually, I'm reverting.
> 
> Just doing a simply "make -s -j128" kernel build on a T4-2 I'm
> getting kernel log warnings:
> 
> [2024810.925975] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [2024909.011397] random: crng init done
> [2024970.860642] [ cut here ]
> [2024970.869970] WARNING: CPU: 85 PID: 19335 at arch/sparc/mm/tsb.c:99 
> __flush_huge_tsb_one_entry.constprop.0+0x30/0x74
> [2024970.890947] Modules linked in: ipv6 usb_storage loop ehci_pci sg sr_mod 
> igb ptp pps_core ehci_hcd n2_rng rng_core
> [2024970.911785] CPU: 85 PID: 19335 Comm: ld Not tainted 4.9.0+ #9
> [2024970.923588] Call Trace:
> [2024970.928807]  [00463a2c] __warn+0xb0/0xc8
> [2024970.938349]  [0044efa4] 
> __flush_huge_tsb_one_entry.constprop.0+0x30/0x74
> [2024970.953463]  [0044f224] flush_tsb_user_page+0x88/0x9c
> [2024970.965268]  [0044eabc] tlb_batch_add_one+0x5c/0xd4
> [2024970.976722]  [0044ed84] set_pmd_at+0x104/0x184
> [2024970.987329]  [00552594] zap_huge_pmd+0x30/0x244
> [2024970.998097]  [0052a7a8] unmap_page_range+0x18c/0x794
> [2024971.009721]  [0052b05c] unmap_vmas+0x18/0x44
> [2024971.019976]  [005315f8] exit_mmap+0x94/0x114
> [2024971.030207]  [00461930] mmput+0x50/0xf8
> [2024971.039593]  [004674ac] do_exit+0x310/0x904
> [2024971.049651]  [00467c10] do_group_exit+0x80/0xbc
> [2024971.060415]  [00467c60] SyS_exit_group+0x14/0x20
> [2024971.071363]  [00406194] linux_sparc_syscall32+0x34/0x60
> 
> which is:
> 
> #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
> static void __flush_huge_tsb_one_entry(unsigned long tsb, unsigned long v,
>unsigned long hash_shift,
>unsigned long nentries,
>unsigned int hugepage_shift)
> {
> unsigned int hpage_entries;
> unsigned int i;
> 
> hpage_entries = 1 << (hugepage_shift - REAL_HPAGE_SHIFT);
> WARN_ON(v & ((1ul << hugepage_shift) - 1));
> ^^^


The warning is getting triggered because tlb_batch_add_one() does
'vaddr |= 1' when the address is executable.
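
Spelled out (the exec test is paraphrased from tlb_batch_add_one(), the
WARN_ON is the one quoted above):

    if (exec)
            vaddr |= 1UL;   /* exec hint is stashed in the low address bit */

    WARN_ON(v & ((1ul << hugepage_shift) - 1));   /* ... so this alignment
                                                     check trips on the hint
                                                     bit, not on a genuinely
                                                     misaligned mapping */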

I originally tested the patch with stream, which allocates pages
only for the heap. Somehow, I cannot reproduce it (on a Sonoma)
even when I back text segments with hugepages using:

hugectl --force-preload --heap=8m --text=8m make -s -j128


I added this WARN_ON during debugging and it can simply be removed.
Do you want to see a v4 with this warning removed, or can you reapply
with just this change?

Thanks,
Nitin


[PATCH v3] sparc64: Multi-page size support

2016-11-22 Thread Nitin Gupta
Add support for using multiple hugepage sizes simultaneously
on mainline. Currently, support for 256M has been added which
can be used along with 8M pages.

Page tables are set like this (e.g. for 256M page):
VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta 
---

Changelog v2 vs v3:
 - Remove unused label in tsb.S (David)
 - Order local variables from longest to shortest line (David)

Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when
   CONFIG_HUGETLB is not defined.
---
 arch/sparc/include/asm/page_64.h |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 +++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S  |  21 +
 arch/sparc/mm/hugetlbpage.c  | 160 +++
 arch/sparc/mm/init_64.c  |  45 +-
 arch/sparc/mm/tlb.c  |  17 ++--
 arch/sparc/mm/tsb.c  |  45 --
 8 files changed, 257 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE2
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 1fb317f..96005b0 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+   struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
unsigned long mask;
 
@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+   return __pte(pte_val(pte) | __pte_default_huge_mask());
 }
 
-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-   return !!(pte_val(pte) & __pte_huge_mask());
+   unsigned long mask = __pte_default_huge_mask();
+
+   return (pte_val(pte) & mask) == mask;
 }
 
 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)
 
 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-  pte_t *ptep, pte_t orig, int fullmm);
+  pte_t *ptep, pte_t orig, int fullmm,
+  unsigned int hugepage_shift);
 
 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-   pte_t *ptep, pte_t orig, int fullmm)
+   pte_t *ptep, pte_t orig, int fullmm,
+   unsigned int hugepage_shift)
 {
/* It is more efficient to let flush_tlb_kernel_range()
 * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, 
unsigned long vaddr,
 * and SUN4V pte layout, so this inline test is fine.
 */
if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte_t orig = *ptep;
 
*ptep = pte;
-   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
 #define set_pte_at(mm,addr,ptep,pte)   \
diff --git a/arch/s

[PATCH] sparc64: Make SLUB the default allocator

2016-10-26 Thread Nitin Gupta
SLUB has better debugging support.
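
As an illustrative aside (not part of this patch): with SLUB selected,
allocator debugging can be enabled at boot without a rebuild, e.g.

    slub_debug=FZPU            (sanity checks, red zoning, poisoning, user tracking)
    slub_debug=P,kmalloc-64    (poisoning for a single cache)

provided CONFIG_SLUB_DEBUG=y, which this defconfig change also sets.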

Signed-off-by: Nitin Gupta 
---
 arch/sparc/configs/sparc64_defconfig | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/configs/sparc64_defconfig 
b/arch/sparc/configs/sparc64_defconfig
index 3583d67..0a615b0 100644
--- a/arch/sparc/configs/sparc64_defconfig
+++ b/arch/sparc/configs/sparc64_defconfig
@@ -7,7 +7,9 @@ CONFIG_LOG_BUF_SHIFT=18
 CONFIG_BLK_DEV_INITRD=y
 CONFIG_PERF_EVENTS=y
 # CONFIG_COMPAT_BRK is not set
-CONFIG_SLAB=y
+CONFIG_SLUB_DEBUG=y
+CONFIG_SLUB=y
+CONFIG_SLUB_CPU_PARTIAL=y
 CONFIG_PROFILING=y
 CONFIG_OPROFILE=m
 CONFIG_KPROBES=y
-- 
2.9.2



[PATCH v2] sparc64: Multi-page size support

2016-10-12 Thread Nitin Gupta
Add support for using multiple hugepage sizes simultaneously
on mainline. Currently, support for 256M has been added which
can be used along with 8M pages.

Page tables are set like this (e.g. for 256M page):
VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta 
---

Changelog v1 vs v2:
 - Fix warning due to unused __flush_huge_tsb_one() when
   CONFIG_HUGETLB is not defined.

arch/sparc/include/asm/page_64.h |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 +++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S  |  21 +
 arch/sparc/mm/hugetlbpage.c  | 158 +++
 arch/sparc/mm/init_64.c  |  45 +-
 arch/sparc/mm/tlb.c  |  17 ++--
 arch/sparc/mm/tsb.c  |  44 --
 8 files changed, 254 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE2
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 1fb317f..96005b0 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+   struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
unsigned long mask;
 
@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+   return __pte(pte_val(pte) | __pte_default_huge_mask());
 }
 
-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-   return !!(pte_val(pte) & __pte_huge_mask());
+   unsigned long mask = __pte_default_huge_mask();
+
+   return (pte_val(pte) & mask) == mask;
 }
 
 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)
 
 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-  pte_t *ptep, pte_t orig, int fullmm);
+  pte_t *ptep, pte_t orig, int fullmm,
+  unsigned int hugepage_shift);
 
 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-   pte_t *ptep, pte_t orig, int fullmm)
+   pte_t *ptep, pte_t orig, int fullmm,
+   unsigned int hugepage_shift)
 {
/* It is more efficient to let flush_tlb_kernel_range()
 * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, 
unsigned long vaddr,
 * and SUN4V pte layout, so this inline test is fine.
 */
if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte_t orig = *ptep;
 
*ptep = pte;
-   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
 #define set_pte_at(mm,addr,ptep,pte)   \
diff --git a/arch/sparc/include/asm/tlbflush_64.h 
b/arch/sparc/include/asm/tlbflush_64.h
index a8e192e..54be88a 100644
--- a/arch/sparc/include/asm/

[PATCH] sparc64: Multi-page size support

2016-10-12 Thread Nitin Gupta
Add support for using multiple hugepage sizes simultaneously
on mainline. Currently, support for 256M has been added which
can be used along with 8M pages.

Page tables are set like this (e.g. for 256M page):
VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=1

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/page_64.h |   3 +-
 arch/sparc/include/asm/pgtable_64.h  |  23 +++--
 arch/sparc/include/asm/tlbflush_64.h |   5 +-
 arch/sparc/kernel/tsb.S  |  21 +
 arch/sparc/mm/hugetlbpage.c  | 158 +++
 arch/sparc/mm/init_64.c  |  45 +-
 arch/sparc/mm/tlb.c  |  17 ++--
 arch/sparc/mm/tsb.c  |  42 --
 8 files changed, 252 insertions(+), 62 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index c1263fc..d76f38d 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,7 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
-
+#define HPAGE_256MB_SHIFT  28
 #define REAL_HPAGE_SIZE(_AC(1,UL) << REAL_HPAGE_SHIFT)
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -26,6 +26,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
+#define HUGE_MAX_HSTATE2
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 1fb317f..96005b0 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,10 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline unsigned long __pte_huge_mask(void)
+extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+   struct page *page, int writable);
+#define arch_make_huge_pte arch_make_huge_pte
+static inline unsigned long __pte_default_huge_mask(void)
 {
unsigned long mask;
 
@@ -395,12 +398,14 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
+   return __pte(pte_val(pte) | __pte_default_huge_mask());
 }
 
-static inline bool is_hugetlb_pte(pte_t pte)
+static inline bool is_default_hugetlb_pte(pte_t pte)
 {
-   return !!(pte_val(pte) & __pte_huge_mask());
+   unsigned long mask = __pte_default_huge_mask();
+
+   return (pte_val(pte) & mask) == mask;
 }
 
 static inline bool is_hugetlb_pmd(pmd_t pmd)
@@ -875,10 +880,12 @@ static inline unsigned long pud_pfn(pud_t pud)
 
 /* Actual page table PTE updates.  */
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-  pte_t *ptep, pte_t orig, int fullmm);
+  pte_t *ptep, pte_t orig, int fullmm,
+  unsigned int hugepage_shift);
 
 static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
-   pte_t *ptep, pte_t orig, int fullmm)
+   pte_t *ptep, pte_t orig, int fullmm,
+   unsigned int hugepage_shift)
 {
/* It is more efficient to let flush_tlb_kernel_range()
 * handle init_mm tlb flushes.
@@ -887,7 +894,7 @@ static void maybe_tlb_batch_add(struct mm_struct *mm, 
unsigned long vaddr,
 * and SUN4V pte layout, so this inline test is fine.
 */
if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm, hugepage_shift);
 }
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
@@ -906,7 +913,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte_t orig = *ptep;
 
*ptep = pte;
-   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
+   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
 #define set_pte_at(mm,addr,ptep,pte)   \
diff --git a/arch/sparc/include/asm/tlbflush_64.h 
b/arch/sparc/include/asm/tlbflush_64.h
index a8e192e..54be88a 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -8,7 +8,7 @@
 #define TLB_BATCH_NR   192
 

[PATCH v2] sparc64: Trim page tables for 8M hugepages

2016-07-29 Thread Nitin Gupta
For PMD-aligned (8M) hugepages, we currently allocate
all four page table levels, which is wasteful. We now
allocate only up to the PMD level, which saves the memory
otherwise spent on PTE pages.

Also, when freeing page tables for an 8M hugepage-backed region,
make sure we don't try to access the non-existent PTE level.
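
The resulting allocation path (a condensed sketch of the new
huge_pte_alloc() in the diff below, not additional code) stops at the PMD
and hands the PMD slot back as the huge PTE:

    pgd = pgd_offset(mm, addr);
    pud = pud_alloc(mm, pgd, addr);
    if (pud)
            pte = (pte_t *)pmd_alloc(mm, pud, addr);   /* no PTE page for 8M mappings */

so the fourth (PTE) level is never populated for these regions.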

Orabug: 22630259

Signed-off-by: Nitin Gupta 
---
Changelog v2 vs v1
 - Combine fix for page table freeing with the main trimming patch (Dave)

arch/sparc/include/asm/hugetlb.h|   12 +--
 arch/sparc/include/asm/pgtable_64.h |7 ++-
 arch/sparc/include/asm/tsb.h|2 +-
 arch/sparc/mm/fault_64.c|4 +-
 arch/sparc/mm/hugetlbpage.c |  166 +++---
 arch/sparc/mm/init_64.c |5 +-
 6 files changed, 129 insertions(+), 67 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index 139e711..dcbf985 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -31,14 +31,6 @@ static inline int prepare_hugepage_range(struct file *file,
return 0;
 }
 
-static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
- unsigned long addr, unsigned long end,
- unsigned long floor,
- unsigned long ceiling)
-{
-   free_pgd_range(tlb, addr, end, floor, ceiling);
-}
-
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 unsigned long addr, pte_t *ptep)
 {
@@ -82,4 +74,8 @@ static inline void arch_clear_hugepage_flags(struct page 
*page)
 {
 }
 
+void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
+   unsigned long end, unsigned long floor,
+   unsigned long ceiling);
+
 #endif /* _ASM_SPARC64_HUGETLB_H */
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index e7d8280..1fb317f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -395,7 +395,7 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | __pte_huge_mask());
+   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
 }
 
 static inline bool is_hugetlb_pte(pte_t pte)
@@ -403,6 +403,11 @@ static inline bool is_hugetlb_pte(pte_t pte)
return !!(pte_val(pte) & __pte_huge_mask());
 }
 
+static inline bool is_hugetlb_pmd(pmd_t pmd)
+{
+   return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index c6a155c..32258e0 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -203,7 +203,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 * We have to propagate the 4MB bit of the virtual address
 * because we are fabricating 8MB pages using 4MB hw pages.
 */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
 #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
brz,pn  REG1, FAIL_LABEL;   \
 sethi  %uhi(_PAGE_PMD_HUGE), REG2; \
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 6c43b92..575ecfe 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -111,8 +111,8 @@ static unsigned int get_user_insn(unsigned long tpc)
if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
goto out_irq_enable;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   if (pmd_trans_huge(*pmdp)) {
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+   if (is_hugetlb_pmd(*pmdp)) {
pa  = pmd_pfn(*pmdp) << PAGE_SHIFT;
pa += tpc & ~HPAGE_MASK;
 
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ba52e64..494c390 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -131,23 +132,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 {
pgd_t *pgd;
pud_t *pud;
-   pmd_t *pmd;
pte_t *pte = NULL;
 
-   /* We must align the address, because our caller will run
-* set_huge_pte_at() on whatever we return, which writes out
-* all of the sub-ptes for the hugepage range.  So we have
-* to give it the first such sub-pte.
-*/
-   addr &= HPAGE_MASK;
-
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
-   if (pud) {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (pmd)
-   pte = 

[PATCH v2 2/2] sparc64: Fix pagetable freeing for hugepage regions

2016-06-23 Thread Nitin Gupta
8M pages now allocate page tables only up to the PMD level.
So, when freeing page tables for an 8M hugepage-backed region,
make sure we don't try to access the non-existent PTE level.
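
The key branch in the new hugetlb_free_pmd_range() below (condensed here
for emphasis, not new code) is:

    if (is_hugetlb_pmd(*pmd))
            pmd_clear(pmd);                        /* huge PMD: nothing below it to free */
    else
            hugetlb_free_pte_range(tlb, pmd, addr);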

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/hugetlb.h |   12 ++---
 arch/sparc/mm/hugetlbpage.c  |   98 ++
 2 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index 139e711..dcbf985 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -31,14 +31,6 @@ static inline int prepare_hugepage_range(struct file *file,
return 0;
 }
 
-static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
- unsigned long addr, unsigned long end,
- unsigned long floor,
- unsigned long ceiling)
-{
-   free_pgd_range(tlb, addr, end, floor, ceiling);
-}
-
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 unsigned long addr, pte_t *ptep)
 {
@@ -82,4 +74,8 @@ static inline void arch_clear_hugepage_flags(struct page 
*page)
 {
 }
 
+void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
+   unsigned long end, unsigned long floor,
+   unsigned long ceiling);
+
 #endif /* _ASM_SPARC64_HUGETLB_H */
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index cafb5ca..494c390 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -202,3 +203,100 @@ int pud_huge(pud_t pud)
 {
return 0;
 }
+
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+  unsigned long addr)
+{
+   pgtable_t token = pmd_pgtable(*pmd);
+
+   pmd_clear(pmd);
+   pte_free_tlb(tlb, token, addr);
+   atomic_long_dec(&tlb->mm->nr_ptes);
+}
+
+static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+  unsigned long addr, unsigned long end,
+  unsigned long floor, unsigned long ceiling)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   unsigned long start;
+
+   start = addr;
+   pmd = pmd_offset(pud, addr);
+   do {
+   next = pmd_addr_end(addr, end);
+   if (pmd_none(*pmd))
+   continue;
+   if (is_hugetlb_pmd(*pmd))
+   pmd_clear(pmd);
+   else
+   hugetlb_free_pte_range(tlb, pmd, addr);
+   } while (pmd++, addr = next, addr != end);
+
+   start &= PUD_MASK;
+   if (start < floor)
+   return;
+   if (ceiling) {
+   ceiling &= PUD_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1 > ceiling - 1)
+   return;
+
+   pmd = pmd_offset(pud, start);
+   pud_clear(pud);
+   pmd_free_tlb(tlb, pmd, start);
+   mm_dec_nr_pmds(tlb->mm);
+}
+
+static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+  unsigned long addr, unsigned long end,
+  unsigned long floor, unsigned long ceiling)
+{
+   pud_t *pud;
+   unsigned long next;
+   unsigned long start;
+
+   start = addr;
+   pud = pud_offset(pgd, addr);
+   do {
+   next = pud_addr_end(addr, end);
+   if (pud_none_or_clear_bad(pud))
+   continue;
+   hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
+  ceiling);
+   } while (pud++, addr = next, addr != end);
+
+   start &= PGDIR_MASK;
+   if (start < floor)
+   return;
+   if (ceiling) {
+   ceiling &= PGDIR_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1 > ceiling - 1)
+   return;
+
+   pud = pud_offset(pgd, start);
+   pgd_clear(pgd);
+   pud_free_tlb(tlb, pud, start);
+}
+
+void hugetlb_free_pgd_range(struct mmu_gather *tlb,
+   unsigned long addr, unsigned long end,
+   unsigned long floor, unsigned long ceiling)
+{
+   pgd_t *pgd;
+   unsigned long next;
+
+   pgd = pgd_offset(tlb->mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   if (pgd_none_or_clear_bad(pgd))
+   continue;
+   hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+   } while (pgd++, addr = next, addr != end);
+}
-- 
1.7.1



[PATCH v2 1/2] sparc64: Trim page tables for 8M hugepages

2016-06-23 Thread Nitin Gupta
For PMD-aligned (8M) hugepages, we currently allocate
all four page table levels, which is wasteful. We now
allocate only up to the PMD level, which saves the memory
otherwise spent on PTE pages.

Orabug: 22630259

Signed-off-by: Nitin Gupta 
---

Changelog v2 vs v1:
 - Move sparc specific declaration of hugetlb_free_pgd_range
   to arch specific hugetlb.h header.

 arch/sparc/include/asm/pgtable_64.h |7 +++-
 arch/sparc/include/asm/tsb.h|2 +-
 arch/sparc/mm/fault_64.c|4 +-
 arch/sparc/mm/hugetlbpage.c |   68 +++---
 arch/sparc/mm/init_64.c |5 ++-
 5 files changed, 27 insertions(+), 59 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index e7d8280..1fb317f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -395,7 +395,7 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | __pte_huge_mask());
+   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
 }
 
 static inline bool is_hugetlb_pte(pte_t pte)
@@ -403,6 +403,11 @@ static inline bool is_hugetlb_pte(pte_t pte)
return !!(pte_val(pte) & __pte_huge_mask());
 }
 
+static inline bool is_hugetlb_pmd(pmd_t pmd)
+{
+   return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index c6a155c..32258e0 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -203,7 +203,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 * We have to propagate the 4MB bit of the virtual address
 * because we are fabricating 8MB pages using 4MB hw pages.
 */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
 #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
brz,pn  REG1, FAIL_LABEL;   \
 sethi  %uhi(_PAGE_PMD_HUGE), REG2; \
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a3..ff3f9f9 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -111,8 +111,8 @@ static unsigned int get_user_insn(unsigned long tpc)
if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
goto out_irq_enable;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   if (pmd_trans_huge(*pmdp)) {
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+   if (is_hugetlb_pmd(*pmdp)) {
pa  = pmd_pfn(*pmdp) << PAGE_SHIFT;
pa += tpc & ~HPAGE_MASK;
 
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ba52e64..cafb5ca 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -131,23 +131,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 {
pgd_t *pgd;
pud_t *pud;
-   pmd_t *pmd;
pte_t *pte = NULL;
 
-   /* We must align the address, because our caller will run
-* set_huge_pte_at() on whatever we return, which writes out
-* all of the sub-ptes for the hugepage range.  So we have
-* to give it the first such sub-pte.
-*/
-   addr &= HPAGE_MASK;
-
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
-   if (pud) {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (pmd)
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
+   if (pud)
+   pte = (pte_t *)pmd_alloc(mm, pud, addr);
+
return pte;
 }
 
@@ -155,19 +145,13 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
 {
pgd_t *pgd;
pud_t *pud;
-   pmd_t *pmd;
pte_t *pte = NULL;
 
-   addr &= HPAGE_MASK;
-
pgd = pgd_offset(mm, addr);
if (!pgd_none(*pgd)) {
pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd))
-   pte = pte_offset_map(pmd, addr);
-   }
+   if (!pud_none(*pud))
+   pte = (pte_t *)pmd_offset(pud, addr);
}
return pte;
 }
@@ -175,67 +159,43 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t entry)
 {
-   int i;
-   pte_t orig[2];
-   unsigned long nptes;
+   pte_t orig;
 
if (!pte_present(*ptep) && pte_present(entry))
mm->context.huge_pte_count++;
 
addr &= HPAGE_MASK;
-
-   nptes = 1 << HUGETLB_PAGE_ORDER;
-   orig[0] = *ptep;
-   orig[

[PATCH 2/2] sparc64: Fix pagetable freeing for hugepage regions

2016-06-22 Thread Nitin Gupta
8M pages now allocate page tables only up to the PMD level.
So, when freeing page tables for an 8M hugepage-backed region,
make sure we don't try to access the non-existent PTE level.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/hugetlb.h |  8 
 arch/sparc/mm/hugetlbpage.c  | 98 
 include/linux/hugetlb.h  |  4 ++
 3 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index 139e711..1a6708c 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -31,14 +31,6 @@ static inline int prepare_hugepage_range(struct file *file,
return 0;
 }
 
-static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
- unsigned long addr, unsigned long end,
- unsigned long floor,
- unsigned long ceiling)
-{
-   free_pgd_range(tlb, addr, end, floor, ceiling);
-}
-
 static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
 unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index cafb5ca..494c390 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -202,3 +203,100 @@ int pud_huge(pud_t pud)
 {
return 0;
 }
+
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+  unsigned long addr)
+{
+   pgtable_t token = pmd_pgtable(*pmd);
+
+   pmd_clear(pmd);
+   pte_free_tlb(tlb, token, addr);
+   atomic_long_dec(&tlb->mm->nr_ptes);
+}
+
+static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
+  unsigned long addr, unsigned long end,
+  unsigned long floor, unsigned long ceiling)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   unsigned long start;
+
+   start = addr;
+   pmd = pmd_offset(pud, addr);
+   do {
+   next = pmd_addr_end(addr, end);
+   if (pmd_none(*pmd))
+   continue;
+   if (is_hugetlb_pmd(*pmd))
+   pmd_clear(pmd);
+   else
+   hugetlb_free_pte_range(tlb, pmd, addr);
+   } while (pmd++, addr = next, addr != end);
+
+   start &= PUD_MASK;
+   if (start < floor)
+   return;
+   if (ceiling) {
+   ceiling &= PUD_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1 > ceiling - 1)
+   return;
+
+   pmd = pmd_offset(pud, start);
+   pud_clear(pud);
+   pmd_free_tlb(tlb, pmd, start);
+   mm_dec_nr_pmds(tlb->mm);
+}
+
+static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
+  unsigned long addr, unsigned long end,
+  unsigned long floor, unsigned long ceiling)
+{
+   pud_t *pud;
+   unsigned long next;
+   unsigned long start;
+
+   start = addr;
+   pud = pud_offset(pgd, addr);
+   do {
+   next = pud_addr_end(addr, end);
+   if (pud_none_or_clear_bad(pud))
+   continue;
+   hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
+  ceiling);
+   } while (pud++, addr = next, addr != end);
+
+   start &= PGDIR_MASK;
+   if (start < floor)
+   return;
+   if (ceiling) {
+   ceiling &= PGDIR_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1 > ceiling - 1)
+   return;
+
+   pud = pud_offset(pgd, start);
+   pgd_clear(pgd);
+   pud_free_tlb(tlb, pud, start);
+}
+
+void hugetlb_free_pgd_range(struct mmu_gather *tlb,
+   unsigned long addr, unsigned long end,
+   unsigned long floor, unsigned long ceiling)
+{
+   pgd_t *pgd;
+   unsigned long next;
+
+   pgd = pgd_offset(tlb->mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   if (pgd_none_or_clear_bad(pgd))
+   continue;
+   hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+   } while (pgd++, addr = next, addr != end);
+}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c26d463..4461309 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -120,6 +120,10 @@ int pud_huge(pud_t pmd);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
 
+void hugetlb_free_pgd

[PATCH 1/2] sparc64: Trim page tables for 8M hugepages

2016-06-22 Thread Nitin Gupta
For PMD-aligned (8M) hugepages, we currently allocate
all four page table levels, which is wasteful. We now
allocate only down to the PMD level, which saves the
memory otherwise used by page tables.
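
As a rough illustration (a paraphrased sketch of the huge_pte_alloc()
hunk further down, not the literal patch), the allocation walk now
stops at the PMD, so no PTE page is ever allocated for an 8M mapping:

static pte_t *huge_pte_alloc_sketch(struct mm_struct *mm,
				    unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	pud_t *pud = pud_alloc(mm, pgd, addr);

	if (!pud)
		return NULL;
	/* The PMD slot itself holds the whole 8M huge mapping. */
	return (pte_t *)pmd_alloc(mm, pud, addr);
}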

Orabug: 22630259

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h |  7 +++-
 arch/sparc/include/asm/tsb.h|  2 +-
 arch/sparc/mm/fault_64.c|  4 +--
 arch/sparc/mm/hugetlbpage.c | 68 -
 arch/sparc/mm/init_64.c |  5 ++-
 5 files changed, 27 insertions(+), 59 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index e7d8280..1fb317f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -395,7 +395,7 @@ static inline unsigned long __pte_huge_mask(void)
 
 static inline pte_t pte_mkhuge(pte_t pte)
 {
-   return __pte(pte_val(pte) | __pte_huge_mask());
+   return __pte(pte_val(pte) | _PAGE_PMD_HUGE | __pte_huge_mask());
 }
 
 static inline bool is_hugetlb_pte(pte_t pte)
@@ -403,6 +403,11 @@ static inline bool is_hugetlb_pte(pte_t pte)
return !!(pte_val(pte) & __pte_huge_mask());
 }
 
+static inline bool is_hugetlb_pmd(pmd_t pmd)
+{
+   return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index c6a155c..32258e0 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -203,7 +203,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 * We have to propagate the 4MB bit of the virtual address
 * because we are fabricating 8MB pages using 4MB hw pages.
 */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
 #define USER_PGTABLE_CHECK_PMD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
brz,pn  REG1, FAIL_LABEL;   \
 sethi  %uhi(_PAGE_PMD_HUGE), REG2; \
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a3..ff3f9f9 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -111,8 +111,8 @@ static unsigned int get_user_insn(unsigned long tpc)
if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
goto out_irq_enable;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   if (pmd_trans_huge(*pmdp)) {
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+   if (is_hugetlb_pmd(*pmdp)) {
pa  = pmd_pfn(*pmdp) << PAGE_SHIFT;
pa += tpc & ~HPAGE_MASK;
 
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ba52e64..cafb5ca 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -131,23 +131,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 {
pgd_t *pgd;
pud_t *pud;
-   pmd_t *pmd;
pte_t *pte = NULL;
 
-   /* We must align the address, because our caller will run
-* set_huge_pte_at() on whatever we return, which writes out
-* all of the sub-ptes for the hugepage range.  So we have
-* to give it the first such sub-pte.
-*/
-   addr &= HPAGE_MASK;
-
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
-   if (pud) {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (pmd)
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
+   if (pud)
+   pte = (pte_t *)pmd_alloc(mm, pud, addr);
+
return pte;
 }
 
@@ -155,19 +145,13 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
 {
pgd_t *pgd;
pud_t *pud;
-   pmd_t *pmd;
pte_t *pte = NULL;
 
-   addr &= HPAGE_MASK;
-
pgd = pgd_offset(mm, addr);
if (!pgd_none(*pgd)) {
pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd))
-   pte = pte_offset_map(pmd, addr);
-   }
+   if (!pud_none(*pud))
+   pte = (pte_t *)pmd_offset(pud, addr);
}
return pte;
 }
@@ -175,67 +159,43 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t entry)
 {
-   int i;
-   pte_t orig[2];
-   unsigned long nptes;
+   pte_t orig;
 
if (!pte_present(*ptep) && pte_present(entry))
mm->context.huge_pte_count++;
 
addr &= HPAGE_MASK;
-
-   nptes = 1 << HUGETLB_PAGE_ORDER;
-   orig[0] = *ptep;
-   orig[1] = *(ptep + nptes / 2);
-   for (i = 0; i < nptes; i++) {
-   *ptep = entry;
-   ptep++;
-   

[PATCH v4] sparc64: Reduce TLB flushes during hugepte changes

2016-03-30 Thread Nitin Gupta
During hugepage map/unmap, TSB and TLB flushes are currently
issued at every PAGE_SIZE'd boundary, which is unnecessary.
We now issue the flush at REAL_HPAGE_SIZE boundaries only.

Without this patch, workloads which unmap a large hugepage-backed
VMA region get CPU lockups due to excessive TLB flush calls.
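
As an illustrative sketch only (assuming an 8M hugepage is backed by
two 4M REAL_HPAGE_SIZE hardware mappings, as the tsb.h comment notes;
the real change is in the hunks below), the map path now queues two
flushes per hugepage instead of one per 8K PTE:

static void set_huge_pte_sketch(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, pte_t entry)
{
	unsigned long i, nptes = 1UL << HUGETLB_PAGE_ORDER;
	pte_t orig0, orig1;

	addr &= HPAGE_MASK;
	orig0 = *ptep;
	orig1 = *(ptep + nptes / 2);

	for (i = 0; i < nptes; i++)
		ptep[i] = entry;

	/* One TLB/TSB flush per REAL_HPAGE_SIZE half, not per PAGE_SIZE. */
	maybe_tlb_batch_add(mm, addr, ptep, orig0, 0);
	maybe_tlb_batch_add(mm, addr + REAL_HPAGE_SIZE, ptep + nptes / 2,
			    orig1, 0);
}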

Orabug: 22365539, 22643230, 22995196

Signed-off-by: Nitin Gupta 

---
Changelog v4 vs v3:
 - Fix build error when CONFIG_HUGETLB_PAGE is not defined
 - Tested build with randconfig, allyesconfig, allnoconfig
Changelog v3 vs v2:
 - Changed patch title to reflect that both map/unmap cases
   are affected.
 - Don't do TLB flush if original PTE wasn't valid (DaveM)
 - Use tlb_batch_add() instead of directly calling TLB flush
   function. This routine also flushes dcache (needed by older
   sparcs) (DaveM)
Changelog v1 vs v2:
 - Access PTEs in order (David Miller)
 - Issue TLB and TSB flush after clearing PTEs (David Miller)
---
 arch/sparc/include/asm/pgtable_64.h  |   43 +
 arch/sparc/include/asm/tlbflush_64.h |3 +-
 arch/sparc/mm/hugetlbpage.c  |   33 ++
 arch/sparc/mm/init_64.c  |   12 -
 arch/sparc/mm/tlb.c  |   25 ++-
 arch/sparc/mm/tsb.c  |   32 +
 6 files changed, 97 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index f089cfa..5a189bf 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,7 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline pte_t pte_mkhuge(pte_t pte)
+static inline unsigned long __pte_huge_mask(void)
 {
unsigned long mask;
 
@@ -390,8 +390,19 @@ static inline pte_t pte_mkhuge(pte_t pte)
: "=r" (mask)
: "i" (_PAGE_SZHUGE_4U), "i" (_PAGE_SZHUGE_4V));
 
-   return __pte(pte_val(pte) | mask);
+   return mask;
+}
+
+static inline pte_t pte_mkhuge(pte_t pte)
+{
+   return __pte(pte_val(pte) | __pte_huge_mask());
+}
+
+static inline bool is_hugetlb_pte(pte_t pte)
+{
+   return !!(pte_val(pte) & __pte_huge_mask());
 }
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
@@ -403,6 +414,11 @@ static inline pmd_t pmd_mkhuge(pmd_t pmd)
return __pmd(pte_val(pte));
 }
 #endif
+#else
+static inline bool is_hugetlb_pte(pte_t pte)
+{
+   return false;
+}
 #endif
 
 static inline pte_t pte_mkdirty(pte_t pte)
@@ -858,6 +874,19 @@ static inline unsigned long pud_pfn(pud_t pud)
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
   pte_t *ptep, pte_t orig, int fullmm);
 
+static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+   pte_t *ptep, pte_t orig, int fullmm)
+{
+   /* It is more efficient to let flush_tlb_kernel_range()
+* handle init_mm tlb flushes.
+*
+* SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
+* and SUN4V pte layout, so this inline test is fine.
+*/
+   if (likely(mm != &init_mm) && pte_accessible(mm, orig))
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+}
+
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long addr,
@@ -874,15 +903,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte_t orig = *ptep;
 
*ptep = pte;
-
-   /* It is more efficient to let flush_tlb_kernel_range()
-* handle init_mm tlb flushes.
-*
-* SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
-* and SUN4V pte layout, so this inline test is fine.
-*/
-   if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, addr, ptep, orig, fullmm);
+   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
 #define set_pte_at(mm,addr,ptep,pte)   \
diff --git a/arch/sparc/include/asm/tlbflush_64.h 
b/arch/sparc/include/asm/tlbflush_64.h
index dea1cfa..a8e192e 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -8,6 +8,7 @@
 #define TLB_BATCH_NR   192
 
 struct tlb_batch {
+   bool huge;
struct mm_struct *mm;
unsigned long tlb_nr;
unsigned long active;
@@ -16,7 +17,7 @@ struct tlb_batch {
 
 void flush_tsb_kernel_range(unsigned long start, unsigned long end);
 void flush_tsb_user(struct tlb_batch *tb);
-void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr);
+void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr, bool huge);
 
 /* TLB flu

[PATCH v3] sparc64: Reduce TLB flushes during hugepte changes

2016-03-29 Thread Nitin Gupta
During hugepage map/unmap, TSB and TLB flushes are currently
issued at every PAGE_SIZE'd boundary, which is unnecessary.
We now issue the flush at REAL_HPAGE_SIZE boundaries only.

Without this patch, workloads which unmap a large hugepage-backed
VMA region get CPU lockups due to excessive TLB flush calls.

Orabug: 22365539, 22643230, 22995196

Signed-off-by: Nitin Gupta 

---
Changelog v3 vs v2:
 - Changed patch title to reflect that both map/unmap cases
   are affected.
 - Don't do TLB flush if original PTE wasn't valid (DaveM)
 - Use tlb_batch_add() instead of directly calling TLB flush
   function. This routine also flushes dcache (needed by older
   sparcs) (DaveM)
Changelog v1 vs v2:
 - Access PTEs in order (David Miller)
 - Issue TLB and TSB flush after clearing PTEs (David Miller)
---
 arch/sparc/include/asm/pgtable_64.h  | 38 +---
 arch/sparc/include/asm/tlbflush_64.h |  3 ++-
 arch/sparc/mm/hugetlbpage.c  | 33 ++-
 arch/sparc/mm/init_64.c  | 12 
 arch/sparc/mm/tlb.c  | 18 +
 arch/sparc/mm/tsb.c  | 32 --
 6 files changed, 88 insertions(+), 48 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 7a38d6a..0e706b8 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -375,7 +375,7 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline pte_t pte_mkhuge(pte_t pte)
+static inline unsigned long __pte_huge_mask(void)
 {
unsigned long mask;
 
@@ -390,8 +390,19 @@ static inline pte_t pte_mkhuge(pte_t pte)
: "=r" (mask)
: "i" (_PAGE_SZHUGE_4U), "i" (_PAGE_SZHUGE_4V));
 
-   return __pte(pte_val(pte) | mask);
+   return mask;
+}
+
+static inline pte_t pte_mkhuge(pte_t pte)
+{
+   return __pte(pte_val(pte) | __pte_huge_mask());
+}
+
+static inline bool is_hugetlb_pte(pte_t pte)
+{
+   return !!(pte_val(pte) & __pte_huge_mask());
 }
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
@@ -858,6 +869,19 @@ static inline unsigned long pud_pfn(pud_t pud)
 void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
   pte_t *ptep, pte_t orig, int fullmm);
 
+static void maybe_tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
+   pte_t *ptep, pte_t orig, int fullmm)
+{
+   /* It is more efficient to let flush_tlb_kernel_range()
+* handle init_mm tlb flushes.
+*
+* SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
+* and SUN4V pte layout, so this inline test is fine.
+*/
+   if (likely(mm != &init_mm) && pte_accessible(mm, orig))
+   tlb_batch_add(mm, vaddr, ptep, orig, fullmm);
+}
+
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long addr,
@@ -874,15 +898,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
pte_t orig = *ptep;
 
*ptep = pte;
-
-   /* It is more efficient to let flush_tlb_kernel_range()
-* handle init_mm tlb flushes.
-*
-* SUN4V NOTE: _PAGE_VALID is the same value in both the SUN4U
-* and SUN4V pte layout, so this inline test is fine.
-*/
-   if (likely(mm != &init_mm) && pte_accessible(mm, orig))
-   tlb_batch_add(mm, addr, ptep, orig, fullmm);
+   maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm);
 }
 
 #define set_pte_at(mm,addr,ptep,pte)   \
diff --git a/arch/sparc/include/asm/tlbflush_64.h 
b/arch/sparc/include/asm/tlbflush_64.h
index dea1cfa..a8e192e 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -8,6 +8,7 @@
 #define TLB_BATCH_NR   192
 
 struct tlb_batch {
+   bool huge;
struct mm_struct *mm;
unsigned long tlb_nr;
unsigned long active;
@@ -16,7 +17,7 @@ struct tlb_batch {
 
 void flush_tsb_kernel_range(unsigned long start, unsigned long end);
 void flush_tsb_user(struct tlb_batch *tb);
-void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr);
+void flush_tsb_user_page(struct mm_struct *mm, unsigned long vaddr, bool huge);
 
 /* TLB flush operations. */
 
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 4977800..ba52e64 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -176,17 +176,31 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr,
 pte_t *ptep, pte_t entry)
 {
int i;
+   pte_t orig[2];
+   un

Fwd: [PATCH] char:misc minor is overflowing

2015-12-09 Thread Nitin Gupta
Hi,

Is there any modification or improvement needed in this patch?

--- Original Message ---
Sender : Shivnandan Kumar Engineer/SRI-Noida-Advance 
Solutions - System 1 R&D Group/Samsung Electronics
Date : Nov 20, 2015 15:35 (GMT+05:30)
Title : [PATCH] char:misc minor is overflowing

When a driver registers as a misc driver and tries to allocate its
minor number dynamically, the minor number can overflow into the
statically reserved range. The problem is that the 64 dynamic minors
(DYNAMIC_MINORS) are mapped onto minors 0-63, and if the kernel hands
out 0-63 dynamically then that whole range ought to be reserved for
it; but minors 0-10 are already used by other devices, for example 1
is reserved for PSMOUSE. I hit a case where misc_minors was 0x3FFF,
so the value of variable 'i' in misc_register() became 62 and the
resulting minor became 1, which was already reserved for PSMOUSE.
This patch avoids that problem.
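
To make the arithmetic concrete, a user-space sketch (illustration
only, not kernel code; the constants mirror the patch below):

#include <stdio.h>

#define DYNAMIC_MINORS      64
#define DYNAMIC_MINOR_START 11	/* introduced by this patch */
#define PSMOUSE_MINOR        1

int main(void)
{
	int i = 62;	/* the bitmap slot seen in the report above */
	int old_minor = DYNAMIC_MINORS - i - 1;
	int new_minor = DYNAMIC_MINORS - i - 1 + DYNAMIC_MINOR_START;

	/* old: 1, colliding with PSMOUSE; new: 12, above the 0-10 range
	 * claimed by fixed minors.
	 */
	printf("old=%d (PSMOUSE_MINOR=%d), new=%d\n",
	       old_minor, PSMOUSE_MINOR, new_minor);
	return 0;
}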

Signed-off-by: shivnandan kumar 
---
drivers/char/misc.c|5 ++---
include/linux/miscdevice.h |1 +
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/char/misc.c b/drivers/char/misc.c
index 8069b36..1a6a640 100644
--- a/drivers/char/misc.c
+++ b/drivers/char/misc.c
@@ -198,7 +198,7 @@ int misc_register(struct miscdevice * misc)
err = -EBUSY;
goto out;
}
- misc->minor = DYNAMIC_MINORS - i - 1;
+ misc->minor = DYNAMIC_MINORS - i - 1 + DYNAMIC_MINOR_START;
set_bit(i, misc_minors);
} else {
struct miscdevice *c;
@@ -218,8 +218,7 @@ int misc_register(struct miscdevice * misc)
  misc, misc->groups, "%s", misc->name);
if (IS_ERR(misc->this_device)) {
if (is_dynamic) {
- int i = DYNAMIC_MINORS - misc->minor - 1;
+ int i = DYNAMIC_MINORS - misc->minor - 1 + DYNAMIC_MINOR_START;
if (i < DYNAMIC_MINORS && i >= 0)
clear_bit(i, misc_minors);
misc->minor = MISC_DYNAMIC_MINOR;
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 81f6e42..7aa931e 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -19,6 +19,7 @@
#define APOLLO_MOUSE_MINOR 7 /* unused */
#define PC110PAD_MINOR 9 /* unused */
/*#define ADB_MOUSE_MINOR 10 FIXME OBSOLETE */
+#define DYNAMIC_MINOR_START 11
#define WATCHDOG_MINOR 130 /* Watchdog timer */
#define TEMP_MINOR 131 /* Temperature Sensor */
#define RTC_MINOR 135
-- 
1.7.9.5



 Nitin Gupta
 

[PATCH] sparc64: Fix numa distance values

2015-11-02 Thread Nitin Gupta
Orabug: 21896119

Use machine descriptor (MD) to get node latency
values instead of just using default values.

Testing:
On a T5-8 system with:
 - total nodes = 8
 - self latencies = 0x26d18
 - latency to other nodes = 0x3a598
   => latency ratio = ~1.5

output of numactl --hardware

 - before fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

 - after fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  15  15  15  15  15  15  15
  1:  15  10  15  15  15  15  15  15
  2:  15  15  10  15  15  15  15  15
  3:  15  15  15  10  15  15  15  15
  4:  15  15  15  15  10  15  15  15
  5:  15  15  15  15  15  10  15  15
  6:  15  15  15  15  15  15  10  15
  7:  15  15  15  15  15  15  15  10
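
The off-diagonal value of 15 follows directly from the latencies
listed above. A minimal sketch of the normalization (an assumption
about the computation, since the normalizing hunk is truncated in
this archive):

static int normalized_distance(u64 latency, u64 self_latency)
{
	/* e.g. (0x3a598 * 10) / 0x26d18 = 2390000 / 159000 = 15 */
	return (int)((latency * LOCAL_DISTANCE) / self_latency);
}

The diagonal stays at LOCAL_DISTANCE (10), matching the numactl
output after the fix.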

Signed-off-by: Nitin Gupta 
Reviewed-by: Chris Hyser 
Reviewed-by: Santosh Shilimkar 
---
Changelog v1 -> v2:
  - Drop extern keyword for function prototype (Sam Ravnborg)

arch/sparc/include/asm/topology_64.h |3 +
 arch/sparc/mm/init_64.c  |   70 +-
 2 files changed, 72 insertions(+), 1 deletions(-)

diff --git a/arch/sparc/include/asm/topology_64.h 
b/arch/sparc/include/asm/topology_64.h
index 01d1704..bec481a 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,6 +31,9 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
+int __node_distance(int, int);
+#define node_distance(a, b) __node_distance(a, b)
+
 #else /* CONFIG_NUMA */
 
 #include 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 4ac88b7..3025bd5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -93,6 +93,8 @@ static unsigned long cpu_pgsz_mask;
 static struct linux_prom64_registers pavail[MAX_BANKS];
 static int pavail_ents;
 
+u64 numa_latency[MAX_NUMNODES][MAX_NUMNODES];
+
 static int cmp_p64(const void *a, const void *b)
 {
const struct linux_prom64_registers *x = a, *y = b;
@@ -1157,6 +1159,48 @@ static struct mdesc_mlgroup * __init find_mlgroup(u64 
node)
return NULL;
 }
 
+int __node_distance(int from, int to)
+{
+   if ((from >= MAX_NUMNODES) || (to >= MAX_NUMNODES)) {
+   pr_warn("Returning default NUMA distance value for %d->%d\n",
+   from, to);
+   return (from == to) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+   }
+   return numa_latency[from][to];
+}
+
+static int find_best_numa_node_for_mlgroup(struct mdesc_mlgroup *grp)
+{
+   int i;
+
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   struct node_mem_mask *n = &node_masks[i];
+
+   if ((grp->mask == n->mask) && (grp->match == n->val))
+   break;
+   }
+   return i;
+}
+
+static void find_numa_latencies_for_group(struct mdesc_handle *md, u64 grp,
+ int index)
+{
+   u64 arc;
+
+   mdesc_for_each_arc(arc, md, grp, MDESC_ARC_TYPE_FWD) {
+   int tnode;
+   u64 target = mdesc_arc_target(md, arc);
+   struct mdesc_mlgroup *m = find_mlgroup(target);
+
+   if (!m)
+   continue;
+   tnode = find_best_numa_node_for_mlgroup(m);
+   if (tnode == MAX_NUMNODES)
+   continue;
+   numa_latency[index][tnode] = m->latency;
+   }
+}
+
 static int __init numa_attach_mlgroup(struct mdesc_handle *md, u64 grp,
  int index)
 {
@@ -1220,9 +1264,16 @@ static int __init numa_parse_mdesc_group(struct 
mdesc_handle *md, u64 grp,
 static int __init numa_parse_mdesc(void)
 {
struct mdesc_handle *md = mdesc_grab();
-   int i, err, count;
+   int i, j, err, count;
u64 node;
 
+   /* Some sane defaults for numa latency values */
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   for (j = 0; j < MAX_NUMNODES; j++)
+   numa_latency[i][j] = (i == j) ?
+   LOCAL_DISTANCE : REMOTE_DISTANCE;
+   }
+
node = mdesc_node_by_name(md, MDESC_NODE_NULL, "latency-groups");
if (node == MDESC_NODE_NULL) {
mdesc_release(md);
@@ -1245,6 +1296,23 @@ static int __init numa_parse_mdesc(void)
count++;
}
 
+   count = 0;
+   mdesc_for_each_node_by_name(md, node, "group") {
+   find_numa_latencies_for_group(md, node, count);
+   count++;
+   }
+
+   /* Normalize numa latency matrix according to ACPI SLIT spec. */
+   for (i =

Re: [PATCH] sparc64: Fix numa distance values

2015-10-29 Thread Nitin Gupta

On 10/29/2015 11:50 AM, Sam Ravnborg wrote:

Small nit.


diff --git a/arch/sparc/include/asm/topology_64.h 
b/arch/sparc/include/asm/topology_64.h
index 01d1704..ed3dfdd 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,6 +31,9 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))

+extern int __node_distance(int, int);

We have dropped using "extern" for function prototypes.



ok, dropped extern here.


+#define node_distance(a, b) __node_distance(a, b)


And had this been written as:
#define node_distance node_distance


underscores here to separate macro name from function name
seems to be clearer and would also avoid confusing
cross-referencing tools.


int node_distance(int, int);

Then there would have been no need for the leading underscores.

But as I said - only nits.

Sam


Thanks for the review.
Nitin




[PATCH] sparc64: Fix numa distance values

2015-10-28 Thread Nitin Gupta
Orabug: 21896119

Use machine descriptor (MD) to get node latency
values instead of just using default values.

Testing:
On a T5-8 system with:
 - total nodes = 8
 - self latencies = 0x26d18
 - latency to other nodes = 0x3a598
   => latency ratio = ~1.5

output of numactl --hardware

 - before fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

 - after fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  15  15  15  15  15  15  15
  1:  15  10  15  15  15  15  15  15
  2:  15  15  10  15  15  15  15  15
  3:  15  15  15  10  15  15  15  15
  4:  15  15  15  15  10  15  15  15
  5:  15  15  15  15  15  10  15  15
  6:  15  15  15  15  15  15  10  15
  7:  15  15  15  15  15  15  15  10

Signed-off-by: Nitin Gupta 
Reviewed-by: Chris Hyser 
Reviewed-by: Santosh Shilimkar 
---
 arch/sparc/include/asm/topology_64.h |3 +
 arch/sparc/mm/init_64.c  |   70 +-
 2 files changed, 72 insertions(+), 1 deletions(-)

diff --git a/arch/sparc/include/asm/topology_64.h 
b/arch/sparc/include/asm/topology_64.h
index 01d1704..ed3dfdd 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,6 +31,9 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 cpu_all_mask : \
 cpumask_of_node(pcibus_to_node(bus)))
 
+extern int __node_distance(int, int);
+#define node_distance(a, b) __node_distance(a, b)
+
 #else /* CONFIG_NUMA */
 
 #include 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 4ac88b7..3025bd5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -93,6 +93,8 @@ static unsigned long cpu_pgsz_mask;
 static struct linux_prom64_registers pavail[MAX_BANKS];
 static int pavail_ents;
 
+u64 numa_latency[MAX_NUMNODES][MAX_NUMNODES];
+
 static int cmp_p64(const void *a, const void *b)
 {
const struct linux_prom64_registers *x = a, *y = b;
@@ -1157,6 +1159,48 @@ static struct mdesc_mlgroup * __init find_mlgroup(u64 
node)
return NULL;
 }
 
+int __node_distance(int from, int to)
+{
+   if ((from >= MAX_NUMNODES) || (to >= MAX_NUMNODES)) {
+   pr_warn("Returning default NUMA distance value for %d->%d\n",
+   from, to);
+   return (from == to) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+   }
+   return numa_latency[from][to];
+}
+
+static int find_best_numa_node_for_mlgroup(struct mdesc_mlgroup *grp)
+{
+   int i;
+
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   struct node_mem_mask *n = &node_masks[i];
+
+   if ((grp->mask == n->mask) && (grp->match == n->val))
+   break;
+   }
+   return i;
+}
+
+static void find_numa_latencies_for_group(struct mdesc_handle *md, u64 grp,
+ int index)
+{
+   u64 arc;
+
+   mdesc_for_each_arc(arc, md, grp, MDESC_ARC_TYPE_FWD) {
+   int tnode;
+   u64 target = mdesc_arc_target(md, arc);
+   struct mdesc_mlgroup *m = find_mlgroup(target);
+
+   if (!m)
+   continue;
+   tnode = find_best_numa_node_for_mlgroup(m);
+   if (tnode == MAX_NUMNODES)
+   continue;
+   numa_latency[index][tnode] = m->latency;
+   }
+}
+
 static int __init numa_attach_mlgroup(struct mdesc_handle *md, u64 grp,
  int index)
 {
@@ -1220,9 +1264,16 @@ static int __init numa_parse_mdesc_group(struct 
mdesc_handle *md, u64 grp,
 static int __init numa_parse_mdesc(void)
 {
struct mdesc_handle *md = mdesc_grab();
-   int i, err, count;
+   int i, j, err, count;
u64 node;
 
+   /* Some sane defaults for numa latency values */
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   for (j = 0; j < MAX_NUMNODES; j++)
+   numa_latency[i][j] = (i == j) ?
+   LOCAL_DISTANCE : REMOTE_DISTANCE;
+   }
+
node = mdesc_node_by_name(md, MDESC_NODE_NULL, "latency-groups");
if (node == MDESC_NODE_NULL) {
mdesc_release(md);
@@ -1245,6 +1296,23 @@ static int __init numa_parse_mdesc(void)
count++;
}
 
+   count = 0;
+   mdesc_for_each_node_by_name(md, node, "group") {
+   find_numa_latencies_for_group(md, node, count);
+   count++;
+   }
+
+   /* Normalize numa latency matrix according to ACPI SLIT spec. */
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   u64 sel

Re: [PATCH] staging: zsmalloc: Ensure handle is never 0 on success

2013-11-12 Thread Nitin Gupta

On 11/12/13, 6:42 PM, Greg KH wrote:

On Wed, Nov 13, 2013 at 12:41:38AM +0900, Minchan Kim wrote:

We have spent much time unable to enhance zram since it has been in
staging, because Greg never wanted us to improve it without promotion
out of staging.


It's not "improve", it's "Greg does not want you adding new features and
functionality while the code is in staging."  I want you to spend your
time on getting it out of staging first.

Now if something needs to be done based on review and comments to the
code, then that's fine to do and I'll accept that, but I've been seeing
new functionality be added to the code, which I will not accept because
it seems that you all have given up on getting it merged, which isn't
ok.



It's not that people have given up on getting it merged, but every time
patches are posted there is really no response from the maintainers,
perhaps due to their lack of interest in embedded, or perhaps because
they believe embedded folks are making the wrong choice by using zram.
Either way, a final word, instead of just silence, would be more helpful.


Thanks,
Nitin



Re: [PATCH v2] staging: zsmalloc: Ensure handle is never 0 on success

2013-11-08 Thread Nitin Gupta
On Thu, Nov 7, 2013 at 5:58 PM, Olav Haugan  wrote:
> zsmalloc encodes a handle using the pfn and an object
> index. On hardware platforms with physical memory starting
> at 0x0 the pfn can be 0. This causes the encoded handle to be
> 0 and is incorrectly interpreted as an allocation failure.
>
> To prevent this false error we ensure that the encoded handle
> will not be 0 when allocation succeeds.
>
> Signed-off-by: Olav Haugan 
> ---
>  drivers/staging/zsmalloc/zsmalloc-main.c | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c 
> b/drivers/staging/zsmalloc/zsmalloc-main.c
> index 1a67537..3b950e5 100644
> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
> @@ -430,7 +430,12 @@ static struct page *get_next_page(struct page *page)
> return next;
>  }
>
> -/* Encode  as a single handle value */
> +/*
> + * Encode  as a single handle value.
> + * On hardware platforms with physical memory starting at 0x0 the pfn
> + * could be 0 so we ensure that the handle will never be 0 by adjusting the
> + * encoded obj_idx value before encoding.
> + */
>  static void *obj_location_to_handle(struct page *page, unsigned long obj_idx)
>  {
> unsigned long handle;
> @@ -441,17 +446,21 @@ static void *obj_location_to_handle(struct page *page, 
> unsigned long obj_idx)
> }
>
> handle = page_to_pfn(page) << OBJ_INDEX_BITS;
> -   handle |= (obj_idx & OBJ_INDEX_MASK);
> +   handle |= ((obj_idx + 1) & OBJ_INDEX_MASK);
>
> return (void *)handle;
>  }
>
> -/* Decode  pair from the given object handle */
> +/*
> + * Decode  pair from the given object handle. We adjust the
> + * decoded obj_idx back to its original value since it was adjusted in
> + * obj_location_to_handle().
> + */
>  static void obj_handle_to_location(unsigned long handle, struct page **page,
> unsigned long *obj_idx)
>  {
> *page = pfn_to_page(handle >> OBJ_INDEX_BITS);
> -   *obj_idx = handle & OBJ_INDEX_MASK;
> +   *obj_idx = (handle & OBJ_INDEX_MASK) - 1;
>  }
>
>  static unsigned long obj_idx_to_offset(struct page *page,

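For reference, a stand-alone illustration of the +1/-1 trick in the
hunks above (OBJ_INDEX_BITS below is an arbitrary value chosen just
for the example): even with pfn == 0 and obj_idx == 0 the encoded
handle is non-zero, so a valid allocation can no longer be mistaken
for a failure.

#include <assert.h>

#define OBJ_INDEX_BITS	16
#define OBJ_INDEX_MASK	((1UL << OBJ_INDEX_BITS) - 1)

static unsigned long encode(unsigned long pfn, unsigned long obj_idx)
{
	return (pfn << OBJ_INDEX_BITS) | ((obj_idx + 1) & OBJ_INDEX_MASK);
}

static void decode(unsigned long handle, unsigned long *pfn,
		   unsigned long *obj_idx)
{
	*pfn = handle >> OBJ_INDEX_BITS;
	*obj_idx = (handle & OBJ_INDEX_MASK) - 1;
}

int main(void)
{
	unsigned long pfn, idx;

	assert(encode(0, 0) != 0);	/* would have read as failure before */
	decode(encode(0, 0), &pfn, &idx);
	assert(pfn == 0 && idx == 0);	/* round-trips to the original pair */
	return 0;
}
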
Acked-by: Nitin Gupta 

Thanks,
Nitin

