[PATCH] x86: Unify definition of jiffies

2018-01-07 Thread Zhihui Zhang
jiffies_64 is always defined in kernel/time/timer.c. Thus, we can
unify the definition of jiffies and make it less confusing. This
change only affects 64-bit platforms.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 arch/x86/kernel/time.c        | 4 ----
 arch/x86/kernel/vmlinux.lds.S | 4 ++--
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index 749d189..2fedb33 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -23,10 +23,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_X86_64
-__visible volatile unsigned long jiffies __cacheline_aligned = INITIAL_JIFFIES;
-#endif
-
 unsigned long profile_pc(struct pt_regs *regs)
 {
unsigned long pc = instruction_pointer(regs);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 1e413a93..940c190 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -36,13 +36,13 @@ OUTPUT_FORMAT(CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT)
 #ifdef CONFIG_X86_32
 OUTPUT_ARCH(i386)
 ENTRY(phys_startup_32)
-jiffies = jiffies_64;
 #else
 OUTPUT_ARCH(i386:x86-64)
 ENTRY(phys_startup_64)
-jiffies_64 = jiffies;
 #endif
 
+jiffies = jiffies_64;
+
 #if defined(CONFIG_X86_64)
 /*
  * On 64-bit, align RODATA to 2MB so we retain large page mappings for
-- 
2.7.4
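
For readers not used to linker-script aliases: the moved line simply gives the
symbol jiffies the same address as jiffies_64, and on a little-endian machine
such as x86 a narrower load at that address returns the low bits of the 64-bit
counter. A tiny user-space sketch of the idea (illustration only, not kernel
code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* One 64-bit counter viewed through a 32-bit alias at the same
         * address -- the effect of the "jiffies = jiffies_64;" linker line
         * on a little-endian machine. */
        union {
            uint64_t jiffies_64;
            uint32_t jiffies;
        } j = { .jiffies_64 = 0x100000000ULL + 12345 };

        printf("jiffies_64 = %llu, jiffies = %u\n",
               (unsigned long long)j.jiffies_64, j.jiffies);  /* ..., 12345 */
        return 0;
    }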



Re: [PATCH] timers: Reconcile the code and the comment for the 250HZ case

2017-01-24 Thread Zhihui Zhang
Ah, I see your point. Thanks for the detailed explanation.

-Zhihui

On Mon, Jan 23, 2017 at 6:10 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> On Sat, 21 Jan 2017, Zhihui Zhang wrote:
>
>> Sure, I believe that comments should always match the code. In this
>
> That's fine.
>
>> case, using either LVL_SIZE - 1 or LVL_SIZE is fine based on my
>> understanding about 20 days ago. But I could be wrong and miss some
>> subtle details. Anyway, my point is about readability.
>
> Well, readability is one thing, but correctness is more important, right?
>
> Let's assume we have 4 buckets per level and base->clk is 0. So Level 0
> has the following expiry times:
>
> Bucket 0:   base->clk + 0
> Bucket 1:   base->clk + 1
> Bucket 2:   base->clk + 2
> Bucket 3:   base->clk + 3
>
> So we can accommodate 4 timers here, but there is a nifty detail. We
> guarantee that expiries are not short, so a timer armed for base->clk
> will expire at base->clk + 1.
>
> The reason for this is that we have no distinction between absolute and
> relative timeouts. But for relative timeouts we have to guarantee that the
> timeout does not expire before the number of jiffies has elapsed.
>
> Now a timer armed with 1 jiffy relative to now (jiffies) cannot be queued
> to bucket 0 because jiffies can be incremented immediately after queueing
> the timer which would expire it early. So it's queued to bucket 1 and
> that's why we need to have LVL_SIZE - 1 and not LVL_SIZE. See also
> calc_index().
>
> Your change completely breaks the wheel. Let's assume the above and a
> timer expiring at base->clk + 3.
>
> With your change the timer would fall into Level 0. So now calc_index()
> does:
>
> expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
> return LVL_OFFS(lvl) + (expires & LVL_MASK);
>
> Let's substitute that for the expires = base->clk + 3 case:
>
> expires = (base->clk + 3 + 1) >> 0;
>
> --->expires = 4;
>
> return 0 + (4 & 0x03);
>
> --->index = 0
>
> So the timer gets queued into bucket 0 and expires 4 jiffies too early.
>
> So using either LVL_SIZE - 1 or LVL_SIZE is _NOT_ fine.
>
> Thanks,
>
> tglx
>
>
>
>
>
>
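
To replay the arithmetic above, here is a stand-alone toy version of the
calc_index() math with the example parameters (4 buckets per level, level-0
granularity of one jiffy). It is a sketch for illustration, not the kernel's
real macros:

    #include <stdio.h>

    #define LVL_SHIFT 0
    #define LVL_GRAN  (1UL << LVL_SHIFT)
    #define LVL_MASK  0x3UL               /* 4 buckets per level */
    #define LVL_OFFS  0UL

    static unsigned long calc_index(unsigned long expires)
    {
        expires = (expires + LVL_GRAN) >> LVL_SHIFT;
        return LVL_OFFS + (expires & LVL_MASK);
    }

    int main(void)
    {
        unsigned long base_clk = 0;

        /* base->clk + 3 lands in bucket 0, i.e. it would fire on the very
         * next wheel turn -- 4 jiffies too early, as explained above. */
        printf("index for base->clk + 3: %lu\n", calc_index(base_clk + 3));
        return 0;
    }

With LVL_SIZE - 1 as the level boundary, a timer for base->clk + 3 is pushed
up to the next, coarser level instead, so it cannot land in a bucket that
fires early.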


Re: [PATCH] timers: Reconcile the code and the comment for the 250HZ case

2017-01-21 Thread Zhihui Zhang
Sure, I believe that comments should always match the code. In this
case, using either LVL_SIZE - 1 or LVL_SIZE is fine based on my
understanding about 20 days ago. But I could be wrong and miss some
subtle details. Anyway, my point is about readability.

thanks,

On Fri, Jan 20, 2017 at 5:41 PM, John Stultz <john.stu...@linaro.org> wrote:
> On Mon, Jan 2, 2017 at 1:14 PM, Zhihui Zhang <zzhs...@gmail.com> wrote:
>> Adjust the time start of each level to match the comments. Note that
>> LVL_START(n) is never used for n = 0 case.  Also, each level (except
>> level 0) has more than enough room to accommodate all its timers.
>
> So instead of just covering what your patch does, can you explain in
> some detail why this patch is useful? What net effect does it bring?
> What sort of bugs would it solve?
>
> thanks
> -john


[PATCH] timers: Reconcile the code and the comment for the 250HZ case

2017-01-02 Thread Zhihui Zhang
Adjust the time start of each level to match the comments. Note that
LVL_START(n) is never used for the n = 0 case.  Also, each level (except
level 0) has more than enough room to accommodate all its timers.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 kernel/time/timer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index ec33a69..268d5ae 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -132,7 +132,7 @@ EXPORT_SYMBOL(jiffies_64);
  *  5   320131072 ms (~2m)1048576 ms -8388607 ms (~17m - ~2h)
  *  6   384   1048576 ms (~17m)   8388608 ms -   67108863 ms (~2h - ~18h)
  *  7   448   8388608 ms (~2h)   67108864 ms -  536870911 ms (~18h - ~6d)
- *  8512  67108864 ms (~18h) 536870912 ms - 4294967288 ms (~6d - ~49d)
+ *  8512  67108864 ms (~18h) 536870912 ms - 4294967295 ms (~6d - ~49d)
  *
  * HZ  100
  * Level Offset  GranularityRange
@@ -157,7 +157,7 @@ EXPORT_SYMBOL(jiffies_64);
  * The time start value for each level to select the bucket at enqueue
  * time.
  */
-#define LVL_START(n)   ((LVL_SIZE - 1) << (((n) - 1) * LVL_CLK_SHIFT))
+#define LVL_START(n)   (LVL_SIZE << (((n) - 1) * LVL_CLK_SHIFT))
 
 /* Size of each clock level */
 #define LVL_BITS   6
-- 
2.7.4
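
For reference, the numeric difference between the two forms can be seen by
expanding the macro. The sketch below assumes LVL_CLK_SHIFT = 3 (the value
used alongside LVL_BITS = 6 in kernel/time/timer.c) and is illustrative only;
as the follow-up discussion above explains, the "- 1" turns out to be
intentional:

    #include <stdio.h>

    #define LVL_CLK_SHIFT 3          /* assumed, as in kernel/time/timer.c */
    #define LVL_BITS      6
    #define LVL_SIZE      (1UL << LVL_BITS)

    #define LVL_START_OLD(n) ((LVL_SIZE - 1) << (((n) - 1) * LVL_CLK_SHIFT))
    #define LVL_START_NEW(n) (LVL_SIZE << (((n) - 1) * LVL_CLK_SHIFT))

    int main(void)
    {
        for (int n = 1; n <= 3; n++)
            printf("level %d: old start %5lu, patched start %5lu jiffies\n",
                   n, LVL_START_OLD(n), LVL_START_NEW(n));
        /* old: 63, 504, 4032 -- patched: 64, 512, 4096 (the comment table) */
        return 0;
    }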



[PATCH] sched/fair: remove the swap() logic in load_too_imbalanced()

2015-10-17 Thread Zhihui Zhang
The swap() logic was introduced before we scaled the load by the
actual CPU capacity. Now it looks funny that we swap the load but
not the CPU capacity. In fact, we probably need to check both
directions to ensure that the load of neither side increases too
much compared to the other side.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 kernel/sched/fair.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e2e348..5e8e17b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1203,7 +1203,7 @@ static void task_numa_assign(struct task_numa_env *env,
 static bool load_too_imbalanced(long src_load, long dst_load,
struct task_numa_env *env)
 {
-   long imb, old_imb;
+   long imb1, imb2, orig_imb1, orig_imb2;
long orig_src_load, orig_dst_load;
long src_capacity, dst_capacity;
 
@@ -1217,31 +1217,33 @@ static bool load_too_imbalanced(long src_load, long dst_load,
src_capacity = env->src_stats.compute_capacity;
dst_capacity = env->dst_stats.compute_capacity;
 
-   /* We care about the slope of the imbalance, not the direction. */
-   if (dst_load < src_load)
-   swap(dst_load, src_load);
-
-   /* Is the difference below the threshold? */
-   imb = dst_load * src_capacity * 100 -
- src_load * dst_capacity * env->imbalance_pct;
-   if (imb <= 0)
+   /* Does the difference in either direction exceed the threshold? */
+   imb1 = dst_load * src_capacity * 100 -
+  src_load * dst_capacity * env->imbalance_pct;
+   imb2 = src_load * dst_capacity * 100 -
+  dst_load * src_capacity * env->imbalance_pct;
+   if (imb1 <= 0 && imb2 <= 0)
return false;
 
/*
-* The imbalance is above the allowed threshold.
-* Compare it with the old imbalance.
+* At least one imbalance is above the allowed threshold.
+* Compare it with the original imbalance.
 */
orig_src_load = env->src_stats.load;
orig_dst_load = env->dst_stats.load;
 
-   if (orig_dst_load < orig_src_load)
-   swap(orig_dst_load, orig_src_load);
-
-   old_imb = orig_dst_load * src_capacity * 100 -
- orig_src_load * dst_capacity * env->imbalance_pct;
+   orig_imb1 = orig_dst_load * src_capacity * 100 -
+   orig_src_load * dst_capacity * env->imbalance_pct;
+   orig_imb2 = orig_src_load * dst_capacity * 100 -
+   orig_dst_load * src_capacity * env->imbalance_pct;
 
/* Would this change make things worse? */
-   return (imb > old_imb);
+   if (imb1 > 0 && imb1 > orig_imb1)
+   return true;
+   if (imb2 > 0 && imb2 > orig_imb2)
+   return true;
+
+   return false;
 }
 
 /*
-- 
1.9.1
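
A stand-alone illustration of the point about swapping loads without
capacities, using made-up numbers (an imbalance_pct of 125 is assumed purely
for the example, and only the threshold test is modelled):

    #include <stdio.h>

    int main(void)
    {
        long src_load = 1000, dst_load = 600;
        long src_capacity = 1024, dst_capacity = 512;
        long pct = 125;

        /* old logic: swap the loads (but not the capacities), so the larger
         * load ends up measured against the wrong capacity */
        long big = src_load > dst_load ? src_load : dst_load;
        long small = src_load > dst_load ? dst_load : src_load;
        long imb_old = big * src_capacity * 100 - small * dst_capacity * pct;

        /* patched logic: check both directions with matching capacities */
        long imb1 = dst_load * src_capacity * 100 - src_load * dst_capacity * pct;
        long imb2 = src_load * dst_capacity * 100 - dst_load * src_capacity * pct;

        printf("old: %s\n", imb_old > 0 ? "too imbalanced" : "ok");             /* too imbalanced */
        printf("new: %s\n", (imb1 > 0 || imb2 > 0) ? "too imbalanced" : "ok");  /* ok */
        return 0;
    }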



[PATCH] Rename RECLAIM_SWAP to RECLAIM_UNMAP.

2015-05-10 Thread Zhihui Zhang
The name SWAP implies that we are dealing with anonymous pages only.
In fact, the original patch that introduced the min_unmapped_ratio
logic was to fix an issue related to file pages. Rename it to
RECLAIM_UNMAP to match what it actually does.

Historically, commit a6dc60f8975a renamed .may_swap to .may_unmap,
leaving RECLAIM_SWAP behind.  Commit 2e2e42598908 reintroduced .may_swap
for the memory controller.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 mm/vmscan.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..15328de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3596,7 +3596,7 @@ int zone_reclaim_mode __read_mostly;
 #define RECLAIM_OFF 0
 #define RECLAIM_ZONE (1<<0)/* Run shrink_inactive_list on the zone */
 #define RECLAIM_WRITE (1<<1)   /* Writeout pages during reclaim */
-#define RECLAIM_SWAP (1<<2)/* Swap pages out during reclaim */
+#define RECLAIM_UNMAP (1<<2)   /* Unmap pages during reclaim */
 
 /*
  * Priority for ZONE_RECLAIM. This determines the fraction of pages
@@ -3638,12 +3638,12 @@ static long zone_pagecache_reclaimable(struct zone *zone)
long delta = 0;
 
/*
-* If RECLAIM_SWAP is set, then all file pages are considered
+* If RECLAIM_UNMAP is set, then all file pages are considered
 * potentially reclaimable. Otherwise, we have to worry about
 * pages like swapcache and zone_unmapped_file_pages() provides
 * a better estimate
 */
-   if (zone_reclaim_mode & RECLAIM_SWAP)
+   if (zone_reclaim_mode & RECLAIM_UNMAP)
nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
else
nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
@@ -3674,15 +3674,15 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.order = order,
.priority = ZONE_RECLAIM_PRIORITY,
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
-   .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
+   .may_unmap = !!(zone_reclaim_mode & RECLAIM_UNMAP),
.may_swap = 1,
};
 
cond_resched();
/*
-* We need to be able to allocate from the reserves for RECLAIM_SWAP
+* We need to be able to allocate from the reserves for RECLAIM_UNMAP
 * and we also need to be able to write out pages for RECLAIM_WRITE
-* and RECLAIM_SWAP.
+* and RECLAIM_UNMAP.
 */
p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
lockdep_set_current_reclaim_state(gfp_mask);
-- 
2.1.4
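
For context, zone_reclaim_mode is the sysctl bitmask that these flags decode.
A minimal sketch of the decoding used in __zone_reclaim() above (the mode
value is made up for illustration):

    #include <stdio.h>

    #define RECLAIM_ZONE  (1 << 0)   /* run zone reclaim at all */
    #define RECLAIM_WRITE (1 << 1)   /* write out dirty pages during reclaim */
    #define RECLAIM_UNMAP (1 << 2)   /* unmap (previously "swap") pages */

    int main(void)
    {
        int zone_reclaim_mode = 5;   /* e.g. bits 0 and 2 set by the admin */

        printf("zone reclaim: %d, writepage: %d, unmap: %d\n",
               !!(zone_reclaim_mode & RECLAIM_ZONE),
               !!(zone_reclaim_mode & RECLAIM_WRITE),
               !!(zone_reclaim_mode & RECLAIM_UNMAP));
        return 0;
    }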



Re: [PATCH] Do not use arbitrary large movablecore to calculate kernelcore

2015-04-02 Thread Zhihui Zhang
If you specify movablecore > totalpages, required_kernelcore will end
up with a big number because corepages is an unsigned integer. If so,
the following nested loop is a waste of time. But I see your point.

-Zhihui

On Wed, Apr 1, 2015 at 7:00 PM, Mel Gorman  wrote:
> On Sat, Mar 28, 2015 at 11:36:02PM -0400, Zhihui Zhang wrote:
>> If kernelcore is not set, then we are working with a very large kernelcore
>> for nothing - no movable zone will be created. If kernelcore is set,
>> then it is not respected at all.
>>
>> Signed-off-by: Zhihui Zhang 
>
> I'm confused. What bug is this patch fixing? What is the user-visible
> impact of the patch?
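
A stand-alone sketch of the unsigned wrap-around being described, with
made-up page counts:

    #include <stdio.h>

    int main(void)
    {
        unsigned long totalpages = 1UL << 20;            /* ~4 GiB of 4 KiB pages */
        unsigned long required_movablecore = 2UL << 20;  /* user asked for more */

        /* wraps around instead of going negative */
        unsigned long corepages = totalpages - required_movablecore;
        printf("corepages = %lu\n", corepages);          /* enormous, not 0 */
        return 0;
    }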


[PATCH] Do not use arbitrary large movablecore to calculate kernelcore

2015-03-28 Thread Zhihui Zhang
If kernelcore is not set, then we are working with a very large kernelcore
for nothing - no movable zone will be created. If kernelcore is set,
then it is not respected at all.

Signed-off-by: Zhihui Zhang 
---
 mm/page_alloc.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40e2942..32bf5da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5199,7 +5199,11 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 */
required_movablecore =
roundup(required_movablecore, MAX_ORDER_NR_PAGES);
-   corepages = totalpages - required_movablecore;
+
+   if (totalpages > required_movablecore)
+   corepages = totalpages - required_movablecore;
+   else
+   corepages = 0;
 
required_kernelcore = max(required_kernelcore, corepages);
}
-- 
1.9.1



[PATCH] [sched] Account the elapse of each period accurately

2015-01-03 Thread Zhihui Zhang
Currently, the decayed values of previous periods can spill into the
lower 10 bits of runnable_avg_period. This makes the next period less
than 1024 us. If we want to decay exactly every 1024 us, which I see no
reason not to (less math overhead and a consistent decay period among
all tasks), we can use a separate field to track how much of the current
period has elapsed instead of overloading runnable_avg_period. This
patch achieves this.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 include/linux/sched.h | 2 +-
 kernel/sched/fair.c   | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..fa6b23b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,7 +1116,7 @@ struct sched_avg {
 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
 * choices of y < 1-2^(-32)*1024.
 */
-   u32 runnable_avg_sum, runnable_avg_period;
+   u32 accrue, runnable_avg_sum, runnable_avg_period;
u64 last_runnable_update;
s64 decay_count;
unsigned long load_avg_contrib;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df2cdf7..c87ecf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -676,6 +676,7 @@ void init_task_runnable_average(struct task_struct *p)
 {
u32 slice;
 
+   p->se.avg.accrue = 0;
p->se.avg.decay_count = 0;
slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
p->se.avg.runnable_avg_sum = slice;
@@ -2526,11 +2527,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->last_runnable_update = now;
 
/* delta_w is the amount already accumulated against our next period */
-   delta_w = sa->runnable_avg_period % 1024;
+   delta_w = sa->accrue;
if (delta + delta_w >= 1024) {
/* period roll-over */
decayed = 1;
 
+   sa->accrue = 0;
/*
 * Now that we know we're crossing a period boundary, figure
 * out how much from delta we need to complete the current
@@ -2558,6 +2560,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->runnable_avg_sum += runnable_contrib;
sa->runnable_avg_period += runnable_contrib;
}
+   sa->accrue += delta;
 
/* Remainder of delta accrued against u_0` */
if (runnable)
-- 
1.9.1
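
A toy illustration of the spill described in the changelog: once the
accumulated period count has been decayed, its low 10 bits no longer track
how far into the current 1024 us period we are. The decay factor below only
approximates the scheduler's y, and the numbers are made up:

    #include <stdio.h>

    int main(void)
    {
        unsigned int runnable_avg_period = 5 * 1024;  /* five complete periods */

        runnable_avg_period = runnable_avg_period * 978 / 1000;  /* decay once */

        /* the old code derives the partial-period progress like this: */
        unsigned int delta_w = runnable_avg_period % 1024;
        printf("delta_w = %u\n", delta_w);  /* not 0, even though a fresh
                                               period has just started, so the
                                               next roll-over comes early */
        return 0;
    }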



Re: [PATCH] [mempolicy] remove unnecessary is_valid_nodemask()

2014-11-17 Thread Zhihui Zhang
The filtering occurs in mpol_set_nodemask(), it reads like this:

if (pol->flags & MPOL_F_RELATIVE_NODES)
mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
else
nodes_and(nsc->mask2, *nodes, nsc->mask1);

So mask2 is based on mask1, and mask2 is only used later when nodes is
not NULL, so we don't care about the case of
(pol->mode == MPOL_PREFERRED && nodes_empty(*nodes)).

-Zhihui

On Mon, Nov 17, 2014 at 6:08 PM, Andrew Morton
 wrote:
> On Sat, 15 Nov 2014 21:49:57 -0500 Zhihui Zhang  wrote:
>
>> When nodes is true, nsc->mask2 has already been filtered by nsc->mask1,
>> which has already factored in node_states[N_MEMORY].
>>
>
> Please be more specific.  Where does that filtering occur?
>
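
A small stand-alone sketch of the set algebra, with plain bitmasks standing
in for nodemasks. The only property relied on is the one stated above: that
node_states[N_MEMORY] is already folded into mask1, so re-intersecting mask2
with N_MEMORY (what is_valid_nodemask() did) can only be empty when mask2
itself is empty. The bit values are made up:

    #include <stdio.h>

    int main(void)
    {
        unsigned long n_memory = 0x0b;      /* nodes that have memory */
        unsigned long allowed = 0x0f;       /* whatever mask1 starts from */
        unsigned long user_nodes = 0x09;    /* nodes passed in by the user */

        unsigned long mask1 = allowed & n_memory;  /* N_MEMORY already folded in */
        unsigned long mask2 = user_nodes & mask1;

        printf("intersects N_MEMORY: %d, non-empty: %d\n",
               (mask2 & n_memory) != 0, mask2 != 0);  /* always equal here */
        return 0;
    }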


[PATCH] [mempolicy] remove unnecessary is_valid_nodemask()

2014-11-15 Thread Zhihui Zhang
When nodes is true, nsc->mask2 has already been filtered by nsc->mask1,
which has already factored in node_states[N_MEMORY].

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 mm/mempolicy.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e58725a..f22c559 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -162,12 +162,6 @@ static const struct mempolicy_operations {
enum mpol_rebind_step step);
 } mpol_ops[MPOL_MAX];
 
-/* Check that the nodemask contains at least one populated zone */
-static int is_valid_nodemask(const nodemask_t *nodemask)
-{
-   return nodes_intersects(*nodemask, node_states[N_MEMORY]);
-}
-
 static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
 {
return pol->flags & MPOL_MODE_FLAGS;
@@ -202,7 +196,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 
 static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 {
-   if (!is_valid_nodemask(nodes))
+   if (nodes_empty(*nodes))
return -EINVAL;
pol->v.nodes = *nodes;
return 0;
@@ -234,7 +228,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
nodes = NULL;   /* explicit local allocation */
else {
if (pol->flags & MPOL_F_RELATIVE_NODES)
-   mpol_relative_nodemask(&nsc->mask2, nodes,&nsc->mask1);
+   mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
else
nodes_and(nsc->mask2, *nodes, nsc->mask1);
 
-- 
1.9.1



Re: [PATCH] [percpu] Make the unit size of the first chunk the same as other chunks

2014-10-29 Thread Zhihui Zhang
I see your point. Thanks.

-Zhihui

On Wed, Oct 29, 2014 at 12:18 AM, Tejun Heo  wrote:
> Please restore lkml cc when replying.
>
> On Tue, Oct 28, 2014 at 08:12:30PM -0400, Zhihui Zhang wrote:
>> My patch just increases the dynamic area in the first chunk slightly
>> to cover the round up surplus. On my 64-bit laptop, it is 12288 bytes.
>
> As I wrote before, it's 12288 bytes on your laptop but it can be much
> larger on other setups.
>
>> It will most likely be used, and in fact, a second chunk will be
>> most likely needed as well.  So in theory you are right, but in
>> practice, it probably won't matter.
>
> If the initial dynamic reserve is too small, which could be the case
> given the overall increase in percpu memory, increase
> PERCPU_DYNAMIC_EARLY_SIZE.
>
> --
> tejun


Re: [PATCH] [percpu] Make the unit size of the first chunk the same as other chunks

2014-10-27 Thread Zhihui Zhang
In pcpu_embed_first_chunk(), we allocate full unit size for each CPU
in the first chunk:

1981 /* allocate space for the whole group */
1982 ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size,
atom_size);
1983 if (!ptr) {
1984 rc = -ENOMEM;
1985 goto out_free_areas;
1986 }


Later we free the unused part:


2009 /* copy and return the unused part */
2010 memcpy(ptr, __per_cpu_load, ai->static_size);
2011 free_fn(ptr + size_sum, ai->unit_size - size_sum);

I am trying to make each CPU have a full unit size in the first
chunk, the same as in all other chunks. Does this make sense?

-Zhihui

On Mon, Oct 27, 2014 at 10:08 AM, Tejun Heo  wrote:
> On Sat, Oct 25, 2014 at 11:05:58AM -0400, Zhihui Zhang wrote:
>> Since we have already allocated the full unit size for the first chunk, we 
>> might as well use
>> it so that the unit size are the same for all chunks. The page first chunk 
>> allocator already
>> has this effect because it allocates one page at a time.
>
> I'm not following.  Where do we allocate the full unit size for the
> first chunk?
>
> Thanks.
>
> --
> tejun


[PATCH] [percpu] Make the unit size of the first chunk the same as other chunks

2014-10-25 Thread Zhihui Zhang
Since we have already allocated the full unit size for the first chunk,
we might as well use it so that the unit size is the same for all
chunks. The page first chunk allocator already has this effect because
it allocates one page at a time.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 mm/percpu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 014bab6..7242360 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1960,6 +1960,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
return PTR_ERR(ai);
 
size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
+   ai->dyn_size += ai->unit_size - size_sum;
areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));
 
areas = memblock_virt_alloc_nopanic(areas_size, 0);
@@ -2006,9 +2007,8 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
free_fn(ptr, ai->unit_size);
continue;
}
-   /* copy and return the unused part */
+   /* copy static data */
memcpy(ptr, __per_cpu_load, ai->static_size);
-   free_fn(ptr + size_sum, ai->unit_size - size_sum);
}
}
 
@@ -2034,7 +2034,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
}
 
pr_info("PERCPU: Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
-   PFN_DOWN(size_sum), base, ai->static_size, ai->reserved_size,
+   PFN_DOWN(ai->unit_size), base, ai->static_size, ai->reserved_size,
ai->dyn_size, ai->unit_size);
 
rc = pcpu_setup_first_chunk(ai, base);
-- 
1.8.1.2
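
A quick arithmetic sketch of what the one-line dyn_size change does. The
12288-byte round-up surplus is the figure quoted in the follow-up discussion
above for one particular laptop; the other sizes are made up:

    #include <stdio.h>

    int main(void)
    {
        size_t static_size = 126976, reserved_size = 8192, dyn_size = 20480;
        size_t size_sum = static_size + reserved_size + dyn_size;   /* 155648 */
        size_t unit_size = size_sum + 12288;     /* rounded-up allocation unit */

        /* before: the 12288-byte tail was freed back; after: it is kept */
        dyn_size += unit_size - size_sum;
        printf("dyn_size grows to %zu bytes\n", dyn_size);          /* 32768 */
        return 0;
    }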



[tip:sched/core] sched: Clean up some typos and grammatical errors in code/comments

2014-09-21 Thread tip-bot for Zhihui Zhang
Commit-ID:  9c58c79a8a76c510cd3a5012c536d4fe3c81ec3b
Gitweb: http://git.kernel.org/tip/9c58c79a8a76c510cd3a5012c536d4fe3c81ec3b
Author: Zhihui Zhang 
AuthorDate: Sat, 20 Sep 2014 21:24:36 -0400
Committer:  Ingo Molnar 
CommitDate: Sun, 21 Sep 2014 09:00:02 +0200

sched: Clean up some typos and grammatical errors in code/comments

Signed-off-by: Zhihui Zhang 
Cc: pet...@infradead.org
Link: http://lkml.kernel.org/r/1411262676-19928-1-git-send-email-zzhs...@gmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  | 4 ++--
 kernel/sched/fair.c  | 6 +++---
 kernel/sched/sched.h | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61ee2b3..a284190 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8069,7 +8069,7 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
struct cfs_bandwidth *parent_b = &tg->parent->cfs_bandwidth;
 
quota = normalize_cfs_quota(tg, d);
-   parent_quota = parent_b->hierarchal_quota;
+   parent_quota = parent_b->hierarchical_quota;
 
/*
 * ensure max(child_quota) <= parent_quota, inherit when no
@@ -8080,7 +8080,7 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
else if (parent_quota != RUNTIME_INF && quota > parent_quota)
return -EINVAL;
}
-   cfs_b->hierarchal_quota = quota;
+   cfs_b->hierarchical_quota = quota;
 
return 0;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74fa2c2..2a1e6ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2224,8 +2224,8 @@ static __always_inline u64 decay_load(u64 val, u64 n)
 
/*
 * As y^PERIOD = 1/2, we can combine
-*y^n = 1/2^(n/PERIOD) * k^(n%PERIOD)
-* With a look-up table which covers k^n (n<PERIOD)
+*y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+* With a look-up table which covers y^n (n<PERIOD)
 *
 * To achieve constant time decay_load.
 */
@@ -6410,7 +6410,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;
 
/*
-* If the local group is more busy than the selected busiest group
+* If the local group is busier than the selected busiest group
 * don't try and pull any tasks.
 */
if (local->avg_load >= busiest->avg_load)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa0f73b..1bc6aad 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -188,7 +188,7 @@ struct cfs_bandwidth {
raw_spinlock_t lock;
ktime_t period;
u64 quota, runtime;
-   s64 hierarchal_quota;
+   s64 hierarchical_quota;
u64 runtime_expires;
 
int idle, timer_active;


[PATCH] [sched] Clean up some typos and grammatical errors in code/comments

2014-09-20 Thread Zhihui Zhang
Well, the subject line says it all.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 kernel/sched/core.c  | 4 ++--
 kernel/sched/fair.c  | 6 +++---
 kernel/sched/sched.h | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1a286..eb5505f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8005,7 +8005,7 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
struct cfs_bandwidth *parent_b = &tg->parent->cfs_bandwidth;
 
quota = normalize_cfs_quota(tg, d);
-   parent_quota = parent_b->hierarchal_quota;
+   parent_quota = parent_b->hierarchical_quota;
 
/*
 * ensure max(child_quota) <= parent_quota, inherit when no
@@ -8016,7 +8016,7 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
else if (parent_quota != RUNTIME_INF && quota > parent_quota)
return -EINVAL;
}
-   cfs_b->hierarchal_quota = quota;
+   cfs_b->hierarchical_quota = quota;
 
return 0;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..6d83845 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2211,8 +2211,8 @@ static __always_inline u64 decay_load(u64 val, u64 n)
 
/*
 * As y^PERIOD = 1/2, we can combine
-*y^n = 1/2^(n/PERIOD) * k^(n%PERIOD)
-* With a look-up table which covers k^n (n<PERIOD)
+*y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+* With a look-up table which covers y^n (n<PERIOD)
 *
 * To achieve constant time decay_load.
 */
@@ -6346,7 +6346,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
goto force_balance;
 
/*
-* If the local group is more busy than the selected busiest group
+* If the local group is busier than the selected busiest group
 * don't try and pull any tasks.
 */
if (local->avg_load >= busiest->avg_load)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f..80b124d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -184,7 +184,7 @@ struct cfs_bandwidth {
raw_spinlock_t lock;
ktime_t period;
u64 quota, runtime;
-   s64 hierarchal_quota;
+   s64 hierarchical_quota;
u64 runtime_expires;
 
int idle, timer_active;
-- 
1.8.1.2



[tip:sched/core] sched: Rename a misleading variable in build_overlap_sched_groups()

2014-08-12 Thread tip-bot for Zhihui Zhang
Commit-ID:  aaecac4ad46b35ad308245384d019633fb9bc21b
Gitweb: http://git.kernel.org/tip/aaecac4ad46b35ad308245384d019633fb9bc21b
Author: Zhihui Zhang <zzhs...@gmail.com>
AuthorDate: Fri, 1 Aug 2014 21:18:03 -0400
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 12 Aug 2014 12:48:21 +0200

sched: Rename a misleading variable in build_overlap_sched_groups()

The child variable in build_overlap_sched_groups() actually refers to the
peer or sibling domain of the given CPU. Rename it to sibling to be consistent
with the naming in build_group_mask().

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: 
http://lkml.kernel.org/r/1406942283-18249-1-git-send-email-zzhs...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/core.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1211575..7d1ec6e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5739,7 +5739,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int 
cpu)
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
-   struct sched_domain *child;
+   struct sched_domain *sibling;
int i;
 
cpumask_clear(covered);
@@ -5750,10 +5750,10 @@ build_overlap_sched_groups(struct sched_domain *sd, int 
cpu)
if (cpumask_test_cpu(i, covered))
continue;
 
-   child = *per_cpu_ptr(sdd->sd, i);
+   sibling = *per_cpu_ptr(sdd->sd, i);
 
/* See the comment near build_group_mask(). */
-   if (!cpumask_test_cpu(i, sched_domain_span(child)))
+   if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;
 
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
@@ -5763,10 +5763,9 @@ build_overlap_sched_groups(struct sched_domain *sd, int 
cpu)
goto fail;
 
sg_span = sched_group_cpus(sg);
-   if (child->child) {
-   child = child->child;
-   cpumask_copy(sg_span, sched_domain_span(child));
-   } else
+   if (sibling->child)
+   cpumask_copy(sg_span, 
sched_domain_span(sibling->child));
+   else
cpumask_set_cpu(i, sg_span);
 
cpumask_or(covered, covered, sg_span);


[PATCH] [sched] Rename a misleading variable in build_overlap_sched_groups()

2014-08-01 Thread Zhihui Zhang
The child variable in build_overlap_sched_groups() actually refers to the
peer or sibling domain of the given CPU. Rename it to sibling to be consistent
with the naming in build_group_mask().

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 kernel/sched/core.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bc1638b..8ba66006 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5720,7 +5720,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int 
cpu)
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
-   struct sched_domain *child;
+   struct sched_domain *sibling;
int i;
 
cpumask_clear(covered);
@@ -5731,10 +5731,10 @@ build_overlap_sched_groups(struct sched_domain *sd, int 
cpu)
if (cpumask_test_cpu(i, covered))
continue;
 
-   child = *per_cpu_ptr(sdd->sd, i);
+   sibling = *per_cpu_ptr(sdd->sd, i);
 
/* See the comment near build_group_mask(). */
-   if (!cpumask_test_cpu(i, sched_domain_span(child)))
+   if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
continue;
 
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
@@ -5744,10 +5744,9 @@ build_overlap_sched_groups(struct sched_domain *sd, int 
cpu)
goto fail;
 
sg_span = sched_group_cpus(sg);
-   if (child->child) {
-   child = child->child;
-   cpumask_copy(sg_span, sched_domain_span(child));
-   } else
+   if (sibling->child)
+   cpumask_copy(sg_span, 
sched_domain_span(sibling->child));
+   else
cpumask_set_cpu(i, sg_span);
 
cpumask_or(covered, covered, sg_span);
-- 
1.8.1.2



Re: [PATCH] [sched] Don't account time after deadline twice

2014-07-03 Thread Zhihui Zhang
We calculate the difference between two readings of a clock to see how
much time has elapsed.  Part of the interval rq_clock(rq) -
dl_se->deadline can indeed be accounted for by reading a different
clock (i.e., rq_clock_task()) if the task was running during that
period.  And that is how dl_se->runtime is obtained.  After all, both
clocks are running independently, right?  Furthermore, the caller of
dl_runtime_exceeded() will still use rq_clock() and dl_se->deadline to
determine whether we throttle or replenish.  Anyway, I have failed to
see any stealing of time.  Could you please give a concrete example
(perhaps with numbers)?

thanks,

-Zhihui

On Thu, Jul 3, 2014 at 5:50 AM, Juri Lelli <juri.le...@gmail.com> wrote:
> On Wed, 2 Jul 2014 19:44:04 -0400
> Zhihui Zhang <zzhs...@gmail.com> wrote:
>
>> My point is that rq_clock(rq) - dl_se->deadline is already part of
>> dl_se->runtime, which is decremented before calling dl_runtime_exceeded().
>
> But, we decrement dl_se->runtime looking at rq_clock_task(rq), that is
> in general <= rq_clock(rq), that we use to handle deadlines. So, if we
> do like you suggest, in some cases we could end up stealing some
> bandwidth from the system. Indeed, we prefer some pessimism here.
>
> Thanks,
>
> - Juri
>
>> So the following line is not needed in the case of both overrun and missing
>> deadline:
>>
>> dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
>>
>> Or did I miss anything?
>>
>> thanks,
>>
>>
>> On Tue, Jul 1, 2014 at 9:59 AM, Juri Lelli <juri.le...@gmail.com> wrote:
>>
>> > On Tue, 1 Jul 2014 15:08:16 +0200
>> > Peter Zijlstra <pet...@infradead.org> wrote:
>> >
>> > > On Sun, Jun 29, 2014 at 09:26:10PM -0400, Zhihui Zhang wrote:
>> > > > Unless we want to double-penalize an overrun task, the time after the
>> > deadline
>> > > > and before the current time is already accounted in the negative
>> > dl_se->runtime
>> > > > value. So we can leave it as is in the case of dmiss && rorun.
>> > >
>> > > Juri?
>> > >
>> > > > Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
>> > > > ---
>> > > >  kernel/sched/deadline.c | 6 ++
>> > > >  1 file changed, 2 insertions(+), 4 deletions(-)
>> > > >
>> > > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> > > > index fc4f98b1..67df0d6 100644
>> > > > --- a/kernel/sched/deadline.c
>> > > > +++ b/kernel/sched/deadline.c
>> > > > @@ -579,10 +579,8 @@ int dl_runtime_exceeded(struct rq *rq, struct
>> > sched_dl_entity *dl_se)
>> > > >  * the next instance. Thus, if we do not account that, we are
>> > > >  * stealing bandwidth from the system at each deadline miss!
>> > > >  */
>> > > > -   if (dmiss) {
>> > > > -   dl_se->runtime = rorun ? dl_se->runtime : 0;
>> >
>> > If we didn't return 0 before, we are going to throttle (or replenish)
>> > the entity, and you want runtime to be <=0. So, this is needed.
>> >
>> > > > -   dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
>> > > > -   }
>> >
>> > A little pessimism in some cases, due to the fact that we use both
>> > rq_clock and rq_clock_task (for the budget).
>> >
>> > Thanks,
>> >
>> > - Juri
>> >
>> > > > +   if (dmiss && !rorun)
>> > > > +   dl_se->runtime = dl_se->deadline - rq_clock(rq);
>> > > >
>> > > > return 1;
>> > > >  }
>> > > > --
>> > > > 1.8.1.2
>> > > >
>> >
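
A worked example with invented numbers may make the rq_clock() vs
rq_clock_task() point in this thread concrete.  The snippet below only mimics
the arithmetic of dl_runtime_exceeded() for the dmiss && rorun case; it is not
kernel code and every value in it is made up.

#include <stdio.h>

int main(void)
{
	/* all values in microseconds, invented for illustration */
	long long budget   = 5000;	/* runtime per instance                */
	long long deadline = 10000;	/* absolute deadline (rq_clock domain) */
	long long start    = 5000;	/* instance started running here       */
	long long now      = 12000;	/* rq_clock() when the check runs      */
	long long irq_time = 1500;	/* time excluded by rq_clock_task()    */

	/* runtime is decremented with rq_clock_task() deltas */
	long long task_consumed = (now - start) - irq_time;	/* 5500          */
	long long runtime       = budget - task_consumed;	/* -500: rorun   */
	long long wall_overrun  = now - deadline;		/* 2000: dmiss   */

	/* mainline: on a deadline miss, also charge the wall-clock overrun */
	printf("mainline runtime: %lld us\n", runtime - wall_overrun);	/* -2500 */

	/* proposed patch: keep only what the task clock accounted */
	printf("proposed runtime: %lld us\n", runtime);			/* -500  */

	return 0;
}

The 2000 us that elapsed on the wall clock past the deadline are only partly
visible to the task clock (a 500 us budget overrun); replenishing from -500
rather than -2500 hands the difference back to the task at its next instance,
which appears to be the bandwidth stealing Juri describes, while the mainline
behaviour errs on the pessimistic side, as he also notes.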


[PATCH] [sched] Don't account time after deadline twice

2014-06-29 Thread Zhihui Zhang
Unless we want to double-penalize an overrun task, the time after the deadline
and before the current time is already accounted in the negative dl_se->runtime
value. So we can leave it as is in the case of dmiss && rorun.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 kernel/sched/deadline.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fc4f98b1..67df0d6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -579,10 +579,8 @@ int dl_runtime_exceeded(struct rq *rq, struct 
sched_dl_entity *dl_se)
 * the next instance. Thus, if we do not account that, we are
 * stealing bandwidth from the system at each deadline miss!
 */
-   if (dmiss) {
-   dl_se->runtime = rorun ? dl_se->runtime : 0;
-   dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
-   }
+   if (dmiss && !rorun)
+   dl_se->runtime = dl_se->deadline - rq_clock(rq);
 
return 1;
 }
-- 
1.8.1.2



[PATCH] Use LOAD_PHYSICAL_ADDR in vmlinux.lds.S

2014-01-26 Thread Zhihui Zhang
This unifies the way to specify the start VMA on both 32-bit and 64-bit platforms.
I would like to remove __PHYSICAL_START as well, but that appears to be harder.

Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 arch/x86/kernel/vmlinux.lds.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index da6b35a..e81bf49 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -81,11 +81,10 @@ PHDRS {
 
 SECTIONS
 {
-#ifdef CONFIG_X86_32
 . = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
+#ifdef CONFIG_X86_32
 phys_startup_32 = startup_32 - LOAD_OFFSET;
 #else
-. = __START_KERNEL;
 phys_startup_64 = startup_64 - LOAD_OFFSET;
 #endif
 
-- 
1.8.1.2



[PATCH] F2FS: Fix the logic of IS_DNODE()

2013-04-07 Thread Zhihui Zhang
Signed-off-by: Zhihui Zhang <zzhs...@gmail.com>
---
 fs/f2fs/node.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/node.h b/fs/f2fs/node.h
index afdb130..2be47b2 100644
--- a/fs/f2fs/node.h
+++ b/fs/f2fs/node.h
@@ -239,7 +239,7 @@ static inline bool IS_DNODE(struct page *node_page)
return false;
if (ofs >= 6 + 2 * NIDS_PER_BLOCK) {
ofs -= 6 + 2 * NIDS_PER_BLOCK;
-   if ((long int)ofs % (NIDS_PER_BLOCK + 1))
+   if (!((long int)ofs % (NIDS_PER_BLOCK + 1)))
return false;
}
return true;
-- 
1.7.9.5
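
For reference, a stand-alone sketch of what the two versions of the check
above classify; it makes no claim about which one matches the on-disk node
layout, which is exactly what the change is about.  NIDS_PER_BLOCK is assumed
to be 1018 here, and the earlier checks in IS_DNODE() are left out.

#include <stdio.h>
#include <stdbool.h>

#define NIDS_PER_BLOCK 1018	/* assumed value, for illustration only */

static bool is_dnode_before(unsigned int ofs)	/* logic before the patch */
{
	if (ofs >= 6 + 2 * NIDS_PER_BLOCK) {
		ofs -= 6 + 2 * NIDS_PER_BLOCK;
		if (ofs % (NIDS_PER_BLOCK + 1))
			return false;
	}
	return true;
}

static bool is_dnode_after(unsigned int ofs)	/* logic after the patch */
{
	if (ofs >= 6 + 2 * NIDS_PER_BLOCK) {
		ofs -= 6 + 2 * NIDS_PER_BLOCK;
		if (!(ofs % (NIDS_PER_BLOCK + 1)))
			return false;
	}
	return true;
}

int main(void)
{
	unsigned int ofs;

	/* offsets just past the 6 + 2 * NIDS_PER_BLOCK boundary (2042) */
	for (ofs = 2042; ofs < 2047; ofs++)
		printf("ofs=%u  before=%d  after=%d\n",
		       ofs, is_dnode_before(ofs), is_dnode_after(ofs));
	return 0;
}

Before the change, only every (NIDS_PER_BLOCK + 1)-th offset past that
boundary is reported as a direct node; after it, exactly those offsets are
excluded instead.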
