[tip:sched/core] sched/isolation: Prefer housekeeping CPU in local node

2019-07-25 Thread tip-bot for Wanpeng Li
Commit-ID:  e0e8d4911ed2695b12c3a01c15634000ede9bc73
Gitweb: https://git.kernel.org/tip/e0e8d4911ed2695b12c3a01c15634000ede9bc73
Author: Wanpeng Li 
AuthorDate: Fri, 28 Jun 2019 16:51:41 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 25 Jul 2019 15:51:55 +0200

sched/isolation: Prefer housekeeping CPU in local node

In a real production setup there are housekeeping CPUs in each node; it is
preferable to do housekeeping from the local node, and to fall back to the
global online cpumask if no housekeeping CPU can be found in the local node.
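
A hypothetical example (not from the patch): on a two-node, 16-CPU machine
booted with nohz_full=1-7,9-15, CPUs 0 and 8 are the housekeepers. With this
change, housekeeping_any_cpu() called from CPU 12 (node 1) returns CPU 8, the
housekeeper in its own node, rather than possibly CPU 0 on the remote node;
the cpumask_any_and(housekeeping_mask, cpu_online_mask) fallback is only used
when the NUMA-level search finds no suitable CPU at all.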

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Srikar Dronamraju 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
https://lkml.kernel.org/r/1561711901-4755-2-git-send-email-wanpen...@tencent.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/isolation.c | 12 ++--
 kernel/sched/sched.h |  8 +---
 kernel/sched/topology.c  | 20 
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ccb28085b114..9fcb2a695a41 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -22,9 +22,17 @@ EXPORT_SYMBOL_GPL(housekeeping_enabled);
 
 int housekeeping_any_cpu(enum hk_flags flags)
 {
-   if (static_branch_unlikely(&housekeeping_overridden))
-   if (housekeeping_flags & flags)
+   int cpu;
+
+   if (static_branch_unlikely(&housekeeping_overridden)) {
+   if (housekeeping_flags & flags) {
+   cpu = sched_numa_find_closest(housekeeping_mask, smp_processor_id());
+   if (cpu < nr_cpu_ids)
+   return cpu;
+
return cpumask_any_and(housekeeping_mask, cpu_online_mask);
+   }
+   }
return smp_processor_id();
 }
 EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aaca0e743776..16126efd14ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1262,16 +1262,18 @@ enum numa_topology_type {
 extern enum numa_topology_type sched_numa_topology_type;
 extern int sched_max_numa_distance;
 extern bool find_numa_distance(int distance);
-#endif
-
-#ifdef CONFIG_NUMA
 extern void sched_init_numa(void);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
+extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
 #else
 static inline void sched_init_numa(void) { }
 static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
 static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
+static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
+{
+   return nr_cpu_ids;
+}
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f751ce0b783e..4eea2c9bc732 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1724,6 +1724,26 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
}
 }
 
+/*
+ * sched_numa_find_closest() - given the NUMA topology, find the cpu
+ * closest to @cpu from @cpumask.
+ * cpumask: cpumask to find a cpu from
+ * cpu: cpu to be close to
+ *
+ * returns: cpu, or nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
+{
+   int i, j = cpu_to_node(cpu);
+
+   for (i = 0; i < sched_domains_numa_levels; i++) {
+   cpu = cpumask_any_and(cpus, sched_domains_numa_masks[i][j]);
+   if (cpu < nr_cpu_ids)
+   return cpu;
+   }
+   return nr_cpu_ids;
+}
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)


[tip:sched/urgent] sched/cputime: Don't use smp_processor_id() in preemptible context

2017-07-14 Thread tip-bot for Wanpeng Li
Commit-ID:  0e4097c3354e2f5a5ad8affd9dc7f7f7d00bb6b9
Gitweb: http://git.kernel.org/tip/0e4097c3354e2f5a5ad8affd9dc7f7f7d00bb6b9
Author: Wanpeng Li 
AuthorDate: Sun, 9 Jul 2017 00:40:28 -0700
Committer:  Ingo Molnar 
CommitDate: Fri, 14 Jul 2017 10:27:15 +0200

sched/cputime: Don't use smp_processor_id() in preemptible context

Recent kernels trigger this warning:

 BUG: using smp_processor_id() in preemptible [] code: 99-trinity/181
 caller is debug_smp_processor_id+0x17/0x19
 CPU: 0 PID: 181 Comm: 99-trinity Not tainted 4.12.0-01059-g2a42eb9 #1
 Call Trace:
  dump_stack+0x82/0xb8
  check_preemption_disabled()
  debug_smp_processor_id()
  vtime_delta()
  task_cputime()
  thread_group_cputime()
  thread_group_cputime_adjusted()
  wait_consider_task()
  do_wait()
  SYSC_wait4()
  do_syscall_64()
  entry_SYSCALL64_slow_path()

As Frederic pointed out:

| Although those sched_clock_cpu() things seem to only matter when the
| sched_clock() is unstable. And that stability is a condition for nohz_full
| to work anyway. So probably sched_clock() alone would be enough.

This patch fixes it by replacing sched_clock_cpu() with sched_clock() to
avoid calling smp_processor_id() in a preemptible context.
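
A minimal illustration of the difference (not from the patch): with
CONFIG_DEBUG_PREEMPT enabled, smp_processor_id() warns when called with
preemption enabled, whereas sched_clock() needs no CPU argument at all:

    u64 t;

    preempt_disable();
    t = sched_clock_cpu(smp_processor_id());  /* fine: the CPU cannot change here */
    preempt_enable();

    t = sched_clock();                        /* fine even in preemptible context */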

Reported-by: Xiaolong Ye 
Signed-off-by: Wanpeng Li 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Luiz Capitulino 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1499586028-7402-1-git-send-email-wanpeng...@hotmail.com
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cputime.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6e3ea4a..14d2dbf 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -683,7 +683,7 @@ static u64 vtime_delta(struct vtime *vtime)
 {
unsigned long long clock;
 
-   clock = sched_clock_cpu(smp_processor_id());
+   clock = sched_clock();
if (clock < vtime->starttime)
return 0;
 
@@ -814,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
write_seqcount_begin(&vtime->seqcount);
vtime->state = VTIME_SYS;
-   vtime->starttime = sched_clock_cpu(smp_processor_id());
+   vtime->starttime = sched_clock();
write_seqcount_end(&vtime->seqcount);
 }
 
@@ -826,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
local_irq_save(flags);
write_seqcount_begin(&vtime->seqcount);
vtime->state = VTIME_SYS;
-   vtime->starttime = sched_clock_cpu(cpu);
+   vtime->starttime = sched_clock();
write_seqcount_end(&vtime->seqcount);
local_irq_restore(flags);
 }


[tip:sched/urgent] sched/cputime: Accumulate vtime on top of nsec clocksource

2017-07-05 Thread tip-bot for Wanpeng Li
Commit-ID:  2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Gitweb: http://git.kernel.org/tip/2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Author: Wanpeng Li 
AuthorDate: Thu, 29 Jun 2017 19:15:11 +0200
Committer:  Ingo Molnar 
CommitDate: Wed, 5 Jul 2017 09:54:15 +0200

sched/cputime: Accumulate vtime on top of nsec clocksource

Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.

This system works ok if the ticks are not aligned across CPUs. If they
instead are aligned (ie: all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:

CPU 0 (timekeeper)                  CPU 1
------------------                  -----
                  jiffies = N
...                                 run in userspace for a jiffy
tick entry                          tick entry (sees jiffies = N)
set jiffies = N + 1
tick exit                           tick exit (sees jiffies = N + 1)
                                    account 1 jiffy as stime

Fix this by using a nanosecond clock source instead of jiffies. The
cputime is then accumulated and flushed every time the pending delta
reaches a jiffy, in order to mitigate the accounting overhead.
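
A condensed sketch of the accumulate-and-flush scheme (essentially
__vtime_account_system() from the patch below):

    delta = get_vtime_delta(vtime);      /* ns elapsed since the last snapshot */
    vtime->stime += delta;               /* accumulate in struct vtime */
    if (vtime->stime >= TICK_NSEC) {     /* flush roughly once per jiffy */
            account_system_time(tsk, irq_count(), vtime->stime);
            vtime->stime = 0;
    }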

[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
  on cputime readers, keep idle vtime as-is (low overhead accounting),
  harmonize clock sources. ]

Suggested-by: Thomas Gleixner 
Reported-by: Luiz Capitulino 
Tested-by: Luiz Capitulino 
Signed-off-by: Wanpeng Li 
Signed-off-by: Frederic Weisbecker 
Reviewed-by: Thomas Gleixner 
Acked-by: Rik van Riel 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Wanpeng Li 
Link: 
http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweis...@gmail.com
Signed-off-by: Ingo Molnar 
---
 include/linux/sched.h  |  3 +++
 kernel/sched/cputime.c | 64 +-
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eeff8a0..4818126 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -236,6 +236,9 @@ struct vtime {
seqcount_t  seqcount;
unsigned long long  starttime;
enum vtime_statestate;
+   u64 utime;
+   u64 stime;
+   u64 gtime;
 };
 
 struct sched_info {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9ee725e..6e3ea4a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -681,18 +681,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, 
u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct vtime *vtime)
 {
-   unsigned long now = READ_ONCE(jiffies);
+   unsigned long long clock;
 
-   if (time_before(now, (unsigned long)vtime->starttime))
+   clock = sched_clock_cpu(smp_processor_id());
+   if (clock < vtime->starttime)
return 0;
 
-   return jiffies_to_nsecs(now - vtime->starttime);
+   return clock - vtime->starttime;
 }
 
 static u64 get_vtime_delta(struct vtime *vtime)
 {
-   unsigned long now = READ_ONCE(jiffies);
-   u64 delta, other;
+   u64 delta = vtime_delta(vtime);
+   u64 other;
 
/*
 * Unlike tick based timing, vtime based timing never has lost
@@ -701,17 +702,31 @@ static u64 get_vtime_delta(struct vtime *vtime)
 * elapsed time. Limit account_other_time to prevent rounding
 * errors from causing elapsed vtime to go negative.
 */
-   delta = jiffies_to_nsecs(now - vtime->starttime);
other = account_other_time(delta);
WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
-   vtime->starttime = now;
+   vtime->starttime += delta;
 
return delta - other;
 }
 
-static void __vtime_account_system(struct task_struct *tsk)
+static void __vtime_account_system(struct task_struct *tsk,
+  struct vtime *vtime)
 {
-   account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
+   vtime->stime += get_vtime_delta(vtime);
+   if (vtime->stime >= TICK_NSEC) {
+   account_system_time(tsk, irq_count(), vtime->stime);
+   vtime->stime = 0;
+   }
+}
+
+static void vtime_account_guest(struct task_struct *tsk,
+   struct vtime *vtime)
+{
+   vtime->gtime += get_vtime_delta(vtime);
+   if (vtime->gtime >= TICK_NSEC) {
+   account_guest_time(tsk, vtime->gtime);
+   vtime->gtime = 0;
+   }
 }

[tip:sched/core] sched/core: Fix rq lock pinning warning after call balance callbacks

2017-03-23 Thread tip-bot for Wanpeng Li
Commit-ID:  d7921a5ddab8d30e06e321f37eec629f23797486
Gitweb: http://git.kernel.org/tip/d7921a5ddab8d30e06e321f37eec629f23797486
Author: Wanpeng Li 
AuthorDate: Thu, 16 Mar 2017 19:45:19 -0700
Committer:  Ingo Molnar 
CommitDate: Thu, 23 Mar 2017 07:44:51 +0100

sched/core: Fix rq lock pinning warning after call balance callbacks

This can be reproduced by running rt-migrate-test:

 WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock()
 unpinning an unpinned lock
 ...
 Call Trace:
  dump_stack()
  __warn()
  warn_slowpath_fmt()
  lock_unpin_lock()
  __balance_callback()
  __schedule()
  schedule()
  futex_wait_queue_me()
  futex_wait()
  do_futex()
  SyS_futex()
  do_syscall_64()
  entry_SYSCALL64_slow_path()

Revert the rq_lock_irqsave() usage here; the whole point of
balance_callback() was to allow dropping rq->lock.

Reported-by: Fengguang Wu 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: 8a8c69c32778 ("sched/core: Add rq->lock wrappers")
Link: 
http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c762f62..ab9f6ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2776,9 +2776,9 @@ static void __balance_callback(struct rq *rq)
 {
struct callback_head *head, *next;
void (*func)(struct rq *rq);
-   struct rq_flags rf;
+   unsigned long flags;
 
-   rq_lock_irqsave(rq, &rf);
+   raw_spin_lock_irqsave(&rq->lock, flags);
head = rq->balance_callback;
rq->balance_callback = NULL;
while (head) {
@@ -2789,7 +2789,7 @@ static void __balance_callback(struct rq *rq)
 
func(rq);
}
-   rq_unlock_irqrestore(rq, &rf);
+   raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 static inline void balance_callback(struct rq *rq)


[tip:sched/core] sched/deadline: Add missing update_rq_clock() in dl_task_timer()

2017-03-16 Thread tip-bot for Wanpeng Li
Commit-ID:  dcc3b5ffe1b32771c9a22e2c916fb94c4fcf5b79
Gitweb: http://git.kernel.org/tip/dcc3b5ffe1b32771c9a22e2c916fb94c4fcf5b79
Author: Wanpeng Li 
AuthorDate: Mon, 6 Mar 2017 21:51:28 -0800
Committer:  Ingo Molnar 
CommitDate: Thu, 16 Mar 2017 09:20:59 +0100

sched/deadline: Add missing update_rq_clock() in dl_task_timer()

The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:

 [ cut here ]
 WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 
replenish_dl_entity+0x71e/0xc40
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 7 PID: 0 Comm: swapper/7 Tainted: GB   4.11.0-rc1+ #24
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 
02/16/2016
 Call Trace:
  
  dump_stack+0x85/0xc4
  __warn+0x172/0x1b0
  warn_slowpath_fmt+0xb4/0xf0
  ? __warn+0x1b0/0x1b0
  ? debug_check_no_locks_freed+0x2c0/0x2c0
  ? cpudl_set+0x3d/0x2b0
  replenish_dl_entity+0x71e/0xc40
  enqueue_task_dl+0x2ea/0x12e0
  ? dl_task_timer+0x777/0x990
  ? __hrtimer_run_queues+0x270/0xa50
  dl_task_timer+0x316/0x990
  ? enqueue_task_dl+0x12e0/0x12e0
  ? enqueue_task_dl+0x12e0/0x12e0
  __hrtimer_run_queues+0x270/0xa50
  ? hrtimer_cancel+0x20/0x20
  ? hrtimer_interrupt+0x119/0x600
  hrtimer_interrupt+0x19c/0x600
  ? trace_hardirqs_off+0xd/0x10
  local_apic_timer_interrupt+0x74/0xe0
  smp_apic_timer_interrupt+0x76/0xa0
  apic_timer_interrupt+0x93/0xa0

The DL task will be migrated to a suitable later-deadline rq once the DL
timer fires and the current rq is offline. The rq clock of the new rq should
be updated. This patch fixes it by updating the rq clock after taking the
new rq's lock.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Matt Fleming 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 99b2c33..c6db3fd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -638,6 +638,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer 
*timer)
lockdep_unpin_lock(&rq->lock, rf.cookie);
rq = dl_task_offline_migration(rq, p);
rf.cookie = lockdep_pin_lock(&rq->lock);
+   update_rq_clock(rq);
 
/*
 * Now that the task has been migrated to the new RQ and we


[tip:sched/urgent] sched/fair: Update rq clock before changing a task's CPU affinity

2017-02-24 Thread tip-bot for Wanpeng Li
Commit-ID:  a499c3ead88ccf147fc50689e85a530ad923ce36
Gitweb: http://git.kernel.org/tip/a499c3ead88ccf147fc50689e85a530ad923ce36
Author: Wanpeng Li 
AuthorDate: Tue, 21 Feb 2017 23:52:55 -0800
Committer:  Ingo Molnar 
CommitDate: Fri, 24 Feb 2017 08:58:33 +0100

sched/fair: Update rq clock before changing a task's CPU affinity

This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:

 [ cut here ]
 WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 
02/16/2016
 Call Trace:
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  set_next_entity+0x11d/0x380
  set_curr_task_fair+0x2b/0x60
  do_set_cpus_allowed+0x139/0x180
  __set_cpus_allowed_ptr+0x113/0x260
  set_cpus_allowed_ptr+0x10/0x20
  torture_shuffle+0xfd/0x180
  kthread+0x10f/0x150
  ? torture_shutdown_init+0x60/0x60
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40
 ---[ end trace dd94d92344cea9c6 ]---

The task is running && !queued, so there is no rq clock update before calling
set_curr_task().

This patch fixes it by updating the rq clock after taking rq->lock/pi_lock,
just as the other dequeue + put_prev + enqueue + set_curr paths do.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Matt Fleming 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d6e828..cc1e3e0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1087,6 +1087,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
int ret = 0;
 
rq = task_rq_lock(p, &rf);
+   update_rq_clock(rq);
 
if (p->flags & PF_KTHREAD) {
/*


[tip:x86/apic] x86/apic: Prevent tracing on apic_msr_write_eoi()

2016-11-09 Thread tip-bot for Wanpeng Li
Commit-ID:  8ca225520e278e41396dab0524989f4848626f83
Gitweb: http://git.kernel.org/tip/8ca225520e278e41396dab0524989f4848626f83
Author: Wanpeng Li 
AuthorDate: Mon, 7 Nov 2016 11:13:40 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 9 Nov 2016 22:03:14 +0100

x86/apic: Prevent tracing on apic_msr_write_eoi()

The following RCU lockdep warning led to adding irq_enter()/irq_exit() into
smp_reschedule_interrupt():

 RCU used illegally from idle CPU!
 rcu_scheduler_active = 1, debug_locks = 0
 RCU used illegally from extended quiescent state!
 no locks held by swapper/1/0.
 
  do_trace_write_msr
  native_write_msr
  native_apic_msr_eoi_write
  smp_reschedule_interrupt
  reschedule_interrupt

As Peterz pointed out:

| So now we're making a very frequent interrupt slower because of debug 
| code.
|
| The thing is, many many smp_reschedule_interrupt() invocations don't
| actually execute anything much at all and are only sent to tickle the
| return to user path (which does the actual preemption).
| 
| Having to do the whole irq_enter/irq_exit dance just for this unlikely
| debug case totally blows.

Use the wrmsr_notrace() variant in native_apic_msr_eoi_write(), annotate the
KVM variant with notrace and add a native_eoi_write callback to the apic
structure so KVM guests are covered as well.

This allows reverting the irq_enter()/irq_exit() dance in
smp_reschedule_interrupt().

Suggested-by: Peter Zijlstra 
Suggested-by: Paolo Bonzini 
Signed-off-by: Wanpeng Li 
Acked-by: Paolo Bonzini 
Cc: k...@vger.kernel.org
Cc: Mike Galbraith 
Cc: Borislav Petkov 
Link: 
http://lkml.kernel.org/r/1478488420-5982-3-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner 

---
 arch/x86/include/asm/apic.h | 3 ++-
 arch/x86/kernel/apic/apic.c | 1 +
 arch/x86/kernel/kvm.c   | 4 ++--
 arch/x86/kernel/smp.c   | 2 --
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index f5aaf6c..a5a0bcf 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -196,7 +196,7 @@ static inline void native_apic_msr_write(u32 reg, u32 v)
 
 static inline void native_apic_msr_eoi_write(u32 reg, u32 v)
 {
-   wrmsr(APIC_BASE_MSR + (APIC_EOI >> 4), APIC_EOI_ACK, 0);
+   wrmsr_notrace(APIC_BASE_MSR + (APIC_EOI >> 4), APIC_EOI_ACK, 0);
 }
 
 static inline u32 native_apic_msr_read(u32 reg)
@@ -332,6 +332,7 @@ struct apic {
 * on write for EOI.
 */
void (*eoi_write)(u32 reg, u32 v);
+   void (*native_eoi_write)(u32 reg, u32 v);
u64 (*icr_read)(void);
void (*icr_write)(u32 low, u32 high);
void (*wait_icr_idle)(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 88c657b..2686894 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2263,6 +2263,7 @@ void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, 
u32 v))
for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
/* Should happen once for each apic */
WARN_ON((*drv)->eoi_write == eoi_write);
+   (*drv)->native_eoi_write = (*drv)->eoi_write;
(*drv)->eoi_write = eoi_write;
}
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..aad52f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -308,7 +308,7 @@ static void kvm_register_steal_time(void)
 
 static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
 
-static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
+static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
 {
/**
 * This relies on __test_and_clear_bit to modify the memory
@@ -319,7 +319,7 @@ static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
 */
if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
return;
-   apic_write(APIC_EOI, APIC_EOI_ACK);
+   apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
 
 static void kvm_guest_cpu_init(void)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index c00cb64..68f8cc2 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -261,10 +261,8 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
-   irq_enter();
ack_APIC_irq();
__smp_reschedule_interrupt();
-   irq_exit();
/*
 * KVM uses this interrupt to force a cpu out of guest mode
 */


[tip:x86/apic] x86/msr: Add wrmsr_notrace()

2016-11-09 Thread tip-bot for Wanpeng Li
Commit-ID:  b2c5ea4f759190ee9f75687a00035c1a66d0d743
Gitweb: http://git.kernel.org/tip/b2c5ea4f759190ee9f75687a00035c1a66d0d743
Author: Wanpeng Li 
AuthorDate: Mon, 7 Nov 2016 11:13:39 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 9 Nov 2016 22:03:14 +0100

x86/msr: Add wrmsr_notrace()

Required to remove the extra irq_enter()/irq_exit() in
smp_reschedule_interrupt().

Suggested-by: Peter Zijlstra 
Suggested-by: Paolo Bonzini 
Signed-off-by: Wanpeng Li 
Acked-by: Paolo Bonzini 
Reviewed-by: Borislav Petkov 
Cc: k...@vger.kernel.org
Cc: Mike Galbraith 
Link: 
http://lkml.kernel.org/r/1478488420-5982-2-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner 

---
 arch/x86/include/asm/msr.h | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index b5fee97..9b0a232 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -115,17 +115,29 @@ static inline unsigned long long 
native_read_msr_safe(unsigned int msr,
 }
 
 /* Can be uninlined because referenced by paravirt */
-notrace static inline void native_write_msr(unsigned int msr,
+static notrace inline void __native_write_msr_notrace(unsigned int msr,
unsigned low, unsigned high)
 {
asm volatile("1: wrmsr\n"
 "2:\n"
 _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
 : : "c" (msr), "a"(low), "d" (high) : "memory");
+}
+
+/* Can be uninlined because referenced by paravirt */
+static notrace inline void native_write_msr(unsigned int msr,
+   unsigned low, unsigned high)
+{
+   __native_write_msr_notrace(msr, low, high);
if (msr_tracepoint_active(__tracepoint_write_msr))
do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
 }
 
+static inline void wrmsr_notrace(unsigned msr, unsigned low, unsigned high)
+{
+   __native_write_msr_notrace(msr, low, high);
+}
+
 /* Can be uninlined because referenced by paravirt */
 notrace static inline int native_write_msr_safe(unsigned int msr,
unsigned low, unsigned high)


[tip:x86/urgent] x86/smp: Add irq_enter/exit() in smp_reschedule_interrupt()

2016-10-14 Thread tip-bot for Wanpeng Li
Commit-ID:  1ec6ec14a2943f6f611fc1d5fb2d4eaa85bd9d72
Gitweb: http://git.kernel.org/tip/1ec6ec14a2943f6f611fc1d5fb2d4eaa85bd9d72
Author: Wanpeng Li 
AuthorDate: Fri, 14 Oct 2016 09:48:53 +0800
Committer:  Thomas Gleixner 
CommitDate: Fri, 14 Oct 2016 14:14:20 +0200

x86/smp: Add irq_enter/exit() in smp_reschedule_interrupt()

 ===
 [ INFO: suspicious RCU usage. ]
 4.8.0+ #24 Not tainted
 ---
 ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
 
 other info that might help us debug this:
 
 RCU used illegally from idle CPU!
 rcu_scheduler_active = 1, debug_locks = 0
 RCU used illegally from extended quiescent state!
 no locks held by swapper/1/0.
 
  [] do_trace_write_msr+0x135/0x140
  [] native_write_msr+0x20/0x30
  [] native_apic_msr_eoi_write+0x1d/0x30
  [] smp_reschedule_interrupt+0x1d/0x30
  [] reschedule_interrupt+0x96/0xa0

The reschedule interrupt may be called in CPU idle state, which causes the
lockdep warning above.

Add irq_enter()/irq_exit() in smp_reschedule_interrupt(): irq_enter() tells
the RCU subsystem to end the extended quiescent state, so the following trace
call in ack_APIC_irq() works correctly.

Signed-off-by: Wanpeng Li 
Cc: Peter Zijlstra 
Cc: Mike Galbraith 
Link: 
http://lkml.kernel.org/r/1476409733-5133-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner 

---
 arch/x86/kernel/smp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 658777c..ac2ee87 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -259,8 +259,10 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
+   irq_enter();
ack_APIC_irq();
__smp_reschedule_interrupt();
+   irq_exit();
/*
 * KVM uses this interrupt to force a cpu out of guest mode
 */


[tip:sched/urgent] sched/fair: Fix sched domains NULL dereference in select_idle_sibling()

2016-10-11 Thread tip-bot for Wanpeng Li
Commit-ID:  9cfb38a7ba5a9c27c1af8093fb1af4b699c0a441
Gitweb: http://git.kernel.org/tip/9cfb38a7ba5a9c27c1af8093fb1af4b699c0a441
Author: Wanpeng Li 
AuthorDate: Sun, 9 Oct 2016 08:04:03 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 11 Oct 2016 10:40:06 +0200

sched/fair: Fix sched domains NULL dereference in select_idle_sibling()

Commit:

  10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")

... improved select_idle_sibling(), but also triggered a regression (crash)
during CPU-hotplug:

  BUG: unable to handle kernel NULL pointer dereference at 0078
  IP: [] select_idle_sibling+0x1c2/0x4f0
  Call Trace:
   
select_task_rq_fair+0x749/0x930
? select_task_rq_fair+0xb4/0x930
? __lock_is_held+0x54/0x70
try_to_wake_up+0x19a/0x5b0
default_wake_function+0x12/0x20
autoremove_wake_function+0x12/0x40
__wake_up_common+0x55/0x90
__wake_up+0x39/0x50
wake_up_klogd_work_func+0x40/0x60
irq_work_run_list+0x57/0x80
irq_work_run+0x2c/0x30
smp_irq_work_interrupt+0x2e/0x40
irq_work_interrupt+0x96/0xa0
   
? _raw_spin_unlock_irqrestore+0x45/0x80
try_to_wake_up+0x4a/0x5b0
wake_up_state+0x10/0x20
__kthread_unpark+0x67/0x70
kthread_unpark+0x22/0x30
cpuhp_online_idle+0x3e/0x70
cpu_startup_entry+0x6a/0x450
start_secondary+0x154/0x180

This can be reproduced by running the ftrace test case of kselftest: the
test case hot-unplugs a CPU, and the CPU is attached to the NULL
sched domain during scheduler teardown.

Step 2 of the select_idle_siblings() rewrite:

  | Step 2) tracks the average cost of the scan and compares this to the
  | average idle time guestimate for the CPU doing the wakeup.

If the CPU doing the wakeup is the CPU being hot-unplugged, a NULL
sched domain is dereferenced to acquire the average cost of the scan.

This patch fixes it by failing the search for an idle CPU in the LLC domain
if this sched domain is NULL.

Tested-by: Catalin Marinas 
Signed-off-by: Wanpeng Li 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 502e95a..8b03fb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5471,13 +5471,18 @@ static inline int select_idle_smt(struct task_struct 
*p, struct sched_domain *sd
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
-   struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-   u64 avg_idle = this_rq()->avg_idle;
-   u64 avg_cost = this_sd->avg_scan_cost;
+   struct sched_domain *this_sd;
+   u64 avg_cost, avg_idle = this_rq()->avg_idle;
u64 time, cost;
s64 delta;
int cpu, wrap;
 
+   this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+   if (!this_sd)
+   return -1;
+
+   avg_cost = this_sd->avg_scan_cost;
+
/*
 * Due to large variance we need a large fuzz factor; hackbench in
 * particularly is sensitive here.


[tip:x86/urgent] x86/entry/64: Fix context tracking state warning when load_gs_index fails

2016-09-30 Thread tip-bot for Wanpeng Li
Commit-ID:  2fa5f04f85730d0c4f49f984b7efeb4f8d5bd1fc
Gitweb: http://git.kernel.org/tip/2fa5f04f85730d0c4f49f984b7efeb4f8d5bd1fc
Author: Wanpeng Li 
AuthorDate: Fri, 30 Sep 2016 09:01:06 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 30 Sep 2016 13:53:12 +0200

x86/entry/64: Fix context tracking state warning when load_gs_index fails

This warning:

 WARNING: CPU: 0 PID: 3331 at arch/x86/entry/common.c:45 
enter_from_user_mode+0x32/0x50
 CPU: 0 PID: 3331 Comm: ldt_gdt_64 Not tainted 4.8.0-rc7+ #13
 Call Trace:
  dump_stack+0x99/0xd0
  __warn+0xd1/0xf0
  warn_slowpath_null+0x1d/0x20
  enter_from_user_mode+0x32/0x50
  error_entry+0x6d/0xc0
  ? general_protection+0x12/0x30
  ? native_load_gs_index+0xd/0x20
  ? do_set_thread_area+0x19c/0x1f0
  SyS_set_thread_area+0x24/0x30
  do_int80_syscall_32+0x7c/0x220
  entry_INT80_compat+0x38/0x50

... can be reproduced by running the GS testcase of the ldt_gdt test unit in
the x86 selftests.

do_int80_syscall_32() calls enter_from_user_mode() to switch the context
tracking state from user to kernel. The load_gs_index() call can fail with a
user gsbase; if that happens, gsbase is fixed up and execution proceeds.

However, enter_from_user_mode() is then called again on the fixup path, even
though context tracking is already in kernel state.

This patch fixes it by just fixing up gsbase and telling lockdep that IRQs
are off once load_gs_index() fails with a user gsbase.

Signed-off-by: Wanpeng Li 
Acked-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1475197266-3440-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/entry/entry_64.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d172c61..02fff3e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1002,7 +1002,6 @@ ENTRY(error_entry)
testb   $3, CS+8(%rsp)
jz  .Lerror_kernelspace
 
-.Lerror_entry_from_usermode_swapgs:
/*
 * We entered from user mode or we're pretending to have entered
 * from user mode due to an IRET fault.
@@ -1045,7 +1044,8 @@ ENTRY(error_entry)
 * gsbase and proceed.  We'll fix up the exception and land in
 * .Lgs_change's error handler with kernel gsbase.
 */
-   jmp .Lerror_entry_from_usermode_swapgs
+   SWAPGS
+   jmp .Lerror_entry_done
 
 .Lbstep_iret:
/* Fix truncated RIP */


[tip:x86/apic] x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()

2016-09-19 Thread tip-bot for Wanpeng Li
Commit-ID:  b0f48706a176b71a6e54f399d7404bbeeaa7cfab
Gitweb: http://git.kernel.org/tip/b0f48706a176b71a6e54f399d7404bbeeaa7cfab
Author: Wanpeng Li 
AuthorDate: Sun, 18 Sep 2016 19:34:51 +0800
Committer:  Thomas Gleixner 
CommitDate: Tue, 20 Sep 2016 00:31:19 +0200

x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()

===
[ INFO: suspicious RCU usage. ]
4.8.0-rc6+ #5 Not tainted
---
./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
no locks held by swapper/2/0.

stack backtrace:
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0-rc6+ #5
Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
  8d1bd6003f10 94446949 8d1bd4a68000
 0001 8d1bd6003f40 940e9247 8d1bbdfcf3d0
 080b   8d1bd6003f70
Call Trace:
   [] dump_stack+0x99/0xd0
 [] lockdep_rcu_suspicious+0xe7/0x120
 [] do_trace_write_msr+0x135/0x140
 [] native_write_msr+0x20/0x30
 [] native_apic_msr_eoi_write+0x1d/0x30
 [] smp_trace_call_function_interrupt+0x1e/0x270
 [] trace_call_function_interrupt+0x96/0xa0
   [] ? cpuidle_enter_state+0xe4/0x360
 [] ? cpuidle_enter_state+0xcf/0x360
 [] cpuidle_enter+0x17/0x20
 [] cpu_startup_entry+0x338/0x4d0
 [] start_secondary+0x154/0x180

This can be reproduced readily by running the ftrace test case of kselftest.

Move the irq_enter() call before ack_APIC_irq(), because irq_enter() tells
the RCU subsystem to end the extended quiescent state, so that the following
trace call in ack_APIC_irq() works correctly. The same applies to
exiting_ack_irq(), which calls ack_APIC_irq() after irq_exit().

[ tglx: Massaged changelog ]

Signed-off-by: Wanpeng Li 
Cc: Peter Zijlstra 
Cc: Wanpeng Li 
Link: 
http://lkml.kernel.org/r/1474198491-3738-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner 
---
 arch/x86/include/asm/apic.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 1243577..f5aaf6c 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -650,8 +650,8 @@ static inline void entering_ack_irq(void)
 
 static inline void ipi_entering_ack_irq(void)
 {
-   ack_APIC_irq();
irq_enter();
+   ack_APIC_irq();
 }
 
 static inline void exiting_irq(void)
@@ -661,9 +661,8 @@ static inline void exiting_irq(void)
 
 static inline void exiting_ack_irq(void)
 {
-   irq_exit();
-   /* Ack only at the end to avoid potential reentry */
ack_APIC_irq();
+   irq_exit();
 }
 
 extern void ioapic_zap_locks(void);


[tip:timers/core] tick/nohz: Prevent stopping the tick on an offline CPU

2016-09-13 Thread tip-bot for Wanpeng Li
Commit-ID:  57ccdf449f962ab5fc8cbf26479402f13bdb8be7
Gitweb: http://git.kernel.org/tip/57ccdf449f962ab5fc8cbf26479402f13bdb8be7
Author: Wanpeng Li 
AuthorDate: Wed, 7 Sep 2016 18:51:13 +0800
Committer:  Thomas Gleixner 
CommitDate: Tue, 13 Sep 2016 17:53:52 +0200

tick/nohz: Prevent stopping the tick on an offline CPU

can_stop_full_tick() has no check for offline CPUs. So it allows stopping the
tick on an offline CPU from the interrupt return path, which is wrong and
subsequently makes irq_work_needs_cpu() warn about being called for an
offline CPU.

Commit f7ea0fd639c2c4 ("tick: Don't invoke tick_nohz_stop_sched_tick() if
the cpu is offline") added prevention for can_stop_idle_tick(), but forgot
to do the same in can_stop_full_tick(). Add it.

[ tglx: Massaged changelog ]

Signed-off-by: Wanpeng Li 
Cc: Peter Zijlstra 
Cc: Frederic Weisbecker 
Link: 
http://lkml.kernel.org/r/1473245473-4463-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner 

---
 kernel/time/tick-sched.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2ec7c00..3bcb61b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -186,10 +186,13 @@ static bool check_tick_dependency(atomic_t *dep)
return false;
 }
 
-static bool can_stop_full_tick(struct tick_sched *ts)
+static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
 {
WARN_ON_ONCE(!irqs_disabled());
 
+   if (unlikely(!cpu_online(cpu)))
+   return false;
+
if (check_tick_dependency(&tick_dep_mask))
return false;
 
@@ -843,7 +846,7 @@ static void tick_nohz_full_update_tick(struct tick_sched 
*ts)
if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
return;
 
-   if (can_stop_full_tick(ts))
+   if (can_stop_full_tick(cpu, ts))
tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
else if (ts->tick_stopped)
tick_nohz_restart_sched_tick(ts, ktime_get());


[tip:sched/core] sched/deadline: Fix the intention to re-evalute tick dependency for offline CPU

2016-09-05 Thread tip-bot for Wanpeng Li
Commit-ID:  61c7aca695b6fabe85d0fc424fe8ae2f66f267dd
Gitweb: http://git.kernel.org/tip/61c7aca695b6fabe85d0fc424fe8ae2f66f267dd
Author: Wanpeng Li 
AuthorDate: Wed, 31 Aug 2016 18:27:44 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 5 Sep 2016 13:29:45 +0200

sched/deadline: Fix the intention to re-evalute tick dependency for offline CPU

The dl task is replenished after the dl task timer fires and starts a
new period. It is then enqueued, and its dependency on the tick is
re-evaluated in order to restart it. However, if the CPU has been
hot-unplugged, irq_work_queue() warns since the target CPU is offline.

As a result we get:

WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0
Call Trace:
 dump_stack+0x99/0xd0
 __warn+0xd1/0xf0
 warn_slowpath_null+0x1d/0x20
 irq_work_queue_on+0xad/0xe0
 tick_nohz_full_kick_cpu+0x44/0x50
 tick_nohz_dep_set_cpu+0x74/0xb0
 enqueue_task_dl+0x226/0x480
 activate_task+0x5c/0xa0
 dl_task_timer+0x19b/0x2c0
 ? push_dl_task.part.31+0x190/0x190

This can be triggered by hot-unplugging the full dynticks CPU that the dl
task is running on.

We enqueue the dl task on the offline CPU because we need to do the
replenishment for start_dl_timer(). So, as Juri pointed out, what we would
need to do is call replenish_dl_entity() directly instead of
enqueue_task_dl(). pi_se shouldn't be a problem, as the task shouldn't be
boosted if it was throttled.

This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story:
first migrate the task (set_task_cpu()) and then do a single enqueue.

Suggested-by: Peter Zijlstra 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Frederic Weisbecker 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 46 ++
 1 file changed, 18 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 18fb0b8..0c75bc6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -243,10 +243,8 @@ static struct rq *find_lock_later_rq(struct task_struct 
*task, struct rq *rq);
 static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
 {
struct rq *later_rq = NULL;
-   bool fallback = false;
 
later_rq = find_lock_later_rq(p, rq);
-
if (!later_rq) {
int cpu;
 
@@ -254,7 +252,6 @@ static struct rq *dl_task_offline_migration(struct rq *rq, 
struct task_struct *p
 * If we cannot preempt any rq, fall back to pick any
 * online cpu.
 */
-   fallback = true;
cpu = cpumask_any_and(cpu_active_mask, tsk_cpus_allowed(p));
if (cpu >= nr_cpu_ids) {
/*
@@ -274,16 +271,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, 
struct task_struct *p
double_lock_balance(rq, later_rq);
}
 
-   /*
-* By now the task is replenished and enqueued; migrate it.
-*/
-   deactivate_task(rq, p, 0);
set_task_cpu(p, later_rq->cpu);
-   activate_task(later_rq, p, 0);
-
-   if (!fallback)
-   resched_curr(later_rq);
-
double_unlock_balance(later_rq, rq);
 
return later_rq;
@@ -641,29 +629,31 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer 
*timer)
goto unlock;
}
 
-   enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
-   if (dl_task(rq->curr))
-   check_preempt_curr_dl(rq, p, 0);
-   else
-   resched_curr(rq);
-
 #ifdef CONFIG_SMP
-   /*
-* Perform balancing operations here; after the replenishments.  We
-* cannot drop rq->lock before this, otherwise the assertion in
-* start_dl_timer() about not missing updates is not true.
-*
-* If we find that the rq the task was on is no longer available, we
-* need to select a new rq.
-*
-* XXX figure out if select_task_rq_dl() deals with offline cpus.
-*/
if (unlikely(!rq->online)) {
+   /*
+* If the runqueue is no longer available, migrate the
+* task elsewhere. This necessarily changes rq.
+*/
lockdep_unpin_lock(&rq->lock, rf.cookie);
rq = dl_task_offline_migration(rq, p);
rf.cookie = lockdep_pin_lock(&rq->lock);
+
+   /*
+* Now that the task has been migrated to the new RQ and we
+* have that locked, proceed as normal and enqueue the task
+* there.
+*/
}
+#endif
 
+   enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+   if (dl_task(rq->curr))
+   check_preempt_curr_dl(rq, p, 0);
+   else
+   

[tip:timers/urgent] tick/nohz: Fix softlockup on scheduler stalls in kvm guest

2016-09-02 Thread tip-bot for Wanpeng Li
Commit-ID:  08d072599234c959b0b82b63fa252c129225a899
Gitweb: http://git.kernel.org/tip/08d072599234c959b0b82b63fa252c129225a899
Author: Wanpeng Li 
AuthorDate: Fri, 2 Sep 2016 14:38:23 +0800
Committer:  Thomas Gleixner 
CommitDate: Fri, 2 Sep 2016 10:25:40 +0200

tick/nohz: Fix softlockup on scheduler stalls in kvm guest

Since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter"),
tick_nohz_start_idle() is not called if the idle tick can't be stopped. As a
result, after suspending/resuming the host machine, a full dynticks KVM
guest will soft lockup:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
 Call Trace:
  default_idle+0x31/0x1a0
  arch_cpu_idle+0xf/0x20
  default_idle_call+0x2a/0x50
  cpu_startup_entry+0x39b/0x4d0
  rest_init+0x138/0x140
  ? rest_init+0x5/0x140
  start_kernel+0x4c1/0x4ce
  ? set_init_arg+0x55/0x55
  ? early_idt_handler_array+0x120/0x120
  x86_64_start_reservations+0x24/0x26
  x86_64_start_kernel+0x142/0x14f

In addition, cat /proc/stat | grep cpu in guest or host:

cpu  398 16 5049 15754 5490 0 1 46 0 0
cpu0 206 5 450 0 0 0 1 14 0 0
cpu1 81 0 3937 3149 1514 0 0 9 0 0
cpu2 45 6 332 6052 2243 0 0 11 0 0
cpu3 65 2 328 6552 1732 0 0 11 0 0

The idle and iowait times are a bogus 0 for cpu0 (the housekeeping CPU).

The bug is present in both guest and host kernels, and both show the cpu0
idle and iowait accounting issue; however, the host kernel's suspend/resume
path etc. touches the watchdog, which avoids the softlockup.

- The watchdog is not touched in the tick_nohz_stop_idle() path (it needs to
  be touched, since the scheduler stall is expected) if the idle_active flag
  is not set.
- The idle and iowait times are not accounted when exiting the idle loop
  (resched or interrupt) if the idle start time and the idle_active flag are
  not set.

This patch fixes it by reverting commit 1f3b0f8243cb934, since being unable
to stop the idle tick doesn't mean the CPU can't be idle.

Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
Signed-off-by: Wanpeng Li 
Cc: Sanjeev Yadav
Cc: Gaurav Jindal
Cc: sta...@vger.kernel.org
Cc: k...@vger.kernel.org
Cc: Radim Krčmář 
Cc: Peter Zijlstra 
Cc: Paolo Bonzini 
Link: 
http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner 

---
 kernel/time/tick-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 204fdc8..2ec7c00 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -908,10 +908,11 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
ktime_t now, expires;
int cpu = smp_processor_id();
 
+   now = tick_nohz_start_idle(ts);
+
if (can_stop_idle_tick(cpu, ts)) {
int was_stopped = ts->tick_stopped;
 
-   now = tick_nohz_start_idle(ts);
ts->idle_calls++;
 
expires = tick_nohz_stop_sched_tick(ts, now, cpu);


[tip:x86/urgent] x86/apic: Do not init irq remapping if ioapic is disabled

2016-08-24 Thread tip-bot for Wanpeng Li
Commit-ID:  2e63ad4bd5dd583871e6602f9d398b9322d358d9
Gitweb: http://git.kernel.org/tip/2e63ad4bd5dd583871e6602f9d398b9322d358d9
Author: Wanpeng Li 
AuthorDate: Tue, 23 Aug 2016 20:07:19 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 24 Aug 2016 09:45:40 +0200

x86/apic: Do not init irq remapping if ioapic is disabled

native_smp_prepare_cpus
  -> default_setup_apic_routing
-> enable_IR_x2apic
  -> irq_remapping_prepare
-> intel_prepare_irq_remapping
  -> intel_setup_irq_remapping

So the IR table is set up even if the "noapic" boot parameter is given. As a
result we crash later, when the interrupt affinity is set, due to a
half-initialized remapping infrastructure.

Prevent remap initialization when IOAPIC is disabled.

Signed-off-by: Wanpeng Li 
Cc: Peter Zijlstra 
Cc: Joerg Roedel 
Link: 
http://lkml.kernel.org/r/1471954039-3942-1-git-send-email-wanpeng...@hotmail.com
Cc: sta...@vger.kernel.org
Signed-off-by: Thomas Gleixner 

---
 arch/x86/kernel/apic/apic.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index cea4fc1..50c95af 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1623,6 +1623,9 @@ void __init enable_IR_x2apic(void)
unsigned long flags;
int ret, ir_stat;
 
+   if (skip_ioapic_setup)
+   return;
+
ir_stat = irq_remapping_prepare();
if (ir_stat < 0 && !x2apic_supported())
return;


[tip:sched/core] sched/cputime: Resync steal time when guest & host lose sync

2016-08-18 Thread tip-bot for Wanpeng Li
Commit-ID:  03cbc732639ddcad15218c4b2046d255851ff1e3
Gitweb: http://git.kernel.org/tip/03cbc732639ddcad15218c4b2046d255851ff1e3
Author: Wanpeng Li 
AuthorDate: Wed, 17 Aug 2016 10:05:46 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 18 Aug 2016 11:19:48 +0200

sched/cputime: Resync steal time when guest & host lose sync

Commit:

  57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

... fixed a bug but also triggered a regression:

On an i5 laptop with 4 pCPUs and 4 vCPUs for one full dynticks guest, and
four CPU-hog processes (for loops) running in the guest: hot-unplug the
pCPUs on the host one by one until only one is left, then observe the CPU
utilization via 'top' in the guest. It shows:

  100% st for cpu0(housekeeping)
   75% st for other CPUs (nohz full mode)

However, w/o this commit it shows the correct 75% for all four CPUs.

When a guest is interrupted for a longer amount of time, missed clock ticks
are not redelivered later. Because of that, we should not limit the amount
of steal time accounted to the amount of time that the calling functions
think have passed.

However, the interval returned by account_other_time() is NOT rounded down
to the nearest jiffy, while the base interval in get_vtime_delta() that it is
subtracted from is, so the max cputime limit is required to avoid underflow.

This patch fixes the regression by limiting the account_other_time() from
get_vtime_delta() to avoid underflow, and lets the other three call sites
(in account_other_time() and steal_account_process_time()) account however
much steal time the host told us elapsed.
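
A hypothetical worked example of the underflow this guards against (numbers
made up): suppose 1.9 jiffies of wall time have passed since the last vtime
snapshot. get_vtime_delta() computes a delta rounded down to 1 jiffy, while
the host may report, say, 1.5 jiffies of steal time for the same window.
Without passing delta as the limit, account_other_time() could return more
than delta and "delta - other" would go negative; with the limit, other is
capped at 1 jiffy and the accounted vtime stays non-negative.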

Suggested-by: Rik van Riel 
Suggested-by: Paolo Bonzini 
Signed-off-by: Wanpeng Li 
Reviewed-by: Rik van Riel 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Radim Krcmar 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Link: 
http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng...@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cputime.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2ee83b2..a846cf8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -263,6 +263,11 @@ void account_idle_time(cputime_t cputime)
cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
+/*
+ * When a guest is interrupted for a longer amount of time, missed clock
+ * ticks are not redelivered later. Due to that, this function may on
+ * occasion account more time than the calling functions think elapsed.
+ */
 static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
@@ -371,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct 
*p, int user_tick,
 * idle, or potentially user or system time. Due to rounding,
 * other time can exceed ticks occasionally.
 */
-   other = account_other_time(cputime);
+   other = account_other_time(ULONG_MAX);
if (other >= cputime)
return;
cputime -= other;
@@ -486,7 +491,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
}
 
cputime = cputime_one_jiffy;
-   steal = steal_account_process_time(cputime);
+   steal = steal_account_process_time(ULONG_MAX);
 
if (steal >= cputime)
return;
@@ -516,7 +521,7 @@ void account_idle_ticks(unsigned long ticks)
}
 
cputime = jiffies_to_cputime(ticks);
-   steal = steal_account_process_time(cputime);
+   steal = steal_account_process_time(ULONG_MAX);
 
if (steal >= cputime)
return;
@@ -699,6 +704,13 @@ static cputime_t get_vtime_delta(struct task_struct *tsk)
unsigned long now = READ_ONCE(jiffies);
cputime_t delta, other;
 
+   /*
+* Unlike tick based timing, vtime based timing never has lost
+* ticks, and no need for steal time accounting to make up for
+* lost ticks. Vtime accounts a rounded version of actual
+* elapsed time. Limit account_other_time to prevent rounding
+* errors from causing elapsed vtime to go negative.
+*/
delta = jiffies_to_cputime(now - tsk->vtime_snap);
other = account_other_time(delta);
WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);


[tip:sched/urgent] sched/cputime: Fix steal time accounting

2016-08-11 Thread tip-bot for Wanpeng Li
Commit-ID:  f9bcf1e0e0145323ba2cf72ecad5264ff3883eb1
Gitweb: http://git.kernel.org/tip/f9bcf1e0e0145323ba2cf72ecad5264ff3883eb1
Author: Wanpeng Li 
AuthorDate: Thu, 11 Aug 2016 13:36:35 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 11 Aug 2016 11:02:14 +0200

sched/cputime: Fix steal time accounting

Commit:

  57430218317 ("sched/cputime: Count actually elapsed irq & softirq time")

... didn't take steal time into account when the noirqtime kernel parameter
is passed.

As Paolo pointed out before:

| Why not? If idle=poll, for example, any time the guest is suspended (and
| thus cannot poll) does count as stolen time.

This patch fixes it by subtracting steal time from idle time accounting when
the noirqtime parameter is set. The average idle time drops from 56.8% to
54.75% for a nohz idle KVM guest (noirqtime, idle=poll, four vCPUs running
on one pCPU).
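
A hypothetical worked example (numbers made up, assuming a nanosecond-granular
cputime_t): for a 1,000,000 ns idle jiffy during which the host reports
400,000 ns of steal, the change accounts 600,000 ns as idle; if the reported
steal meets or exceeds the full jiffy, nothing is accounted as idle at all.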

Signed-off-by: Wanpeng Li 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra (Intel) 
Cc: Peter Zijlstra 
Cc: Radim 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1470893795-3527-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cputime.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 1934f65..8b9bcc5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -508,13 +508,20 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
  */
 void account_idle_ticks(unsigned long ticks)
 {
-
+   cputime_t cputime, steal;
if (sched_clock_irqtime) {
irqtime_account_idle_ticks(ticks);
return;
}
 
-   account_idle_time(jiffies_to_cputime(ticks));
+   cputime = cputime_one_jiffy;
+   steal = steal_account_process_time(cputime);
+
+   if (steal >= cputime)
+   return;
+
+   cputime -= steal;
+   account_idle_time(cputime);
 }
 
 /*


[tip:locking/core] locking/pvqspinlock: Fix double hash race

2016-08-10 Thread tip-bot for Wanpeng Li
Commit-ID:  229ce631574761870a2ac938845fadbd07f35caa
Gitweb: http://git.kernel.org/tip/229ce631574761870a2ac938845fadbd07f35caa
Author: Wanpeng Li 
AuthorDate: Thu, 14 Jul 2016 16:15:56 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 10 Aug 2016 14:13:28 +0200

locking/pvqspinlock: Fix double hash race

When the lock holder vCPU is racing with the queue head:

   CPU 0 (lock holder)            CPU1 (queue head)
   ===================            =================
   spin_lock();                   spin_lock();
    pv_kick_node():                pv_wait_head_or_lock():
                                    if (!lp) {
                                     lp = pv_hash(lock, pn);
                                     xchg(&l->locked, _Q_SLOW_VAL);
                                    }
                                    WRITE_ONCE(pn->state, vcpu_halted);
     cmpxchg(&pn->state,
      vcpu_halted, vcpu_hashed);
                                    WRITE_ONCE(l->locked, _Q_SLOW_VAL);
                                    (void)pv_hash(lock, pn);

In this case, the lock holder inserts the queue head's pv_node into the
hash table and sets _Q_SLOW_VAL unnecessarily. This patch avoids that by
restoring/setting the vcpu_hashed state after adaptive lock spinning fails.
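
The fix works because pv_kick_node() on the lock holder side only hashes the
lock when it can transition the queue head's state from vcpu_halted (a
simplified excerpt of that check, not the full function):

    if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
            return;         /* queue head already hashed the lock itself */
    /* only reached for vcpu_halted: hash the lock and set _Q_SLOW_VAL */

With the queue head now writing vcpu_hashed before pv_wait(), the cmpxchg
fails and the lock holder skips the second pv_hash()/_Q_SLOW_VAL write.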

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Pan Xinhui 
Cc: Andrew Morton 
Cc: Davidlohr Bueso 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Link: 
http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/locking/qspinlock_paravirt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 37649e6..8a99abf 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -450,7 +450,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct 
mcs_spinlock *node)
goto gotlock;
}
}
-   WRITE_ONCE(pn->state, vcpu_halted);
+   WRITE_ONCE(pn->state, vcpu_hashed);
qstat_inc(qstat_pv_wait_head, true);
qstat_inc(qstat_pv_wait_again, waitcnt);
pv_wait(&l->locked, _Q_SLOW_VAL);


[tip:sched/core] sched/deadline: Fix lock pinning warning during CPU hotplug

2016-08-10 Thread tip-bot for Wanpeng Li
Commit-ID:  c0c8c9fa210c9a042060435f17e40ba4a76d6d6f
Gitweb: http://git.kernel.org/tip/c0c8c9fa210c9a042060435f17e40ba4a76d6d6f
Author: Wanpeng Li 
AuthorDate: Thu, 4 Aug 2016 09:42:20 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 10 Aug 2016 14:02:55 +0200

sched/deadline: Fix lock pinning warning during CPU hotplug

The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:

  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 
lock_release+0x690/0x6a0
  releasing a pinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   ? dl_task_timer+0x1a1/0x2b0
   warn_slowpath_fmt+0x4f/0x60
   ? sched_clock+0x13/0x20
   lock_release+0x690/0x6a0
   ? enqueue_pushable_dl_task+0x9b/0xa0
   ? enqueue_task_dl+0x1ca/0x480
   _raw_spin_unlock+0x1f/0x40
   dl_task_timer+0x1a1/0x2b0
   ? push_dl_task.part.31+0x190/0x190
  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 
lock_unpin_lock+0x181/0x1a0
  unpinning an unpinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   warn_slowpath_fmt+0x4f/0x60
   lock_unpin_lock+0x181/0x1a0
   dl_task_timer+0x127/0x2b0
   ? push_dl_task.part.31+0x190/0x190

As per the comment before this code, it's safe to drop the rq lock here, and
since we (potentially) change rq, unpin and repin to avoid the splat.

Signed-off-by: Wanpeng Li 
[ Rewrote changelog. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fcb7f02..1ce8867 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -658,8 +658,11 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer 
*timer)
 *
 * XXX figure out if select_task_rq_dl() deals with offline cpus.
 */
-   if (unlikely(!rq->online))
+   if (unlikely(!rq->online)) {
+   lockdep_unpin_lock(&rq->lock, rf.cookie);
rq = dl_task_offline_migration(rq, p);
+   rf.cookie = lockdep_pin_lock(&rq->lock);
+   }
 
/*
 * Queueing this task back might have overloaded rq, check if we need


[tip:x86/timers] x86/tsc_msr: Fix rdmsr(MSR_PLATFORM_INFO) unsafe warning in KVM guest

2016-07-11 Thread tip-bot for Wanpeng Li
Commit-ID:  37c528ee1af7f24eb31f4195b8b7d4f23e6c716d
Gitweb: http://git.kernel.org/tip/37c528ee1af7f24eb31f4195b8b7d4f23e6c716d
Author: Wanpeng Li 
AuthorDate: Wed, 22 Jun 2016 09:28:28 +0800
Committer:  Ingo Molnar 
CommitDate: Mon, 11 Jul 2016 09:20:36 +0200

x86/tsc_msr: Fix rdmsr(MSR_PLATFORM_INFO) unsafe warning in KVM guest

After this commit:

  fc273eeef314 ("x86/tsc_msr: Extend to include Intel Core Architecture")

The following unsafe MSR reading warning triggers:

  WARNING: CPU: 0 PID: 0 at arch/x86/mm/extable.c:50 
ex_handler_rdmsr_unsafe+0x6a/0x70
  unchecked MSR access error: RDMSR from 0xce
Call Trace:
   dump_stack+0x67/0x99
   __warn+0xd1/0xf0
   warn_slowpath_fmt+0x4f/0x60
   ex_handler_rdmsr_unsafe+0x6a/0x70
   fixup_exception+0x39/0x50
   do_general_protection+0x93/0x1b0
   general_protection+0x22/0x30
   ? cpu_khz_from_msr+0xd8/0x1c0
   native_calibrate_cpu+0x30/0x5b0
   tsc_init+0x2b/0x297
   x86_late_time_init+0xf/0x11
   start_kernel+0x398/0x451
   ? set_init_arg+0x55/0x55
   x86_64_start_reservations+0x2f/0x31
   x86_64_start_kernel+0xea/0xed

As Radim pointed out before:

| MSR_PLATFORM_INFO: Intel changes it from family to family and there is
| no obvious overlap or default.  If we picked 0 (any other fixed value),
| then the guest would have to know that 0 doesn't mean that
| MSR_PLATFORM_INFO returned 0, but that KVM doesn't emulate this MSR and
| the value cannot be used.  This is very similar to handling a #GP in the
| guest, but also has a disadvantage, because KVM cannot say that
| MSR_PLATFORM_INFO is 0.  Simple emulation is not possible.

Fix it by using rdmsr_safe(MSR_PLATFORM_INFO) in the KVM guest so that
no #GP is triggered; the TSC will then be calibrated by a fallback
method: PIT, HPET, etc.

Reported-by: kernel test robot 
Signed-off-by: Wanpeng Li 
Acked-by: Paolo Bonzini 
Cc: Chen Yu 
Cc: H. Peter Anvin 
Cc: Len Brown 
Cc: Linus Torvalds 
Cc: Linux PM list 
Cc: Peter Zijlstra 
Cc: Radim Krčmář 
Cc: Rafael J. Wysocki 
Cc: Thomas Gleixner 
Cc: Zhang Rui 
Cc: jacob.jun@intel.com
Cc: k...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: l...@01.org
Link: 
http://lkml.kernel.org/r/1466558908-3524-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/tsc_msr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index e0c2b30..e6e465e 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -70,7 +70,7 @@ static int match_cpu(u8 family, u8 model)
  */
 unsigned long cpu_khz_from_msr(void)
 {
-   u32 lo, hi, ratio, freq_id, freq;
+   u32 lo, hi, freq_id, freq, ratio = 0;
unsigned long res;
int cpu_index;
 
@@ -123,8 +123,8 @@ unsigned long cpu_khz_from_msr(void)
}
 
 get_ratio:
-   rdmsr(MSR_PLATFORM_INFO, lo, hi);
-   ratio = (lo >> 8) & 0xff;
+   if (!rdmsr_safe(MSR_PLATFORM_INFO, &lo, &hi))
+   ratio = (lo >> 8) & 0xff;
 
 done:
/* TSC frequency = maximum resolved freq * maximum resolved bus ratio */
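
The shape of the fix, as a hedged standalone sketch (the MSR read is faked
and khz_from_ratio()/fake_rdmsr_safe() are made-up names): initialize the
ratio to 0 and only overwrite it when the guarded read succeeds, so an
unemulated MSR yields a 0 kHz result and the caller falls back to PIT/HPET
calibration:

  #include <stdio.h>

  /* Stand-in for rdmsr_safe(): returns 0 on success, non-zero when the
   * (virtual) CPU does not implement the MSR and would raise a #GP. */
  static int fake_rdmsr_safe(int implemented, unsigned int *lo, unsigned int *hi)
  {
          if (!implemented)
                  return -1;
          *lo = 0x1100;   /* made-up value: bus ratio 0x11 in bits 15:8 */
          *hi = 0;
          return 0;
  }

  static unsigned long khz_from_ratio(int msr_implemented, unsigned long freq)
  {
          unsigned int lo, hi, ratio = 0; /* default: ratio unknown */

          if (!fake_rdmsr_safe(msr_implemented, &lo, &hi))
                  ratio = (lo >> 8) & 0xff;

          return freq * ratio;    /* 0 => caller falls back to PIT/HPET */
  }

  int main(void)
  {
          printf("bare metal: %lu kHz\n", khz_from_ratio(1, 100000));
          printf("KVM guest:  %lu kHz\n", khz_from_ratio(0, 100000));
          return 0;
  }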


[tip:sched/core] sched/cputime: Add steal time support to full dynticks CPU time accounting

2016-06-14 Thread tip-bot for Wanpeng Li
Commit-ID:  807e5b80687c06715d62df51a5473b231e3e8b15
Gitweb: http://git.kernel.org/tip/807e5b80687c06715d62df51a5473b231e3e8b15
Author: Wanpeng Li 
AuthorDate: Mon, 13 Jun 2016 18:32:46 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 14 Jun 2016 11:13:16 +0200

sched/cputime: Add steal time support to full dynticks CPU time accounting

This patch adds guest steal-time support to full dynticks CPU
time accounting. After the following commit:

ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy 
granularity")

... time sampling became jiffy based, even if we do the sampling from the
context tracking code, so steal_account_process_tick() can be reused
to account how many 'ticks' of stolen time have accrued since the last
accumulation.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Paolo Bonzini 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Radim Krčmář 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1465813966-3116-4-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cputime.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..3d60e5d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline bool steal_account_process_tick(void)
+static __always_inline unsigned long steal_account_process_tick(unsigned long 
max_jiffies)
 {
 #ifdef CONFIG_PARAVIRT
if (static_key_false(¶virt_steal_enabled)) {
@@ -272,14 +272,14 @@ static __always_inline bool 
steal_account_process_tick(void)
 * time in jiffies. Lets cast the result to jiffies
 * granularity and account the rest on the next rounds.
 */
-   steal_jiffies = nsecs_to_jiffies(steal);
+   steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
 
account_steal_time(jiffies_to_cputime(steal_jiffies));
return steal_jiffies;
}
 #endif
-   return false;
+   return 0;
 }
 
 /*
@@ -346,7 +346,7 @@ static void irqtime_account_process_tick(struct task_struct 
*p, int user_tick,
u64 cputime = (__force u64) cputime_one_jiffy;
u64 *cpustat = kcpustat_this_cpu->cpustat;
 
-   if (steal_account_process_tick())
+   if (steal_account_process_tick(ULONG_MAX))
return;
 
cputime *= ticks;
@@ -477,7 +477,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
return;
}
 
-   if (steal_account_process_tick())
+   if (steal_account_process_tick(ULONG_MAX))
return;
 
if (user_tick)
@@ -681,12 +681,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
unsigned long now = READ_ONCE(jiffies);
-   unsigned long delta = now - tsk->vtime_snap;
+   unsigned long delta_jiffies, steal_jiffies;
 
+   delta_jiffies = now - tsk->vtime_snap;
+   steal_jiffies = steal_account_process_tick(delta_jiffies);
WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
tsk->vtime_snap = now;
 
-   return jiffies_to_cputime(delta);
+   return jiffies_to_cputime(delta_jiffies - steal_jiffies);
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
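
To make the effect of the new max_jiffies argument concrete, here is a small
userspace sketch (illustrative only; account_steal() is a made-up stand-in
for steal_account_process_tick()): stolen jiffies are capped at the caller's
elapsed window, so the remainder charged to the task can never go negative:

  #include <stdio.h>

  /* Never report more stolen jiffies than the caller's window allows. */
  static unsigned long account_steal(unsigned long steal, unsigned long max)
  {
          return steal < max ? steal : max;
  }

  int main(void)
  {
          unsigned long delta_jiffies = 5; /* jiffies since the last snapshot */
          unsigned long steal_jiffies = 8; /* stolen time reported by the host */

          unsigned long accounted = account_steal(steal_jiffies, delta_jiffies);

          /* Without the clamp this would underflow; with it, it stays >= 0. */
          printf("task/system jiffies = %lu\n", delta_jiffies - accounted);
          return 0;
  }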


[tip:sched/core] sched/cputime: Fix prev steal time accounting during CPU hotplug

2016-06-14 Thread tip-bot for Wanpeng Li
Commit-ID:  3d89e5478bf550a50c99e93adf659369798263b0
Gitweb: http://git.kernel.org/tip/3d89e5478bf550a50c99e93adf659369798263b0
Author: Wanpeng Li 
AuthorDate: Mon, 13 Jun 2016 18:32:45 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 14 Jun 2016 11:13:15 +0200

sched/cputime: Fix prev steal time accounting during CPU hotplug

Commit:

  e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")

... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
fix the case where (after CPU hotplug) steal time is smaller than
rq->prev_steal_time.

However, this should never happen. Steal time was only smaller because of the
KVM-specific bug fixed by the previous patch. Worse, commit e9532e69b8d1
itself triggers a bug on CPU hot-unplug/plug operations: because
rq->prev_steal_time is cleared, all of the CPU's past steal time will be
accounted again on hot-plug.

Since the root cause has been fixed, we can just revert commit e9532e69b8d1.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Paolo Bonzini 
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Radim Krčmář 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Fixes: e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
Link: 
http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  |  1 -
 kernel/sched/sched.h | 13 -
 2 files changed, 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 13d0896..c1b537b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7227,7 +7227,6 @@ static void sched_rq_cpu_starting(unsigned int cpu)
struct rq *rq = cpu_rq(cpu);
 
rq->calc_load_update = calc_load_update;
-   account_reset_rq(rq);
update_max_interval();
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f30..de607e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1809,16 +1809,3 @@ static inline void cpufreq_trigger_update(u64 time) {}
 #else /* arch_scale_freq_capacity */
 #define arch_scale_freq_invariant()(false)
 #endif
-
-static inline void account_reset_rq(struct rq *rq)
-{
-#ifdef CONFIG_IRQ_TIME_ACCOUNTING
-   rq->prev_irq_time = 0;
-#endif
-#ifdef CONFIG_PARAVIRT
-   rq->prev_steal_time = 0;
-#endif
-#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
-   rq->prev_steal_time_rq = 0;
-#endif
-}


[tip:sched/core] KVM: Fix steal clock warp during guest CPU hotplug

2016-06-14 Thread tip-bot for Wanpeng Li
Commit-ID:  2348140d58f4f4245e9635ea8f1a77e940a4d877
Gitweb: http://git.kernel.org/tip/2348140d58f4f4245e9635ea8f1a77e940a4d877
Author: Wanpeng Li 
AuthorDate: Mon, 13 Jun 2016 18:32:44 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 14 Jun 2016 11:13:14 +0200

KVM: Fix steal clock warp during guest CPU hotplug

Sometimes, after CPU hotplug you can observe a spike in stolen time
(100%) followed by the CPU being marked as 100% idle when it's actually
busy with a CPU hog task.  The trace looks like the following:

 cpuhp/1-12[001] d.h1   167.461657: account_process_tick: steal = 1291385514, prev_steal_time = 0
 cpuhp/1-12[001] d.h1   167.461659: account_process_tick: steal_jiffies = 1291
  -0 [001] d.h1   167.462663: account_process_tick: steal = 18732255, prev_steal_time = 129100
  -0 [001] d.h1   167.462664: account_process_tick: steal_jiffies = 18446744072437

The sudden decrease of "steal" causes steal_jiffies to underflow.
The root cause is kvm_steal_time being reset to 0 after hot-plugging
back in a CPU.  Instead, the preexisting value can be used, which is
what the core scheduler code expects.

John Stultz also reported a similar issue after guest S3.

Suggested-by: Paolo Bonzini 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Paolo Bonzini 
Cc: Frederic Weisbecker 
Cc: John Stultz 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Radim Krčmář 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1465813966-3116-2-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/kvm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index eea2a6f..1ef5e48 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -301,8 +301,6 @@ static void kvm_register_steal_time(void)
if (!has_steal_clock)
return;
 
-   memset(st, 0, sizeof(*st));
-
wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
pr_info("kvm-stealtime: cpu %d, msr %llx\n",
cpu, (unsigned long long) slow_virt_to_phys(st));
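
The bogus steal_jiffies value in the trace is the signature of unsigned
underflow. A minimal userspace sketch (illustrative numbers loosely
mirroring the trace, not kernel code) of what happens once the guest's
steal counter is reset below the already-accounted value:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t prev_steal_time = 1291000000; /* already accounted */
          uint64_t steal = 18732255;             /* counter after the reset */

          /* The accounting code takes the delta with unsigned arithmetic... */
          uint64_t delta = steal - prev_steal_time;

          /* ...so a counter that went backwards wraps to a huge bogus delta. */
          printf("delta = %llu\n", (unsigned long long)delta);
          return 0;
  }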


[tip:sched/core] sched/nohz: Fix affine unpinned timers mess

2016-05-12 Thread tip-bot for Wanpeng Li
Commit-ID:  444969223c81c7d0a95136b7b4cfdcfbc96ac5bd
Gitweb: http://git.kernel.org/tip/444969223c81c7d0a95136b7b4cfdcfbc96ac5bd
Author: Wanpeng Li 
AuthorDate: Wed, 4 May 2016 14:45:34 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 12 May 2016 09:55:32 +0200

sched/nohz: Fix affine unpinned timers mess

The following commit:

  9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")

intended to affine unpinned timers to housekeepers:

  unpinned timers (full dynticks, idle)  =>  nearest busy housekeeper (otherwise, fall back to any housekeeper)
  unpinned timers (full dynticks, busy)  =>  nearest busy housekeeper (otherwise, fall back to any housekeeper)
  unpinned timers (housekeeper, idle)    =>  nearest busy housekeeper (otherwise, fall back to itself)

However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
intention to:

  unpinned timers (full dynticks, idle)  =>  any housekeeper (no matter the CPU topology)
  unpinned timers (full dynticks, busy)  =>  any housekeeper (no matter the CPU topology)
  unpinned timers (housekeeper, idle)    =>  any busy CPU (otherwise, fall back to any housekeeper)

This patch fixes it by checking whether there is a busy housekeeper nearby,
otherwise falling back to any housekeeper/itself. After the patch:

  unpinned timers (full dynticks, idle)  =>  nearest busy housekeeper (otherwise, fall back to any housekeeper)
  unpinned timers (full dynticks, busy)  =>  nearest busy housekeeper (otherwise, fall back to any housekeeper)
  unpinned timers (housekeeper, idle)    =>  nearest busy housekeeper (otherwise, fall back to itself)

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
[ Fixed the changelog. ]
Cc: Frederic Weisbecker 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Fixes: 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")
Link: 
http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 636c4b9..6f6962a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -533,7 +533,10 @@ int get_nohz_timer_target(void)
rcu_read_lock();
for_each_domain(cpu, sd) {
for_each_cpu(i, sched_domain_span(sd)) {
-   if (!idle_cpu(i) && is_housekeeping_cpu(cpu)) {
+   if (cpu == i)
+   continue;
+
+   if (!idle_cpu(i) && is_housekeeping_cpu(i)) {
cpu = i;
goto unlock;
}
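
A standalone sketch of the selection rule after the fix (illustrative only;
the topology, flags and the nohz_timer_target() helper are made up): walk
the other CPUs nearest-first, prefer a busy housekeeping CPU, and otherwise
keep the fallback:

  #include <stdbool.h>
  #include <stdio.h>

  struct cpu { bool idle; bool housekeeping; };

  /* Mimics the sched-domain walk: skip @self, pick the first (nearest)
   * busy housekeeping CPU, otherwise stay on @fallback. */
  static int nohz_timer_target(const struct cpu *cpus, int nr,
                               int self, int fallback)
  {
          for (int i = 0; i < nr; i++) {
                  if (i == self)
                          continue;
                  if (!cpus[i].idle && cpus[i].housekeeping)
                          return i;
          }
          return fallback;
  }

  int main(void)
  {
          struct cpu cpus[] = {
                  { .idle = true,  .housekeeping = true  }, /* CPU0: idle housekeeper */
                  { .idle = false, .housekeeping = false }, /* CPU1: busy nohz_full   */
                  { .idle = false, .housekeeping = true  }, /* CPU2: busy housekeeper */
          };

          printf("target = %d\n", nohz_timer_target(cpus, 3, 0, 0)); /* -> 2 */
          return 0;
  }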


[tip:sched/core] sched/debug: Print out idle balance values even on !CONFIG_SCHEDSTATS kernels

2016-05-05 Thread tip-bot for Wanpeng Li
Commit-ID:  db6ea2fb094fb3a6afc36d3e4229bc162638ad24
Gitweb: http://git.kernel.org/tip/db6ea2fb094fb3a6afc36d3e4229bc162638ad24
Author: Wanpeng Li 
AuthorDate: Tue, 3 May 2016 12:38:25 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 5 May 2016 09:41:09 +0200

sched/debug: Print out idle balance values even on !CONFIG_SCHEDSTATS kernels

The max_idle_balance_cost and avg_idle values, which are tracked and are
used to capture short idle incidents, are not associated with schedstats;
however, these two values aren't printed out on !CONFIG_SCHEDSTATS kernels.

Fix this by moving the value printout out of the CONFIG_SCHEDSTATS section.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1462250305-4523-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/debug.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4fbc3bd..cf905f6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -626,15 +626,16 @@ do {  
\
 #undef P
 #undef PN
 
-#ifdef CONFIG_SCHEDSTATS
-#define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
-#define P64(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, rq->n);
-
 #ifdef CONFIG_SMP
+#define P64(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, rq->n);
P64(avg_idle);
P64(max_idle_balance_cost);
+#undef P64
 #endif
 
+#ifdef CONFIG_SCHEDSTATS
+#define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
+
if (schedstat_enabled()) {
P(yld_count);
P(sched_count);
@@ -644,7 +645,6 @@ do {
\
}
 
 #undef P
-#undef P64
 #endif
spin_lock_irqsave(&sched_debug_lock, flags);
print_cfs_stats(m, cpu);


[tip:sched/core] sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times

2016-04-28 Thread tip-bot for Wanpeng Li
Commit-ID:  594dd290cf5403a9a5818619dfff42d8e8e0518e
Gitweb: http://git.kernel.org/tip/594dd290cf5403a9a5818619dfff42d8e8e0518e
Author: Wanpeng Li 
AuthorDate: Fri, 22 Apr 2016 17:07:24 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 28 Apr 2016 10:39:54 +0200

sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times

Sometimes delta_exec is 0 because update_curr() is called multiple times;
this is captured by:

u64 delta_exec = rq_clock_task(rq) - curr->se.exec_start;

This patch optimizes the cpufreq update kicker by bailing out when nothing
has changed. It will benefit the upcoming schedutil governor, since otherwise
it would (over)react to the special util/max combination.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Peter Zijlstra 
Cc: Rafael J. Wysocki 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1461316044-9520-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 8 
 kernel/sched/rt.c   | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index affd97e..8f9b5af 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -717,10 +717,6 @@ static void update_curr_dl(struct rq *rq)
if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
 
-   /* Kick cpufreq (see the comment in linux/cpufreq.h). */
-   if (cpu_of(rq) == smp_processor_id())
-   cpufreq_trigger_update(rq_clock(rq));
-
/*
 * Consumed budget is computed considering the time as
 * observed by schedulable tasks (excluding time spent
@@ -736,6 +732,10 @@ static void update_curr_dl(struct rq *rq)
return;
}
 
+   /* kick cpufreq (see the comment in linux/cpufreq.h). */
+   if (cpu_of(rq) == smp_processor_id())
+   cpufreq_trigger_update(rq_clock(rq));
+
schedstat_set(curr->se.statistics.exec_max,
  max(curr->se.statistics.exec_max, delta_exec));
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c41ea7a..19e1306 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -953,14 +953,14 @@ static void update_curr_rt(struct rq *rq)
if (curr->sched_class != &rt_sched_class)
return;
 
-   /* Kick cpufreq (see the comment in linux/cpufreq.h). */
-   if (cpu_of(rq) == smp_processor_id())
-   cpufreq_trigger_update(rq_clock(rq));
-
delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
return;
 
+   /* Kick cpufreq (see the comment in linux/cpufreq.h). */
+   if (cpu_of(rq) == smp_processor_id())
+   cpufreq_trigger_update(rq_clock(rq));
+
schedstat_set(curr->se.statistics.exec_max,
  max(curr->se.statistics.exec_max, delta_exec));
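
The essence of the reordering, as a hedged standalone sketch (names are
illustrative, not the kernel code): compute delta_exec first and return
before the expensive cpufreq kick when nothing has run:

  #include <stdio.h>
  #include <stdint.h>

  static int kicks; /* counts how often the "cpufreq kick" fires */

  static void update_curr(uint64_t now, uint64_t *exec_start)
  {
          int64_t delta_exec = (int64_t)(now - *exec_start);

          if (delta_exec <= 0)
                  return;        /* called again with no progress: no kick */

          kicks++;               /* stand-in for cpufreq_trigger_update() */
          *exec_start = now;
  }

  int main(void)
  {
          uint64_t exec_start = 100;

          update_curr(150, &exec_start); /* real progress: one kick */
          update_curr(150, &exec_start); /* delta_exec == 0: bails out early */

          printf("kicks = %d\n", kicks); /* prints 1 */
          return 0;
  }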
 


[tip:timers/urgent] tick/nohz: Set the correct expiry when switching to nohz/lowres mode

2016-01-27 Thread tip-bot for Wanpeng Li
Commit-ID:  1ca8ec532fc2d986f1f4a319857bb18e0c9739b4
Gitweb: http://git.kernel.org/tip/1ca8ec532fc2d986f1f4a319857bb18e0c9739b4
Author: Wanpeng Li 
AuthorDate: Wed, 27 Jan 2016 19:26:07 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 27 Jan 2016 12:45:57 +0100

tick/nohz: Set the correct expiry when switching to nohz/lowres mode

commit 0ff53d096422 sets the next tick interrupt to the last jiffies update,
i.e. in the past, because the forward operation is invoked before the set
operation. There is no resulting damage (yet), but we get an extra pointless
tick interrupt.

Revert the order so we get the next tick interrupt in the future.

Fixes: 0ff53d096422 ("tick: sched: Force tick interrupt and get rid of softirq magic")
Signed-off-by: Wanpeng Li 
Cc: Peter Zijlstra 
Cc: Frederic Weisbecker 
Cc: sta...@vger.kernel.org
Link: 
http://lkml.kernel.org/r/1453893967-3458-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner 
---
 kernel/time/tick-sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index cbe5d8d..de2d9fe 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -995,9 +995,9 @@ static void tick_nohz_switch_to_nohz(void)
/* Get the next period */
next = tick_init_jiffy_update();
 
-   hrtimer_forward_now(&ts->sched_timer, tick_period);
hrtimer_set_expires(&ts->sched_timer, next);
-   tick_program_event(next, 1);
+   hrtimer_forward_now(&ts->sched_timer, tick_period);
+   tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
tick_nohz_activate(ts, NOHZ_MODE_LOWRES);
 }
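
The ordering issue is easiest to see with plain timestamps. A minimal sketch
(made-up numbers, not kernel code): programming the event at the start of
the current period lands in the past and fires immediately, while forwarding
in period steps first yields an expiry in the future:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t tick_period = 1000000;         /* 1 ms in ns (HZ=1000)    */
          uint64_t last_jiffies_update = 5000000; /* start of current period */
          uint64_t now = 5300000;                 /* 0.3 ms into the period  */

          /* Old order: the programmed expiry is the period start -> past. */
          uint64_t stale = last_jiffies_update;

          /* New order: set the expiry, then forward it past 'now'. */
          uint64_t expires = last_jiffies_update;
          while (expires <= now)
                  expires += tick_period;

          printf("stale expiry: %llu (<= now, extra pointless tick)\n",
                 (unsigned long long)stale);
          printf("forwarded:    %llu (in the future)\n",
                 (unsigned long long)expires);
          return 0;
  }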
 


[tip:sched/core] sched/deadline: Fix the earliest_dl.next logic

2016-01-06 Thread tip-bot for Wanpeng Li
Commit-ID:  7d92de3a8285ab3dfd68aa3a99823acd5b190444
Gitweb: http://git.kernel.org/tip/7d92de3a8285ab3dfd68aa3a99823acd5b190444
Author: Wanpeng Li 
AuthorDate: Thu, 3 Dec 2015 17:42:10 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 6 Jan 2016 11:05:56 +0100

sched/deadline: Fix the earliest_dl.next logic

earliest_dl.next should cache deadline of the earliest ready task that
is also enqueued in the pushable rbtree, as pull algorithm uses this
information to find candidates for migration: if the earliest_dl.next
deadline of source rq is earlier than the earliest_dl.curr deadline of
destination rq, the task from the source rq can be pulled.

However, the current implementation only guarantees that earliest_dl.next is
the deadline of the next ready task instead of the next pushable task,
which can result in taking both rqs' locks and finding nothing to
migrate because of affinity constraints. In addition, the current logic
doesn't update the next candidate for pushing in pick_next_task_dl(),
even if the running task is never eligible.

This patch fixes both problems by updating earliest_dl.next when a
pushable DL task is enqueued/dequeued, similar to what we already do for
RT.

Tested-by: Luca Abeni 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1449135730-27202-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 59 ++---
 1 file changed, 7 insertions(+), 52 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8b0a15e..cd64c97 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -176,8 +176,10 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct 
task_struct *p)
}
}
 
-   if (leftmost)
+   if (leftmost) {
dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
+   dl_rq->earliest_dl.next = p->dl.deadline;
+   }
 
rb_link_node(&p->pushable_dl_tasks, parent, link);
rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
@@ -195,6 +197,10 @@ static void dequeue_pushable_dl_task(struct rq *rq, struct 
task_struct *p)
 
next_node = rb_next(&p->pushable_dl_tasks);
dl_rq->pushable_dl_tasks_leftmost = next_node;
+   if (next_node) {
+   dl_rq->earliest_dl.next = rb_entry(next_node,
+   struct task_struct, 
pushable_dl_tasks)->dl.deadline;
+   }
}
 
rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
@@ -782,42 +788,14 @@ static void update_curr_dl(struct rq *rq)
 
 #ifdef CONFIG_SMP
 
-static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
-
-static inline u64 next_deadline(struct rq *rq)
-{
-   struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
-
-   if (next && dl_prio(next->prio))
-   return next->dl.deadline;
-   else
-   return 0;
-}
-
 static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 {
struct rq *rq = rq_of_dl_rq(dl_rq);
 
if (dl_rq->earliest_dl.curr == 0 ||
dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
-   /*
-* If the dl_rq had no -deadline tasks, or if the new task
-* has shorter deadline than the current one on dl_rq, we
-* know that the previous earliest becomes our next earliest,
-* as the new task becomes the earliest itself.
-*/
-   dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
dl_rq->earliest_dl.curr = deadline;
cpudl_set(&rq->rd->cpudl, rq->cpu, deadline, 1);
-   } else if (dl_rq->earliest_dl.next == 0 ||
-  dl_time_before(deadline, dl_rq->earliest_dl.next)) {
-   /*
-* On the other hand, if the new -deadline task has a
-* a later deadline than the earliest one on dl_rq, but
-* it is earlier than the next (if any), we must
-* recompute the next-earliest.
-*/
-   dl_rq->earliest_dl.next = next_deadline(rq);
}
 }
 
@@ -839,7 +817,6 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 
deadline)
 
entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
dl_rq->earliest_dl.curr = entry->deadline;
-   dl_rq->earliest_dl.next = next_deadline(rq);
cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline, 1);
}
 }
@@ -1274,28 +1251,6 @@ static int pick_dl_task(struct rq *rq, struct 
task_struct *p, int cpu)
return 0;
 }
 
-/* Returns the second earliest -deadline task, NULL otherwise */
-static struct task_st
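
A standalone sketch of the caching scheme the patch adopts (illustrative
only; the array-based pushable set and helper names are made up): keep the
cached earliest pushable deadline in sync at enqueue/dequeue time instead
of recomputing it from the ready queue:

  #include <stdio.h>
  #include <limits.h>

  #define MAXT 8
  static unsigned long long pushable[MAXT]; /* deadlines of pushable tasks */
  static int nr_pushable;
  static unsigned long long earliest_next = ULLONG_MAX;

  static void enqueue_pushable(unsigned long long deadline)
  {
          pushable[nr_pushable++] = deadline;
          if (deadline < earliest_next)   /* new leftmost: refresh the cache */
                  earliest_next = deadline;
  }

  static void dequeue_pushable(int idx)
  {
          pushable[idx] = pushable[--nr_pushable];
          earliest_next = ULLONG_MAX;     /* next leftmost becomes the cache */
          for (int i = 0; i < nr_pushable; i++)
                  if (pushable[i] < earliest_next)
                          earliest_next = pushable[i];
  }

  int main(void)
  {
          enqueue_pushable(300);
          enqueue_pushable(100);
          enqueue_pushable(200);
          printf("earliest pushable: %llu\n", earliest_next); /* 100 */
          dequeue_pushable(1);                                /* drop 100 */
          printf("earliest pushable: %llu\n", earliest_next); /* 200 */
          return 0;
  }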

[tip:sched/core] sched: 'Annotate' migrate_tasks()

2015-09-13 Thread tip-bot for Wanpeng Li
Commit-ID:  5473e0cc37c03c576adbda7591a6cc8e37c1bb7f
Gitweb: http://git.kernel.org/tip/5473e0cc37c03c576adbda7591a6cc8e37c1bb7f
Author: Wanpeng Li 
AuthorDate: Fri, 28 Aug 2015 14:55:56 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 11 Sep 2015 07:57:50 +0200

sched: 'Annotate' migrate_tasks()

Kernel testing triggered this warning:

| WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 
do_set_cpus_allowed+0x7e/0x80()
| Modules linked in:
| CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2
| Call Trace:
|   dump_stack+0x4b/0x75
|   warn_slowpath_common+0x8b/0xc0
|   warn_slowpath_null+0x22/0x30
|   do_set_cpus_allowed+0x7e/0x80
|   cpuset_cpus_allowed_fallback+0x7c/0x170
|   select_fallback_rq+0x221/0x280
|   migration_call+0xe3/0x250
|   notifier_call_chain+0x53/0x70
|   __raw_notifier_call_chain+0x1e/0x30
|   cpu_notify+0x28/0x50
|   take_cpu_down+0x22/0x40
|   multi_cpu_stop+0xd5/0x140
|   cpu_stopper_thread+0xbc/0x170
|   smpboot_thread_fn+0x174/0x2f0
|   kthread+0xc4/0xe0
|   ret_from_kernel_thread+0x21/0x30

As Peterz pointed out:

| So the normal rules for changing task_struct::cpus_allowed are holding
| both pi_lock and rq->lock, such that holding either stabilizes the mask.
|
| This is so that wakeup can happen without rq->lock and load-balance
| without pi_lock.
|
| From this we already get the relaxation that we can omit acquiring
| rq->lock if the task is not on the rq, because in that case
| load-balancing will not apply to it.
|
| ** these are the rules currently tested in do_set_cpus_allowed() **
|
| Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
| unconditionally acquires both locks, we could get away with holding just
| rq->lock when on_rq for modification because that'd still exclude
| __set_cpus_allowed_ptr(), it would also work against
| __kthread_bind_mask() because that assumes !on_rq.
|
| That said, this is all somewhat fragile.
|
| Now, I don't think dropping rq->lock is quite as disastrous as it
| usually is because !cpu_active at this point, which means load-balance
| will not interfere, but that too is somewhat fragile.
|
| So we end up with a choice of two fragile..

This patch fixes it by following the rules for changing
task_struct::cpus_allowed with both pi_lock and rq->lock held.

Reported-by: kernel test robot 
Reported-by: Sasha Levin 
Signed-off-by: Wanpeng Li 
[ Modified changelog and patch. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/blu436-smtp1660820490de202e3934ed380...@phx.gbl
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0902e4d..9b78670 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5183,24 +5183,47 @@ static void migrate_tasks(struct rq *dead_rq)
break;
 
/*
-* Ensure rq->lock covers the entire task selection
-* until the migration.
+* pick_next_task assumes pinned rq->lock.
 */
lockdep_pin_lock(&rq->lock);
next = pick_next_task(rq, &fake_task);
BUG_ON(!next);
next->sched_class->put_prev_task(rq, next);
 
+   /*
+* Rules for changing task_struct::cpus_allowed are holding
+* both pi_lock and rq->lock, such that holding either
+* stabilizes the mask.
+*
+* Drop rq->lock is not quite as disastrous as it usually is
+* because !cpu_active at this point, which means load-balance
+* will not interfere. Also, stop-machine.
+*/
+   lockdep_unpin_lock(&rq->lock);
+   raw_spin_unlock(&rq->lock);
+   raw_spin_lock(&next->pi_lock);
+   raw_spin_lock(&rq->lock);
+
+   /*
+* Since we're inside stop-machine, _nothing_ should have
+* changed the task, WARN if weird stuff happened, because in
+* that case the above rq->lock drop is a fail too.
+*/
+   if (WARN_ON(task_rq(next) != rq || !task_on_rq_queued(next))) {
+   raw_spin_unlock(&next->pi_lock);
+   continue;
+   }
+
/* Find suitable destination for @next, with force if needed. */
dest_cpu = select_fallback_rq(dead_rq->cpu, next);
 
-   lockdep_unpin_lock(&rq->lock);
rq = __migrate_task(rq, next, dest_cpu);
if (rq != dead_rq) {
raw_spin_unlock(&rq->lock);
rq = dead_rq;
raw_spin_lock(&rq->lock);
}
+   raw_spin_unlock(&next->pi_lock);
}
 
rq->stop = st
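
The locking dance described above can be boiled down to a hedged pthread
sketch (generic mutexes stand in for p->pi_lock and rq->lock; all names are
illustrative): drop the lock you hold, acquire both in the required order,
then re-validate the state that may have changed in between:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdbool.h>

  static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
  static int task_rq;     /* which "rq" the task is currently on */

  static bool lock_both_and_revalidate(int expected_rq)
  {
          /* We hold rq_lock, but pi_lock must come first: drop and retake. */
          pthread_mutex_unlock(&rq_lock);
          pthread_mutex_lock(&pi_lock);
          pthread_mutex_lock(&rq_lock);

          /* The world may have changed while neither lock was held. */
          return task_rq == expected_rq;
  }

  int main(void)
  {
          pthread_mutex_lock(&rq_lock);
          if (lock_both_and_revalidate(0))
                  printf("still on the expected rq; safe to proceed\n");
          pthread_mutex_unlock(&rq_lock);
          pthread_mutex_unlock(&pi_lock);
          return 0;
  }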

[tip:sched/core] sched: Remove superfluous resetting of the p->dl_throttled flag

2015-06-19 Thread tip-bot for Wanpeng Li
Commit-ID:  6713c3aa7f63626c0cecf9c509fb48d885b2dd12
Gitweb: http://git.kernel.org/tip/6713c3aa7f63626c0cecf9c509fb48d885b2dd12
Author: Wanpeng Li 
AuthorDate: Wed, 13 May 2015 14:01:06 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 19 Jun 2015 10:06:47 +0200

sched: Remove superfluous resetting of the p->dl_throttled flag

Resetting the p->dl_throttled flag in rt_mutex_setprio() (for a task that
is going to be boosted) is superfluous, as the natural place to do so is
in replenish_dl_entity().

If the task was on the runqueue and it is boosted by a DL task, it will be
enqueued back with the ENQUEUE_REPLENISH flag set, which guarantees that
dl_throttled is reset in replenish_dl_entity().

This patch drops the resetting of the throttled status in
rt_mutex_setprio().

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1431496867-4194-6-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1428c7c..10338ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3099,7 +3099,6 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
if (!dl_prio(p->normal_prio) ||
(pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
p->dl.dl_boosted = 1;
-   p->dl.dl_throttled = 0;
enqueue_flag = ENQUEUE_REPLENISH;
} else
p->dl.dl_boosted = 0;


[tip:sched/core] sched/deadline: Drop duplicate init_sched_dl_class() declaration

2015-06-19 Thread tip-bot for Wanpeng Li
Commit-ID:  178a4d23e4e6a0a90b086dad86697676b49db60a
Gitweb: http://git.kernel.org/tip/178a4d23e4e6a0a90b086dad86697676b49db60a
Author: Wanpeng Li 
AuthorDate: Wed, 13 May 2015 14:01:05 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 19 Jun 2015 10:06:47 +0200

sched/deadline: Drop duplicate init_sched_dl_class() declaration

There are two init_sched_dl_class() declarations; this patch drops
the duplicate.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1431496867-4194-5-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/sched.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d854555..d62b288 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1290,7 +1290,6 @@ extern void update_max_interval(void);
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
-extern void init_sched_dl_class(void);
 
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);


[tip:sched/core] sched/deadline: Make init_sched_dl_class() __init

2015-06-19 Thread tip-bot for Wanpeng Li
Commit-ID:  a6c0e746fb8f4ea6508f274314378325a6e1ec9b
Gitweb: http://git.kernel.org/tip/a6c0e746fb8f4ea6508f274314378325a6e1ec9b
Author: Wanpeng Li 
AuthorDate: Wed, 13 May 2015 14:01:02 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 19 Jun 2015 10:06:46 +0200

sched/deadline: Make init_sched_dl_class() __init

It's a bootstrap function; make init_sched_dl_class() __init.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1431496867-4194-2-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9cbe1c7..1c4bc31 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1685,7 +1685,7 @@ static void rq_offline_dl(struct rq *rq)
cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu);
 }
 
-void init_sched_dl_class(void)
+void __init init_sched_dl_class(void)
 {
unsigned int i;
 


[tip:sched/core] sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target

2015-06-19 Thread tip-bot for Wanpeng Li
Commit-ID:  9d514262425691dddf942edea8bc9919e66fe140
Gitweb: http://git.kernel.org/tip/9d514262425691dddf942edea8bc9919e66fe140
Author: Wanpeng Li 
AuthorDate: Wed, 13 May 2015 14:01:03 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 19 Jun 2015 10:06:46 +0200

sched/deadline: Reduce rq lock contention by eliminating locking of 
non-feasible target

This patch adds a check that prevents futile attempts to move DL tasks
to a CPU with active tasks of equal or earlier deadline. This mirrors
what commit 80e3d87b2c55 ("sched/rt: Reduce rq lock contention
by eliminating locking of non-feasible target") did for the RT class.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1431496867-4194-3-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1c4bc31..98f7871 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1012,7 +1012,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int 
sd_flag, int flags)
(p->nr_cpus_allowed > 1)) {
int target = find_later_rq(p);
 
-   if (target != -1)
+   if (target != -1 &&
+   dl_time_before(p->dl.deadline,
+   cpu_rq(target)->dl.earliest_dl.curr))
cpu = target;
}
rcu_read_unlock();
@@ -1359,6 +1361,17 @@ static struct rq *find_lock_later_rq(struct task_struct 
*task, struct rq *rq)
 
later_rq = cpu_rq(cpu);
 
+   if (!dl_time_before(task->dl.deadline,
+   later_rq->dl.earliest_dl.curr)) {
+   /*
+* Target rq has tasks of equal or earlier deadline,
+* retrying does not release any lock and is unlikely
+* to yield a different result.
+*/
+   later_rq = NULL;
+   break;
+   }
+
/* Retry if something changed. */
if (double_lock_balance(rq, later_rq)) {
if (unlikely(task_rq(task) != rq ||


[tip:sched/core] sched/deadline: Optimize pull_dl_task()

2015-06-19 Thread tip-bot for Wanpeng Li
Commit-ID:  8b5e770ed7c05a65ffd2d33a83c14572696236dc
Gitweb: http://git.kernel.org/tip/8b5e770ed7c05a65ffd2d33a83c14572696236dc
Author: Wanpeng Li 
AuthorDate: Wed, 13 May 2015 14:01:01 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 19 Jun 2015 10:06:45 +0200

sched/deadline: Optimize pull_dl_task()

pull_dl_task() uses pick_next_earliest_dl_task() to select a migration
candidate; this is sub-optimal since the next earliest task -- as per
the regular runqueue -- might not be migratable at all. This could
result in iterating the entire runqueue looking for a task.

Instead iterate the pushable queue -- this queue only contains tasks
that have at least 2 cpus set in their cpus_allowed mask.

Signed-off-by: Wanpeng Li 
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1431496867-4194-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 890ce95..9cbe1c7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1230,6 +1230,32 @@ next_node:
return NULL;
 }
 
+/*
+ * Return the earliest pushable rq's task, which is suitable to be executed
+ * on the CPU, NULL otherwise:
+ */
+static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int 
cpu)
+{
+   struct rb_node *next_node = rq->dl.pushable_dl_tasks_leftmost;
+   struct task_struct *p = NULL;
+
+   if (!has_pushable_dl_tasks(rq))
+   return NULL;
+
+next_node:
+   if (next_node) {
+   p = rb_entry(next_node, struct task_struct, pushable_dl_tasks);
+
+   if (pick_dl_task(rq, p, cpu))
+   return p;
+
+   next_node = rb_next(next_node);
+   goto next_node;
+   }
+
+   return NULL;
+}
+
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
 
 static int find_later_rq(struct task_struct *task)
@@ -1514,7 +1540,7 @@ static int pull_dl_task(struct rq *this_rq)
if (src_rq->dl.dl_nr_running <= 1)
goto skip;
 
-   p = pick_next_earliest_dl_task(src_rq, this_cpu);
+   p = pick_earliest_pushable_dl_task(src_rq, this_cpu);
 
/*
 * We found a task to be pulled if:


[tip:sched/core] sched/deadline: Support DL task migration during CPU hotplug

2015-04-02 Thread tip-bot for Wanpeng Li
Commit-ID:  fa9c9d10e97e38d9903fad1829535175ad261f45
Gitweb: http://git.kernel.org/tip/fa9c9d10e97e38d9903fad1829535175ad261f45
Author: Wanpeng Li 
AuthorDate: Fri, 27 Mar 2015 07:08:35 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 2 Apr 2015 17:42:57 +0200

sched/deadline: Support DL task migration during CPU hotplug

I observed that DL tasks can't be migrated to other CPUs during CPU
hotplug; in addition, the task may or may not run again if the CPU is
added back.

The root cause I found is that DL tasks are throttled and
removed from the DL rq after consuming all their budget, which
leads to the situation that the stop task can't pick them up from the
DL rq and migrate them to other CPUs during hotplug.

The method to reproduce:

  schedtool -E -t 5:10 -e ./test

Actually './test' is just a simple for loop. Then observe which CPU the
test task is on and offline it:

  echo 0 > /sys/devices/system/cpu/cpuN/online

This patch adds the DL task migration during CPU hotplug by finding a
most suitable later deadline rq after DL timer fires if current rq is
offline.

If it fails to find a suitable later deadline rq then it falls back to
any eligible online CPU, so that the deadline task will come back
to us, and the push/pull mechanism should then move it around properly.

Suggested-and-Acked-by: Juri Lelli 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
http://lkml.kernel.org/r/1427411315-4298-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 57 +
 1 file changed, 57 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9d3ad64..5e95145 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -218,6 +218,52 @@ static inline void set_post_schedule(struct rq *rq)
rq->post_schedule = has_pushable_dl_tasks(rq);
 }
 
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
+
+static void dl_task_offline_migration(struct rq *rq, struct task_struct *p)
+{
+   struct rq *later_rq = NULL;
+   bool fallback = false;
+
+   later_rq = find_lock_later_rq(p, rq);
+
+   if (!later_rq) {
+   int cpu;
+
+   /*
+* If we cannot preempt any rq, fall back to pick any
+* online cpu.
+*/
+   fallback = true;
+   cpu = cpumask_any_and(cpu_active_mask, tsk_cpus_allowed(p));
+   if (cpu >= nr_cpu_ids) {
+   /*
+* Fail to find any suitable cpu.
+* The task will never come back!
+*/
+   BUG_ON(dl_bandwidth_enabled());
+
+   /*
+* If admission control is disabled we
+* try a little harder to let the task
+* run.
+*/
+   cpu = cpumask_any(cpu_active_mask);
+   }
+   later_rq = cpu_rq(cpu);
+   double_lock_balance(rq, later_rq);
+   }
+
+   deactivate_task(rq, p, 0);
+   set_task_cpu(p, later_rq->cpu);
+   activate_task(later_rq, p, ENQUEUE_REPLENISH);
+
+   if (!fallback)
+   resched_curr(later_rq);
+
+   double_unlock_balance(rq, later_rq);
+}
+
 #else
 
 static inline
@@ -536,6 +582,17 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer 
*timer)
sched_clock_tick();
update_rq_clock(rq);
 
+#ifdef CONFIG_SMP
+   /*
+* If we find that the rq the task was on is no longer
+* available, we need to select a new rq.
+*/
+   if (unlikely(!rq->online)) {
+   dl_task_offline_migration(rq, p);
+   goto unlock;
+   }
+#endif
+
/*
 * If the throttle happened during sched-out; like:
 *


[tip:sched/core] sched/deadline: Fix rt runtime corruption when dl fails its global constraints

2015-03-27 Thread tip-bot for Wanpeng Li
Commit-ID:  a1963b81deec54c113e770b0020e5f1c3188a087
Gitweb: http://git.kernel.org/tip/a1963b81deec54c113e770b0020e5f1c3188a087
Author: Wanpeng Li 
AuthorDate: Tue, 17 Mar 2015 19:15:31 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 27 Mar 2015 09:36:15 +0100

sched/deadline: Fix rt runtime corruption when dl fails its global constraints

One version of sched_rt_global_constraints() (the !rt-cgroup one)
changes state, so if the later sched_dl_global_constraints() call
fails, the state is left inconsistent.

Fix this by changing the order of the calls.

Signed-off-by: Wanpeng Li 
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Link: 
http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 043e2a1..4b3b688 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7804,7 +7804,7 @@ static int sched_rt_global_constraints(void)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static int sched_dl_global_constraints(void)
+static int sched_dl_global_validate(void)
 {
u64 runtime = global_rt_runtime();
u64 period = global_rt_period();
@@ -7905,11 +7905,11 @@ int sched_rt_handler(struct ctl_table *table, int write,
if (ret)
goto undo;
 
-   ret = sched_rt_global_constraints();
+   ret = sched_dl_global_validate();
if (ret)
goto undo;
 
-   ret = sched_dl_global_constraints();
+   ret = sched_rt_global_constraints();
if (ret)
goto undo;
 


[tip:sched/core] sched/deadline: Avoid a superfluous check

2015-03-27 Thread tip-bot for Wanpeng Li
Commit-ID:  bd4bde14b93cce8fa77765ff709e0be55abdba2c
Gitweb: http://git.kernel.org/tip/bd4bde14b93cce8fa77765ff709e0be55abdba2c
Author: Wanpeng Li 
AuthorDate: Tue, 17 Mar 2015 19:15:30 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 27 Mar 2015 09:36:12 +0100

sched/deadline: Avoid a superfluous check

Since commit 40767b0dc768 ("sched/deadline: Fix deadline parameter
modification handling") we clear the throttled state when switching
away from a DL task, therefore we should never find it set when
switching to a DL task.

Signed-off-by: Wanpeng Li 
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Link: 
http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0a81a95..24c18dc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1665,14 +1665,6 @@ static void switched_to_dl(struct rq *rq, struct 
task_struct *p)
 {
int check_resched = 1;
 
-   /*
-* If p is throttled, don't consider the possibility
-* of preempting rq->curr, the check will be done right
-* after its runtime will get replenished.
-*/
-   if (unlikely(p->dl.dl_throttled))
-   return;
-
if (task_on_rq_queued(p) && rq->curr != p) {
 #ifdef CONFIG_SMP
if (p->nr_cpus_allowed > 1 && rq->dl.overloaded &&


[tip:sched/core] sched/deadline: Add rq->clock update skip for dl task yield

2015-03-10 Thread tip-bot for Wanpeng Li
Commit-ID:  44fb085bfa17628c6d2aaa6af6b292a8499e9cbd
Gitweb: http://git.kernel.org/tip/44fb085bfa17628c6d2aaa6af6b292a8499e9cbd
Author: Wanpeng Li 
AuthorDate: Tue, 10 Mar 2015 12:20:00 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 10 Mar 2015 05:46:50 +0100

sched/deadline: Add rq->clock update skip for dl task yield

This patch adds rq->clock update skip for SCHED_DEADLINE task yield,
to tell update_rq_clock() that we've just updated the clock, so that
we don't do a microscopic update in schedule() and double the
fastpath cost.

Signed-off-by: Wanpeng Li 
Cc: Juri Lelli 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1425961200-3809-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3fa8fa6..0a81a95 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -914,6 +914,12 @@ static void yield_task_dl(struct rq *rq)
}
update_rq_clock(rq);
update_curr_dl(rq);
+   /*
+* Tell update_rq_clock() that we've just updated,
+* so we don't do microscopic update in schedule()
+* and double the fastpath cost.
+*/
+   rq_clock_skip_update(rq, true);
 }
 
 #ifdef CONFIG_SMP


[tip:sched/core] sched: Fix hrtick_start() on UP

2015-02-04 Thread tip-bot for Wanpeng Li
Commit-ID:  868933359a3bdda25b562e9d41bce7071edc1b08
Gitweb: http://git.kernel.org/tip/868933359a3bdda25b562e9d41bce7071edc1b08
Author: Wanpeng Li 
AuthorDate: Wed, 26 Nov 2014 08:44:06 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 4 Feb 2015 07:52:28 +0100

sched: Fix hrtick_start() on UP

Commit 177ef2a6315e ("sched/deadline: Fix a precision problem in
the microseconds range") forgot to change the UP version of
hrtick_start(); do so now.

Signed-off-by: Wanpeng Li 
Fixes: 177ef2a6315e ("sched/deadline: Fix a precision problem in the 
microseconds range")
[ Fixed the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d59652d..5091fd4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -492,6 +492,11 @@ static __init void init_hrtick(void)
  */
 void hrtick_start(struct rq *rq, u64 delay)
 {
+   /*
+* Don't schedule slices shorter than 1ns, that just
+* doesn't make sense. Rely on vruntime for fairness.
+*/
+   delay = max_t(u64, delay, 1LL);
__hrtimer_start_range_ns(&rq->hrtick_timer, ns_to_ktime(delay), 0,
HRTIMER_MODE_REL_PINNED, 0);
 }


[tip:sched/core] sched/deadline: Avoid pointless __setscheduler()

2015-02-04 Thread tip-bot for Wanpeng Li
Commit-ID:  75381608e8410a72ae8b4080849dc86b472c01fb
Gitweb: http://git.kernel.org/tip/75381608e8410a72ae8b4080849dc86b472c01fb
Author: Wanpeng Li 
AuthorDate: Wed, 26 Nov 2014 08:44:04 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 4 Feb 2015 07:52:27 +0100

sched/deadline: Avoid pointless __setscheduler()

There is no need to dequeue/enqueue and push/pull if the scheduling
parameters have not changed for the DL class.

Both the fair and RT classes already check whether their parameters have
changed in order to avoid unnecessary overhead. This patch adds the same
parameters-changed test for the DL class in order to reduce overhead.

Signed-off-by: Wanpeng Li 
[ Fixed up the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1416962647-76792-5-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 50a5352..d59652d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3417,6 +3417,20 @@ static bool check_same_owner(struct task_struct *p)
return match;
 }
 
+static bool dl_param_changed(struct task_struct *p,
+   const struct sched_attr *attr)
+{
+   struct sched_dl_entity *dl_se = &p->dl;
+
+   if (dl_se->dl_runtime != attr->sched_runtime ||
+   dl_se->dl_deadline != attr->sched_deadline ||
+   dl_se->dl_period != attr->sched_period ||
+   dl_se->flags != attr->sched_flags)
+   return true;
+
+   return false;
+}
+
 static int __sched_setscheduler(struct task_struct *p,
const struct sched_attr *attr,
bool user)
@@ -3545,7 +3559,7 @@ recheck:
goto change;
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
goto change;
-   if (dl_policy(policy))
+   if (dl_policy(policy) && dl_param_changed(p, attr))
goto change;
 
p->sched_reset_on_fork = reset_on_fork;


[tip:sched/core] sched/deadline: Fix hrtick for a non-leftmost task

2015-02-04 Thread tip-bot for Wanpeng Li
Commit-ID:  a7bebf488791aa1036f3e6629daf01d01f705dcb
Gitweb: http://git.kernel.org/tip/a7bebf488791aa1036f3e6629daf01d01f705dcb
Author: Wanpeng Li 
AuthorDate: Wed, 26 Nov 2014 08:44:01 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 4 Feb 2015 07:52:25 +0100

sched/deadline: Fix hrtick for a non-leftmost task

After update_curr_dl() the current task might not be the leftmost task
anymore. In that case do not start a new hrtick for it.

In this case NEED_RESCHED will be set and the next schedule will start
the hrtick for the new task if and when appropriate.

Signed-off-by: Wanpeng Li 
Acked-by: Juri Lelli 
[ Rewrote the changelog and comment. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1416962647-76792-2-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e0e9c29..7b684f9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1073,7 +1073,13 @@ static void task_tick_dl(struct rq *rq, struct 
task_struct *p, int queued)
 {
update_curr_dl(rq);
 
-   if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+   /*
+* Even when we have runtime, update_curr_dl() might have resulted in us
+* not being the leftmost task anymore. In that case NEED_RESCHED will
+* be set and schedule() will start a new hrtick for the next task.
+*/
+   if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 &&
+   is_leftmost(p, &rq->dl))
start_hrtick_dl(rq, p);
 }
 


[tip:sched/core] sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK

2014-11-16 Thread tip-bot for Wanpeng Li
Commit-ID:  36ce98818a4df66c8134c31fd6e768b4119c7a90
Gitweb: http://git.kernel.org/tip/36ce98818a4df66c8134c31fd6e768b4119c7a90
Author: Wanpeng Li 
AuthorDate: Tue, 11 Nov 2014 09:52:26 +0800
Committer:  Ingo Molnar 
CommitDate: Sun, 16 Nov 2014 10:59:03 +0100

sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK

Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with
the fair class.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9594c12..e5db8c6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1013,6 +1013,10 @@ static void start_hrtick_dl(struct rq *rq, struct 
task_struct *p)
 {
hrtick_start(rq, p->dl.runtime);
 }
+#else /* !CONFIG_SCHED_HRTICK */
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+}
 #endif
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1066,10 +1070,8 @@ struct task_struct *pick_next_task_dl(struct rq *rq, 
struct task_struct *prev)
/* Running task will never be pushed. */
dequeue_pushable_dl_task(rq, p);
 
-#ifdef CONFIG_SCHED_HRTICK
if (hrtick_enabled(rq))
start_hrtick_dl(rq, p);
-#endif
 
set_post_schedule(rq);
 
@@ -1088,10 +1090,8 @@ static void task_tick_dl(struct rq *rq, struct 
task_struct *p, int queued)
 {
update_curr_dl(rq);
 
-#ifdef CONFIG_SCHED_HRTICK
if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
start_hrtick_dl(rq, p);
-#endif
 }
 
 static void task_fork_dl(struct task_struct *p)


[tip:sched/core] sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()

2014-11-16 Thread tip-bot for Wanpeng Li
Commit-ID:  c51b8ab5ad972df26fd9c0ffad34870e98273c4c
Gitweb: http://git.kernel.org/tip/c51b8ab5ad972df26fd9c0ffad34870e98273c4c
Author: Wanpeng Li 
AuthorDate: Thu, 6 Nov 2014 15:22:44 +0800
Committer:  Ingo Molnar 
CommitDate: Sun, 16 Nov 2014 10:58:57 +0100

sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()

Do not call dequeue_pushable_dl_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.

This is the same behavior as commit 311e800e16f6 ("sched, rt: Fix
rq->rt.pushable_tasks bug in push_rt_task()") on the -rt side.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index bb1464b..9594c12 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1328,6 +1328,7 @@ static int push_dl_task(struct rq *rq)
 {
struct task_struct *next_task;
struct rq *later_rq;
+   int ret = 0;
 
if (!rq->dl.overloaded)
return 0;
@@ -1373,7 +1374,6 @@ retry:
 * The task is still there. We don't try
 * again, some other cpu will pull it when ready.
 */
-   dequeue_pushable_dl_task(rq, next_task);
goto out;
}
 
@@ -1389,6 +1389,7 @@ retry:
deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, later_rq->cpu);
activate_task(later_rq, next_task, 0);
+   ret = 1;
 
resched_curr(later_rq);
 
@@ -1397,7 +1398,7 @@ retry:
 out:
put_task_struct(next_task);
 
-   return 1;
+   return ret;
 }
 
 static void push_dl_tasks(struct rq *rq)
--
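As a rough sketch of the corrected contract in plain user-space C: the helpers below are made-up stand-ins, not the kernel's code; the point is that the function now returns 1 only when a task was actually migrated, and a task that could not be pushed is left queued as pushable.

#include <stdio.h>
#include <stdbool.h>

static bool find_later_rq(void)     { return false; } /* pretend: no target CPU */
static void migrate_next_task(void) { }

/* Return 1 only if a task was actually pushed.  Before the fix the
 * function returned 1 unconditionally and also dequeued a task that
 * was still pushable. */
static int push_dl_task(void)
{
        int ret = 0;

        if (!find_later_rq()) {
                /* Task stays pushable: leave it queued and retry later. */
                goto out;
        }

        migrate_next_task();
        ret = 1;
out:
        return ret;
}

int main(void)
{
        printf("pushed: %d\n", push_dl_task());   /* 0: nothing was pushed */
        return 0;
}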


[tip:sched/core] sched/fair: Fix stale overloaded status in the busiest group finding logic

2014-11-16 Thread tip-bot for Wanpeng Li
Commit-ID:  cb0b9f2445cdf9893352e4548582a2892af7137c
Gitweb: http://git.kernel.org/tip/cb0b9f2445cdf9893352e4548582a2892af7137c
Author: Wanpeng Li 
AuthorDate: Wed, 5 Nov 2014 07:44:50 +0800
Committer:  Ingo Molnar 
CommitDate: Sun, 16 Nov 2014 10:58:56 +0100

sched/fair: Fix stale overloaded status in the busiest group finding logic

Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changed groups to be ranked in the order
overloaded > imbalanced > other, and the busiest group is picked
according to this order.

sgs->group_capacity_factor is used to check if the group is overloaded.

When the child domain prefers tasks to go to siblings first, the
sgs->group_capacity_factor is capped at one in order to move all the
excess tasks away.

However, the group's overloaded status is not updated when
sgs->group_capacity_factor is capped, which leads to us failing to
find the busiest group.

This patch fixes it by re-classifying the group (updating its overloaded
status) when its capacity factor is capped at one, so that the busiest
group is found accurately.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Vincent Guittot 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng...@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8bca292..df2cdf7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6352,8 +6352,10 @@ static inline void update_sd_lb_stats(struct lb_env 
*env, struct sd_lb_stats *sd
 * with a large weight task outweighs the tasks on the system).
 */
if (prefer_sibling && sds->local &&
-   sds->local_stat.group_has_free_capacity)
+   sds->local_stat.group_has_free_capacity) {
sgs->group_capacity_factor = 
min(sgs->group_capacity_factor, 1U);
+   sgs->group_type = group_classify(sg, sgs);
+   }
 
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
--
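A simplified model of the problem, with made-up types and a toy classify() in place of the kernel's group_classify() and sg_lb_stats: once the capacity factor is capped, the cached group_type must be recomputed or the group never ranks as overloaded.

#include <stdio.h>

enum group_type { group_other = 0, group_imbalanced, group_overloaded };

struct sg_stats {
        unsigned int sum_nr_running;
        unsigned int group_capacity_factor;
        enum group_type group_type;
};

static enum group_type classify(const struct sg_stats *sgs)
{
        if (sgs->sum_nr_running > sgs->group_capacity_factor)
                return group_overloaded;
        return group_other;
}

int main(void)
{
        struct sg_stats sgs = {
                .sum_nr_running = 2,
                .group_capacity_factor = 2,
        };

        sgs.group_type = classify(&sgs);          /* group_other */

        /* the prefer_sibling path caps the capacity factor ... */
        if (sgs.group_capacity_factor > 1)
                sgs.group_capacity_factor = 1;

        /* ... so group_type must be recomputed, otherwise the group is
         * never ranked as overloaded and the busiest-group search misses it */
        sgs.group_type = classify(&sgs);          /* now group_overloaded */

        printf("group_type = %d\n", sgs.group_type);
        return 0;
}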


[tip:sched/core] sched: Move p->nr_cpus_allowed check to select_task_rq()

2014-11-16 Thread tip-bot for Wanpeng Li
Commit-ID:  6c1d9410f007a26d13173cf17204cfd965f49b83
Gitweb: http://git.kernel.org/tip/6c1d9410f007a26d13173cf17204cfd965f49b83
Author: Wanpeng Li 
AuthorDate: Wed, 5 Nov 2014 09:14:37 +0800
Committer:  Ingo Molnar 
CommitDate: Sun, 16 Nov 2014 10:58:55 +0100

sched: Move p->nr_cpus_allowed check to select_task_rq()

Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: "pang.xunlei" 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 3 ++-
 kernel/sched/deadline.c | 3 ---
 kernel/sched/fair.c | 3 ---
 kernel/sched/rt.c   | 3 ---
 4 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ccdce1..d44d0c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1411,7 +1411,8 @@ out:
 static inline
 int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int 
wake_flags)
 {
-   cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
+   if (p->nr_cpus_allowed > 1)
+   cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, 
wake_flags);
 
/*
 * In order not to call set_task_cpu() on a blocking task we need
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b091179..bb1464b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -928,9 +928,6 @@ select_task_rq_dl(struct task_struct *p, int cpu, int 
sd_flag, int flags)
struct task_struct *curr;
struct rq *rq;
 
-   if (p->nr_cpus_allowed == 1)
-   goto out;
-
if (sd_flag != SD_BALANCE_WAKE)
goto out;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d11c57d..8bca292 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,9 +4730,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
 
-   if (p->nr_cpus_allowed == 1)
-   return prev_cpu;
-
if (sd_flag & SD_BALANCE_WAKE)
want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1bb92f..ee15f5a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1301,9 +1301,6 @@ select_task_rq_rt(struct task_struct *p, int cpu, int 
sd_flag, int flags)
struct task_struct *curr;
struct rq *rq;
 
-   if (p->nr_cpus_allowed == 1)
-   goto out;
-
/* For anything but wake ups, just return the task_cpu */
if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
goto out;
--
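A minimal sketch of the idea, using stand-in types and a function pointer in place of the sched_class hook: the pinned-task check lives in the common caller, so each class-specific select function can drop its own early return.

#include <stdio.h>

struct task_struct {
        int nr_cpus_allowed;
        int (*class_select)(int cpu);   /* stand-in for the per-class hook */
};

static int select_task_rq(struct task_struct *p, int cpu)
{
        /* Pinned tasks never consult the class-specific logic ... */
        if (p->nr_cpus_allowed > 1)
                cpu = p->class_select(cpu);
        /* ... so fair/rt/deadline no longer need their own check. */
        return cpu;
}

static int select_fair(int cpu) { return cpu + 1; /* pretend balancing moved it */ }

int main(void)
{
        struct task_struct pinned  = { .nr_cpus_allowed = 1, .class_select = select_fair };
        struct task_struct movable = { .nr_cpus_allowed = 4, .class_select = select_fair };

        printf("pinned -> cpu %d, movable -> cpu %d\n",
               select_task_rq(&pinned, 0), select_task_rq(&movable, 0));
        return 0;
}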


[tip:sched/core] sched/deadline: Don't check CONFIG_SMP in switched_from_dl()

2014-11-04 Thread tip-bot for Wanpeng Li
Commit-ID:  cad3bb32e181c286c46ec12b2deb1f26a6f9835d
Gitweb: http://git.kernel.org/tip/cad3bb32e181c286c46ec12b2deb1f26a6f9835d
Author: Wanpeng Li 
AuthorDate: Fri, 31 Oct 2014 06:39:36 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Nov 2014 07:17:56 +0100

sched/deadline: Don't check CONFIG_SMP in switched_from_dl()

There are both UP and SMP versions of pull_dl_task(), so there is no need
to check CONFIG_SMP in switched_from_dl().

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 362ab1f..f3d7776 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1637,7 +1637,6 @@ static void switched_from_dl(struct rq *rq, struct 
task_struct *p)
 
__dl_clear_params(p);
 
-#ifdef CONFIG_SMP
/*
 * Since this might be the only -deadline task on the rq,
 * this is the right place to try to pull some other one
@@ -1648,7 +1647,6 @@ static void switched_from_dl(struct rq *rq, struct 
task_struct *p)
 
if (pull_dl_task(rq))
resched_curr(rq);
-#endif
 }
 
 /*
--


[tip:sched/core] sched/deadline: Fix artificial overrun introduced by yield_task_dl()

2014-11-04 Thread tip-bot for Wanpeng Li
Commit-ID:  804968809c321066cca028d4cbd533a420f964bc
Gitweb: http://git.kernel.org/tip/804968809c321066cca028d4cbd533a420f964bc
Author: Wanpeng Li 
AuthorDate: Fri, 31 Oct 2014 06:39:32 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Nov 2014 07:17:53 +0100

sched/deadline: Fix artificial overrun introduced by yield_task_dl()

The yield semantics of the deadline class are to reduce the remaining
runtime to zero, after which update_curr_dl() will stop the task. However,
the consumed bandwidth is subtracted from the yielding task's budget again
even though it has already been set to zero, which leads to an artificial
overrun. This patch fixes it by making sure we don't steal any more time
from the task that yielded in update_curr_dl().

Suggested-by: Juri Lelli 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9d483e8..c047a94 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -628,7 +628,7 @@ static void update_curr_dl(struct rq *rq)
 
sched_rt_avg_update(rq, delta_exec);
 
-   dl_se->runtime -= delta_exec;
+   dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
if (dl_runtime_exceeded(rq, dl_se)) {
__dequeue_task_dl(rq, curr, 0);
if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
--
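A small worked sketch of the fix, with a stand-in struct in place of sched_dl_entity: once a task has yielded and its runtime is already zero, update_curr_dl() must not charge it again, otherwise it appears to have overrun.

#include <stdio.h>

struct dl_se { long runtime; int dl_yielded; };

static void update_curr_dl(struct dl_se *dl, long delta_exec)
{
        /* Before the fix: runtime -= delta_exec even after yield_task_dl()
         * had already set it to 0, so the entity looked like it overran. */
        dl->runtime -= dl->dl_yielded ? 0 : delta_exec;
}

int main(void)
{
        struct dl_se dl = { .runtime = 0, .dl_yielded = 1 };  /* just yielded */

        update_curr_dl(&dl, 50000);
        printf("runtime after tick: %ld (stays 0, no artificial overrun)\n",
               dl.runtime);
        return 0;
}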


[tip:sched/core] sched/deadline: Reschedule from switched_from_dl() after a successful pull

2014-11-04 Thread tip-bot for Wanpeng Li
Commit-ID:  cd66091162d34f589631a23bbe0ed214798245b4
Gitweb: http://git.kernel.org/tip/cd66091162d34f589631a23bbe0ed214798245b4
Author: Wanpeng Li 
AuthorDate: Fri, 31 Oct 2014 06:39:35 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Nov 2014 07:17:55 +0100

sched/deadline: Reschedule from switched_from_dl() after a successful pull

In switched_from_dl() we have to issue a resched if we successfully
pulled some task from other cpus. This patch also aligns the behavior
with -rt.

Suggested-by: Juri Lelli 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e7779b3..362ab1f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1643,8 +1643,11 @@ static void switched_from_dl(struct rq *rq, struct 
task_struct *p)
 * this is the right place to try to pull some other one
 * from an overloaded cpu, if any.
 */
-   if (!rq->dl.dl_nr_running)
-   pull_dl_task(rq);
+   if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+   return;
+
+   if (pull_dl_task(rq))
+   resched_curr(rq);
 #endif
 }
 
--
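A simplified model of the reworked flow; the runqueue struct and the pull/resched helpers below are fakes used only to show the bail-out conditions and the reschedule after a successful pull.

#include <stdio.h>
#include <stdbool.h>

struct rq { int dl_nr_running; bool need_resched; };

static bool pull_dl_task(struct rq *rq) { return true; /* pretend we pulled one */ }
static void resched_curr(struct rq *rq) { rq->need_resched = true; }

static void switched_from_dl(struct rq *rq, bool task_queued)
{
        /* Only pull when the departing task was queued and no other
         * deadline task remains on this runqueue. */
        if (!task_queued || rq->dl_nr_running)
                return;

        /* A successful pull must also trigger a reschedule, matching -rt. */
        if (pull_dl_task(rq))
                resched_curr(rq);
}

int main(void)
{
        struct rq rq = { .dl_nr_running = 0, .need_resched = false };

        switched_from_dl(&rq, true);
        printf("need_resched = %d\n", rq.need_resched);   /* 1 */
        return 0;
}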


[tip:sched/core] sched/deadline: Push task away if the deadline is equal to curr during wakeup

2014-11-04 Thread tip-bot for Wanpeng Li
Commit-ID:  6b0a563f3a534827c1b56e53c3fd0fccec3c7895
Gitweb: http://git.kernel.org/tip/6b0a563f3a534827c1b56e53c3fd0fccec3c7895
Author: Wanpeng Li 
AuthorDate: Fri, 31 Oct 2014 06:39:34 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Nov 2014 07:17:55 +0100

sched/deadline: Push task away if the deadline is equal to curr during wakeup

Push the task away during wakeup if its deadline is equal to the current
task's; this is the same behavior as the rt class.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8867a67..e7779b3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1506,7 +1506,7 @@ static void task_woken_dl(struct rq *rq, struct 
task_struct *p)
p->nr_cpus_allowed > 1 &&
dl_task(rq->curr) &&
(rq->curr->nr_cpus_allowed < 2 ||
-dl_entity_preempt(&rq->curr->dl, &p->dl))) {
+!dl_entity_preempt(&p->dl, &rq->curr->dl))) {
push_dl_tasks(rq);
}
 }
--
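To see why the condition changed, here is a small sketch that assumes dl_entity_preempt(a, b) means "a's absolute deadline is strictly earlier than b's" (a simplification of the real helper). With equal deadlines the old test did not push the wakee, while the new one does.

#include <stdio.h>
#include <stdbool.h>

struct dl { unsigned long long deadline; };

static bool dl_entity_preempt(const struct dl *a, const struct dl *b)
{
        return a->deadline < b->deadline;   /* simplified */
}

int main(void)
{
        struct dl curr  = { .deadline = 100 };
        struct dl wakee = { .deadline = 100 };   /* equal deadlines */

        /* old condition: push only if curr preempts the wakee -> false here */
        bool old_push = dl_entity_preempt(&curr, &wakee);
        /* new condition: push whenever the wakee cannot preempt curr */
        bool new_push = !dl_entity_preempt(&wakee, &curr);

        printf("old: %d, new: %d\n", old_push, new_push);   /* old: 0, new: 1 */
        return 0;
}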


[tip:sched/core] sched/rt: Clean up check_preempt_equal_prio()

2014-11-04 Thread tip-bot for Wanpeng Li
Commit-ID:  308a623a40ce168eb234ea82c2bd13ff85a098d9
Gitweb: http://git.kernel.org/tip/308a623a40ce168eb234ea82c2bd13ff85a098d9
Author: Wanpeng Li 
AuthorDate: Fri, 31 Oct 2014 06:39:31 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Nov 2014 07:17:52 +0100

sched/rt: Clean up check_preempt_equal_prio()

This patch checks in advance whether current can be pushed/pulled
somewhere else, to make the logic clearer; this is the same behavior
as the dl class.

- If current can't be migrated, it is useless to reschedule; let's hope
  the task can move out.
- If the task is migratable, let's not reschedule it and see if it
  can be pushed or pulled somewhere else.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Steven Rostedt 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/rt.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d024e6c..3d14312 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1351,16 +1351,22 @@ out:
 
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-   if (rq->curr->nr_cpus_allowed == 1)
+   /*
+* Current can't be migrated, useless to reschedule,
+* let's hope p can move out.
+*/
+   if (rq->curr->nr_cpus_allowed == 1 ||
+   !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
return;
 
+   /*
+* p is migratable, so let's not schedule it and
+* see if it is pushed or pulled somewhere else.
+*/
if (p->nr_cpus_allowed != 1
&& cpupri_find(&rq->rd->cpupri, p, NULL))
return;
 
-   if (!cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
-   return;
-
/*
 * There appears to be other cpus that can accept
 * current and none to run 'p', so lets reschedule
--
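A compact sketch of the reordered logic; cpupri_find() is faked here as "the task has more than one allowed CPU", which is enough to show the three cases the changelog describes.

#include <stdio.h>
#include <stdbool.h>

static bool cpupri_find(int nr_cpus_allowed) { return nr_cpus_allowed > 1; }

static bool should_resched(int curr_cpus, int p_cpus)
{
        /* current is pinned or has nowhere to go: rescheduling is useless */
        if (curr_cpus == 1 || !cpupri_find(curr_cpus))
                return false;

        /* p is migratable and has a target CPU: let push/pull handle it */
        if (p_cpus != 1 && cpupri_find(p_cpus))
                return false;

        return true;    /* otherwise reschedule so current can be pushed */
}

int main(void)
{
        printf("%d %d %d\n",
               should_resched(1, 4),    /* 0: current pinned */
               should_resched(4, 4),    /* 0: p can go elsewhere */
               should_resched(4, 1));   /* 1: push current, keep p here */
        return 0;
}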


[tip:sched/core] sched/deadline: Add deadline rq status print

2014-11-04 Thread tip-bot for Wanpeng Li
Commit-ID:  acb32132ec0433c03bed750f3e9508dc29db0328
Gitweb: http://git.kernel.org/tip/acb32132ec0433c03bed750f3e9508dc29db0328
Author: Wanpeng Li 
AuthorDate: Fri, 31 Oct 2014 06:39:33 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 4 Nov 2014 07:17:54 +0100

sched/deadline: Add deadline rq status print

Add a deadline runqueue status printout to the scheduler debug output.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 9 +
 kernel/sched/debug.c| 7 +++
 kernel/sched/sched.h| 1 +
 3 files changed, 17 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c047a94..8867a67 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1747,3 +1747,12 @@ const struct sched_class dl_sched_class = {
.switched_from  = switched_from_dl,
.switched_to= switched_to_dl,
 };
+
+#ifdef CONFIG_SCHED_DEBUG
+extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
+
+void print_dl_stats(struct seq_file *m, int cpu)
+{
+   print_dl_rq(m, cpu, &cpu_rq(cpu)->dl);
+}
+#endif /* CONFIG_SCHED_DEBUG */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ce33780..eeb6046 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -261,6 +261,12 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq 
*rt_rq)
 #undef P
 }
 
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+   SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+   SEQ_printf(m, "  .%-30s: %ld\n", "dl_nr_running", dl_rq->dl_nr_running);
+}
+
 extern __read_mostly int sched_clock_running;
 
 static void print_cpu(struct seq_file *m, int cpu)
@@ -329,6 +335,7 @@ do {
\
spin_lock_irqsave(&sched_debug_lock, flags);
print_cfs_stats(m, cpu);
print_rt_stats(m, cpu);
+   print_dl_stats(m, cpu);
 
print_rq(m, rq, cpu);
spin_unlock_irqrestore(&sched_debug_lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 49b941f..7e5c1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1537,6 +1537,7 @@ extern struct sched_entity *__pick_first_entity(struct 
cfs_rq *cfs_rq);
 extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
 extern void print_cfs_stats(struct seq_file *m, int cpu);
 extern void print_rt_stats(struct seq_file *m, int cpu);
+extern void print_dl_stats(struct seq_file *m, int cpu);
 
 extern void init_cfs_rq(struct cfs_rq *cfs_rq);
 extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
--


[tip:sched/core] sched/deadline: Don't balance during wakeup if wakee is pinned

2014-10-28 Thread tip-bot for Wanpeng Li
Commit-ID:  f4e9d94a5bf60193d45f92b136e3d166be3ec8d5
Gitweb: http://git.kernel.org/tip/f4e9d94a5bf60193d45f92b136e3d166be3ec8d5
Author: Wanpeng Li 
AuthorDate: Tue, 14 Oct 2014 10:22:40 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 28 Oct 2014 10:48:02 +0100

sched/deadline: Don't balance during wakeup if wakee is pinned

Use nr_cpus_allowed to bail out of select_task_rq() when only one CPU
can be used; this saves some cycles for pinned tasks.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fab3bf8..2e31a30 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -933,6 +933,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int 
sd_flag, int flags)
struct task_struct *curr;
struct rq *rq;
 
+   if (p->nr_cpus_allowed == 1)
+   goto out;
+
if (sd_flag != SD_BALANCE_WAKE)
goto out;
 
--


[tip:sched/core] sched/deadline: Don't check SD_BALANCE_FORK

2014-10-28 Thread tip-bot for Wanpeng Li
Commit-ID:  1d7e974cbf2fce2683f34ff33c173fd7ef5478c7
Gitweb: http://git.kernel.org/tip/1d7e974cbf2fce2683f34ff33c173fd7ef5478c7
Author: Wanpeng Li 
AuthorDate: Tue, 14 Oct 2014 10:22:39 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 28 Oct 2014 10:48:01 +0100

sched/deadline: Don't check SD_BALANCE_FORK

There is no need to balance during fork since SCHED_DEADLINE
tasks can't fork. This patch avoids the SD_BALANCE_FORK check.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng...@linux.intel.com
Cc: Linus Torvalds 
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8aaa971..fab3bf8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -933,7 +933,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int 
sd_flag, int flags)
struct task_struct *curr;
struct rq *rq;
 
-   if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+   if (sd_flag != SD_BALANCE_WAKE)
goto out;
 
rq = cpu_rq(cpu);
--


[tip:sched/core] sched/deadline: Do not try to push tasks if pinned task switches to dl

2014-10-28 Thread tip-bot for Wanpeng Li
Commit-ID:  d9aade7ae1d283097a3f626790e7c325a5c69007
Gitweb: http://git.kernel.org/tip/d9aade7ae1d283097a3f626790e7c325a5c69007
Author: Wanpeng Li 
AuthorDate: Wed, 22 Oct 2014 08:36:43 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 28 Oct 2014 10:47:57 +0100

sched/deadline: Do not try to push tasks if pinned task switches to dl

As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118):

 | If rq has already had 2 or more pushable tasks and we try to add a
 | pinned task then call of push_rt_task will just waste a time.

A pinned task that has just switched class cannot be pushed. If the rq
already had several dl tasks before, they have already been considered
as candidates to be pushed (or pulled). This patch implements the same
behavior as the rt class, introduced by commit 10447917551e ("sched/rt:
Do not try to push tasks if pinned task switches to RT").

Suggested-by: Kirill V Tkhai 
Acked-by: Juri Lelli 
Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Steven Rostedt 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/deadline.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5285332..9d1e76a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1622,7 +1622,8 @@ static void switched_to_dl(struct rq *rq, struct 
task_struct *p)
 
if (task_on_rq_queued(p) && rq->curr != p) {
 #ifdef CONFIG_SMP
-   if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
+   if (p->nr_cpus_allowed > 1 && rq->dl.overloaded &&
+   push_dl_task(rq) && rq != task_rq(p))
/* Only reschedule if pushing failed */
check_resched = 0;
 #endif /* CONFIG_SMP */
--
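A tiny sketch of the short-circuit this adds, with fake helpers: for a pinned task (nr_cpus_allowed == 1) push_dl_task() is never even attempted.

#include <stdio.h>
#include <stdbool.h>

static int pushes;
static bool push_dl_task(void) { pushes++; return false; }

static void switched_to_dl(int nr_cpus_allowed, bool overloaded)
{
        if (nr_cpus_allowed > 1 && overloaded && push_dl_task())
                return;   /* pushed: the real code skips the reschedule */
        /* otherwise fall through to the reschedule path (omitted here) */
}

int main(void)
{
        switched_to_dl(1, true);   /* pinned: push_dl_task() is never called */
        switched_to_dl(4, true);   /* migratable: a push is attempted */
        printf("push attempts: %d\n", pushes);   /* 1 */
        return 0;
}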


[tip:sched/urgent] sched: Fix unreleased llc_shared_mask bit during CPU hotplug

2014-09-24 Thread tip-bot for Wanpeng Li
Commit-ID:  03bd4e1f7265548832a76e7919a81f3137c44fd1
Gitweb: http://git.kernel.org/tip/03bd4e1f7265548832a76e7919a81f3137c44fd1
Author: Wanpeng Li 
AuthorDate: Wed, 24 Sep 2014 16:38:05 +0800
Committer:  Ingo Molnar 
CommitDate: Wed, 24 Sep 2014 15:13:20 +0200

sched: Fix unreleased llc_shared_mask bit during CPU hotplug

The following bug can be triggered by hot adding and removing a large number of
xen domain0's vcpus repeatedly:

BUG: unable to handle kernel NULL pointer dereference at 
0004 IP: [..] find_busiest_group
PGD 5a9d5067 PUD 13067 PMD 0
Oops:  [#3] SMP
[...]
Call Trace:
load_balance
? _raw_spin_unlock_irqrestore
idle_balance
__schedule
schedule
schedule_timeout
? lock_timer_base
schedule_timeout_uninterruptible
msleep
lock_device_hotplug_sysfs
online_store
dev_attr_store
sysfs_write_file
vfs_write
SyS_write
system_call_fastpath

Last level cache shared mask is built during CPU up and the
build_sched_domain() routine takes advantage of it to setup
the sched domain CPU topology.

However, llc_shared_mask is not released during CPU disable,
which leads to an invalid sched domain CPU topology.

This patch fixes it by releasing the llc_shared_mask correctly
during CPU disable.

Yasuaki also reported that this can happen on real hardware:

  https://lkml.org/lkml/2014/7/22/1018

His case is here:

==
Here is an example on my system.
My system has 4 sockets and each socket has 15 cores and HT is
enabled. In this case, each core of the sockets is numbered as
follows:

 | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-44, 90-104
Socket#3 | 45-59, 105-119

Then llc_shared_mask of CPU#30 has 0x3fff8001fffc000.

It means that last level cache of Socket#2 is shared with
CPU#30-44 and 90-104.

When hot-removing socket#2 and #3, each core of sockets is
numbered as follows:

 | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89

But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30
remains having 0x3fff8001fffc000.

After that, when hot-adding socket#2 and #3, each core of
sockets is numbered as follows:

 | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-59
Socket#3 | 90-119

Then llc_shared_mask of CPU#30 becomes
0x3fff8000fffc000. It means that last level cache of
Socket#2 is shared with CPU#30-59 and 90-104. So the mask has
the wrong value.

Signed-off-by: Wanpeng Li 
Tested-by: Linn Crosetto 
Reviewed-by: Borislav Petkov 
Reviewed-by: Toshi Kani 
Reviewed-by: Yasuaki Ishimatsu 
Cc: 
Cc: David Rientjes 
Cc: Prarit Bhargava 
Cc: Steven Rostedt 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng...@linux.intel.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/smpboot.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2d872e0..42a2dca 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1284,6 +1284,9 @@ static void remove_siblinginfo(int cpu)
 
for_each_cpu(sibling, cpu_sibling_mask(cpu))
cpumask_clear_cpu(cpu, cpu_sibling_mask(sibling));
+   for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
+   cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+   cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(cpu_sibling_mask(cpu));
cpumask_clear(cpu_core_mask(cpu));
c->phys_proc_id = 0;
--
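A toy model of the fix, representing each cpumask as a 64-bit word instead of the kernel's struct cpumask: on offline, the CPU is cleared from every LLC sibling's mask and its own mask is emptied (the helpers mirror cpumask_clear_cpu()/cpumask_clear() only in spirit).

#include <stdio.h>
#include <stdint.h>

#define NCPUS 8
static uint64_t llc_shared[NCPUS];

static void remove_siblinginfo(int cpu)
{
        for (int sibling = 0; sibling < NCPUS; sibling++)
                if (llc_shared[cpu] & (1ull << sibling))
                        llc_shared[sibling] &= ~(1ull << cpu);
        llc_shared[cpu] = 0;
}

int main(void)
{
        /* pretend CPUs 0-3 share one last-level cache */
        for (int c = 0; c < 4; c++)
                llc_shared[c] = 0xfull;

        remove_siblinginfo(2);   /* hot-remove CPU 2 */
        printf("llc_shared[0] = %#llx\n",
               (unsigned long long)llc_shared[0]);   /* 0xb: bit 2 cleared */
        return 0;
}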


[tip:sched/core] sched/numa: Fix period_slot recalculation

2013-12-18 Thread tip-bot for Wanpeng Li
Commit-ID:  e777b63bbd589248eb151a3191ee92331a701385
Gitweb: http://git.kernel.org/tip/e777b63bbd589248eb151a3191ee92331a701385
Author: Wanpeng Li 
AuthorDate: Thu, 12 Dec 2013 15:23:26 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 17 Dec 2013 15:24:41 +0100

sched/numa: Fix period_slot recalculation

The original code is as intended and was meant to scale the difference
between the NUMA_PERIOD_THRESHOLD and local/remote ratio when adjusting
the scan period. The period_slot recalculation can be dropped.

Signed-off-by: Wanpeng Li 
Reviewed-by: Naoya Horiguchi 
Acked-by: Mel Gorman 
Acked-by: David Rientjes 
Signed-off-by: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Rik van Riel 
Link: 
http://lkml.kernel.org/r/1386833006-6600-4-git-send-email-liw...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 37892d7..4316af2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1342,7 +1342,6 @@ static void update_task_scan_period(struct task_struct *p,
 * scanning faster if shared accesses dominate as it may
 * simply bounce migrations uselessly
 */
-   period_slot = DIV_ROUND_UP(diff, NUMA_PERIOD_SLOTS);
ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + 
shared));
diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
}
--
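A worked example (with made-up numbers) of the scaling that remains after this patch: the adjustment is scaled by the private/(private+shared) fault ratio, quantized to NUMA_PERIOD_SLOTS.

#include <stdio.h>

#define NUMA_PERIOD_SLOTS 10
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
        long diff = 600;                        /* raw period adjustment */
        unsigned long private = 3, shared = 1;  /* fault counts, made up */

        unsigned long ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
                                           private + shared);   /* 8 of 10 */
        diff = (diff * ratio) / NUMA_PERIOD_SLOTS;               /* 480 */

        printf("scaled diff = %ld\n", diff);
        return 0;
}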


[tip:sched/core] sched/numa: Use wrapper function task_faults_idx to calculate index in group_faults

2013-12-18 Thread tip-bot for Wanpeng Li
Commit-ID:  82897b4fd3920ffd2456731d4f2ae1406558ef4c
Gitweb: http://git.kernel.org/tip/82897b4fd3920ffd2456731d4f2ae1406558ef4c
Author: Wanpeng Li 
AuthorDate: Thu, 12 Dec 2013 15:23:25 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 17 Dec 2013 15:24:40 +0100

sched/numa: Use wrapper function task_faults_idx to calculate index in 
group_faults

Use the wrapper function task_faults_idx() to calculate the index in group_faults().

Signed-off-by: Wanpeng Li 
Reviewed-by: Naoya Horiguchi 
Acked-by: Mel Gorman 
Acked-by: David Rientjes 
Signed-off-by: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Rik van Riel 
Link: 
http://lkml.kernel.org/r/1386833006-6600-3-git-send-email-liw...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebdb08b..37892d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -921,7 +921,8 @@ static inline unsigned long group_faults(struct task_struct 
*p, int nid)
if (!p->numa_group)
return 0;
 
-   return p->numa_group->faults[2*nid] + p->numa_group->faults[2*nid+1];
+   return p->numa_group->faults[task_faults_idx(nid, 0)] +
+   p->numa_group->faults[task_faults_idx(nid, 1)];
 }
 
 /*
--
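A sketch of what the wrapper hides, assuming task_faults_idx(nid, priv) maps to the flat index 2*nid + priv used by the per-node fault arrays (two slots per node, private and shared); the fault values are made up.

#include <stdio.h>

static inline int task_faults_idx(int nid, int priv)
{
        return 2 * nid + priv;   /* assumed layout: two slots per node */
}

int main(void)
{
        /* node0 private/shared, node1 private/shared */
        unsigned long faults[] = { 5, 7, 11, 13 };
        int nid = 1;

        unsigned long group = faults[task_faults_idx(nid, 0)] +
                              faults[task_faults_idx(nid, 1)];
        printf("group faults on node %d = %lu\n", nid, group);   /* 24 */
        return 0;
}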


[tip:sched/core] sched/numa: Drop sysctl_numa_balancing_settle_count sysctl

2013-12-18 Thread tip-bot for Wanpeng Li
Commit-ID:  1bd53a7efdc988163ec4c25f656df38dbe500632
Gitweb: http://git.kernel.org/tip/1bd53a7efdc988163ec4c25f656df38dbe500632
Author: Wanpeng Li 
AuthorDate: Thu, 12 Dec 2013 15:23:23 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 17 Dec 2013 15:24:38 +0100

sched/numa: Drop sysctl_numa_balancing_settle_count sysctl

Commit 887c290e ("sched/numa: Decide whether to favour task or group weights
based on swap candidate relationships") dropped the check against
sysctl_numa_balancing_settle_count; this patch removes the now-unused sysctl.

Signed-off-by: Wanpeng Li 
Acked-by: Mel Gorman 
Reviewed-by: Rik van Riel 
Acked-by: David Rientjes 
Signed-off-by: Peter Zijlstra 
Cc: Andrew Morton 
Cc: Naoya Horiguchi 
Link: 
http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liw...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar 
---
 Documentation/sysctl/kernel.txt | 5 -
 include/linux/sched/sysctl.h| 1 -
 kernel/sched/fair.c | 9 -
 kernel/sysctl.c | 7 ---
 4 files changed, 22 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 26b7ee4..6d48640 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -428,11 +428,6 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.
 
-numa_balancing_settle_count is how many scan periods must complete before
-the schedule balancer stops pushing the task towards a preferred node. This
-gives the scheduler a chance to place the task on an alternative node if the
-preferred node is overloaded.
-
 numa_balancing_migrate_deferred is how many page migrations get skipped
 unconditionally, after a page migration is skipped because a page is shared
 with other tasks. This reduces page migration overhead, and determines
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 41467f8..31e0193 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -48,7 +48,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
-extern unsigned int sysctl_numa_balancing_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9185f7..fcb6c17 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -872,15 +872,6 @@ static unsigned int task_scan_max(struct task_struct *p)
return max(smin, smax);
 }
 
-/*
- * Once a preferred node is selected the scheduler balancer will prefer moving
- * a task to that node for sysctl_numa_balancing_settle_count number of PTE
- * scans. This will give the process the chance to accumulate more faults on
- * the preferred node but still allow the scheduler to move the task again if
- * the nodes CPUs are overloaded.
- */
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;
-
 static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
 {
rq->nr_numa_running += (p->numa_preferred_nid != -1);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a6047..c8da99f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -385,13 +385,6 @@ static struct ctl_table kern_table[] = {
.proc_handler   = proc_dointvec,
},
{
-   .procname   = "numa_balancing_settle_count",
-   .data   = &sysctl_numa_balancing_settle_count,
-   .maxlen = sizeof(unsigned int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   },
-   {
.procname   = "numa_balancing_migrate_deferred",
.data   = &sysctl_numa_balancing_migrate_deferred,
.maxlen = sizeof(unsigned int),
--


[tip:sched/core] sched/numa: Use wrapper function task_node to get node which task is on

2013-12-18 Thread tip-bot for Wanpeng Li
Commit-ID:  de1b301a19754778ddd9f908d266ffe1c010b2cf
Gitweb: http://git.kernel.org/tip/de1b301a19754778ddd9f908d266ffe1c010b2cf
Author: Wanpeng Li 
AuthorDate: Thu, 12 Dec 2013 15:23:24 +0800
Committer:  Ingo Molnar 
CommitDate: Tue, 17 Dec 2013 15:24:39 +0100

sched/numa: Use wrapper function task_node to get node which task is on

Use the wrapper function task_node() to get the node which a task is on.

Signed-off-by: Wanpeng Li 
Acked-by: Mel Gorman 
Reviewed-by: Naoya Horiguchi 
Reviewed-by: Rik van Riel 
Acked-by: David Rientjes 
Signed-off-by: Peter Zijlstra 
Cc: Andrew Morton 
Link: 
http://lkml.kernel.org/r/1386833006-6600-2-git-send-email-liw...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/debug.c | 2 +-
 kernel/sched/fair.c  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 5c34d18..374fe04 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -139,7 +139,7 @@ print_task(struct seq_file *m, struct rq *rq, struct 
task_struct *p)
0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-   SEQ_printf(m, " %d", cpu_to_node(task_cpu(p)));
+   SEQ_printf(m, " %d", task_node(p));
 #endif
 #ifdef CONFIG_CGROUP_SCHED
SEQ_printf(m, " %s", task_group_path(task_group(p)));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcb6c17..ebdb08b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1202,7 +1202,7 @@ static int task_numa_migrate(struct task_struct *p)
 * elsewhere, so there is no point in (re)trying.
 */
if (unlikely(!sd)) {
-   p->numa_preferred_nid = cpu_to_node(task_cpu(p));
+   p->numa_preferred_nid = task_node(p);
return -EINVAL;
}
 
@@ -1269,7 +1269,7 @@ static void numa_migrate_preferred(struct task_struct *p)
p->numa_migrate_retry = jiffies + HZ;
 
/* Success if task is already running on preferred CPU */
-   if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+   if (task_node(p) == p->numa_preferred_nid)
return;
 
/* Otherwise, try migrate to a CPU on the preferred node */
--
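A sketch assuming task_node(p) is simply cpu_to_node(task_cpu(p)), modelled here with a tiny cpu-to-node table in place of the real topology data.

#include <stdio.h>

static const int cpu_to_node_map[] = { 0, 0, 1, 1 };   /* made-up topology */

struct task_struct { int cpu; };

static int task_cpu(const struct task_struct *p)  { return p->cpu; }
static int cpu_to_node(int cpu)                   { return cpu_to_node_map[cpu]; }
static int task_node(const struct task_struct *p) { return cpu_to_node(task_cpu(p)); }

int main(void)
{
        struct task_struct p = { .cpu = 3 };

        printf("task runs on node %d\n", task_node(&p));   /* 1 */
        return 0;
}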


[tip:sched/core] sched/numa: Drop idx field of task_numa_env struct

2013-12-10 Thread tip-bot for Wanpeng Li
Commit-ID:  40ea2b42d7c44386cf81d5636d574193da2c8df2
Gitweb: http://git.kernel.org/tip/40ea2b42d7c44386cf81d5636d574193da2c8df2
Author: Wanpeng Li 
AuthorDate: Thu, 5 Dec 2013 19:10:17 +0800
Committer:  Ingo Molnar 
CommitDate: Thu, 5 Dec 2013 13:38:36 +0100

sched/numa: Drop idx field of task_numa_env struct

Drop unused idx field of task_numa_env struct.

Signed-off-by: Wanpeng Li 
Reviewed-by: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Naoya Horiguchi 
Cc: linux...@kvack.org
Link: 
http://lkml.kernel.org/r/1386241817-5051-2-git-send-email-liw...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a566c07..49aa01f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1037,7 +1037,7 @@ struct task_numa_env {
 
struct numa_stats src_stats, dst_stats;
 
-   int imbalance_pct, idx;
+   int imbalance_pct;
 
struct task_struct *best_task;
long best_imp;
--