[tip:sched/core] sched/isolation: Prefer housekeeping CPU in local node
Commit-ID:  e0e8d4911ed2695b12c3a01c15634000ede9bc73
Gitweb:     https://git.kernel.org/tip/e0e8d4911ed2695b12c3a01c15634000ede9bc73
Author:     Wanpeng Li
AuthorDate: Fri, 28 Jun 2019 16:51:41 +0800
Committer:  Ingo Molnar
CommitDate: Thu, 25 Jul 2019 15:51:55 +0200

sched/isolation: Prefer housekeeping CPU in local node

In a real product setup, there will be housekeeping CPUs in each node.
It is preferable to do housekeeping from the local node, and to fall
back to the global online cpumask if no housekeeping CPU can be found
in the local node.

Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Frederic Weisbecker
Reviewed-by: Srikar Dronamraju
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/1561711901-4755-2-git-send-email-wanpen...@tencent.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/isolation.c | 12 ++--
 kernel/sched/sched.h     |  8 +---
 kernel/sched/topology.c  | 20
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ccb28085b114..9fcb2a695a41 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -22,9 +22,17 @@ EXPORT_SYMBOL_GPL(housekeeping_enabled);
 
 int housekeeping_any_cpu(enum hk_flags flags)
 {
-	if (static_branch_unlikely(&housekeeping_overridden))
-		if (housekeeping_flags & flags)
+	int cpu;
+
+	if (static_branch_unlikely(&housekeeping_overridden)) {
+		if (housekeeping_flags & flags) {
+			cpu = sched_numa_find_closest(housekeeping_mask, smp_processor_id());
+			if (cpu < nr_cpu_ids)
+				return cpu;
+
 			return cpumask_any_and(housekeeping_mask, cpu_online_mask);
+		}
+	}
 	return smp_processor_id();
 }
 EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aaca0e743776..16126efd14ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1262,16 +1262,18 @@ enum numa_topology_type {
 extern enum numa_topology_type sched_numa_topology_type;
 extern int sched_max_numa_distance;
 extern bool find_numa_distance(int distance);
-#endif
-
-#ifdef CONFIG_NUMA
 extern void sched_init_numa(void);
 extern void sched_domains_numa_masks_set(unsigned int cpu);
 extern void sched_domains_numa_masks_clear(unsigned int cpu);
+extern int sched_numa_find_closest(const struct cpumask *cpus, int cpu);
 #else
 static inline void sched_init_numa(void) { }
 static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
 static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
+static inline int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
+{
+	return nr_cpu_ids;
+}
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f751ce0b783e..4eea2c9bc732 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1724,6 +1724,26 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
 	}
 }
 
+/*
+ * sched_numa_find_closest() - given the NUMA topology, find the cpu
+ *                             closest to @cpu from @cpumask.
+ * cpumask: cpumask to find a cpu from
+ * cpu: cpu to be close to
+ *
+ * returns: cpu, or nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
+{
+	int i, j = cpu_to_node(cpu);
+
+	for (i = 0; i < sched_domains_numa_levels; i++) {
+		cpu = cpumask_any_and(cpus, sched_domains_numa_masks[i][j]);
+		if (cpu < nr_cpu_ids)
+			return cpu;
+	}
+	return nr_cpu_ids;
+}
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
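As a minimal usage sketch (hypothetical caller, not part of the patch; HK_FLAG_MISC is just one plausible flag), a subsystem that wants to push deferred work off an isolated CPU would simply ask for any housekeeping CPU and, after this change, get a node-local one when possible:

#include <linux/sched/isolation.h>

/* Hypothetical helper: pick a housekeeping CPU for deferred work. */
static int pick_housekeeping_cpu(void)
{
	/* Prefers a CPU in the local NUMA node, falls back to any online one */
	return housekeeping_any_cpu(HK_FLAG_MISC);
}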
[tip:sched/urgent] sched/cputime: Don't use smp_processor_id() in preemptible context
Commit-ID:  0e4097c3354e2f5a5ad8affd9dc7f7f7d00bb6b9
Gitweb:     http://git.kernel.org/tip/0e4097c3354e2f5a5ad8affd9dc7f7f7d00bb6b9
Author:     Wanpeng Li
AuthorDate: Sun, 9 Jul 2017 00:40:28 -0700
Committer:  Ingo Molnar
CommitDate: Fri, 14 Jul 2017 10:27:15 +0200

sched/cputime: Don't use smp_processor_id() in preemptible context

Recent kernels trigger this warning:

 BUG: using smp_processor_id() in preemptible [] code: 99-trinity/181
 caller is debug_smp_processor_id+0x17/0x19
 CPU: 0 PID: 181 Comm: 99-trinity Not tainted 4.12.0-01059-g2a42eb9 #1
 Call Trace:
  dump_stack+0x82/0xb8
  check_preemption_disabled()
  debug_smp_processor_id()
  vtime_delta()
  task_cputime()
  thread_group_cputime()
  thread_group_cputime_adjusted()
  wait_consider_task()
  do_wait()
  SYSC_wait4()
  do_syscall_64()
  entry_SYSCALL64_slow_path()

As Frederic pointed out:

| Although those sched_clock_cpu() things seem to only matter when the
| sched_clock() is unstable. And that stability is a condition for nohz_full
| to work anyway. So probably sched_clock() alone would be enough.

This patch fixes it by replacing sched_clock_cpu() with sched_clock() to
avoid calling smp_processor_id() in a preemptible context.

Reported-by: Xiaolong Ye
Signed-off-by: Wanpeng Li
Cc: Frederic Weisbecker
Cc: Linus Torvalds
Cc: Luiz Capitulino
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1499586028-7402-1-git-send-email-wanpeng...@hotmail.com
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar
---
 kernel/sched/cputime.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6e3ea4a..14d2dbf 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -683,7 +683,7 @@ static u64 vtime_delta(struct vtime *vtime)
 {
 	unsigned long long clock;
 
-	clock = sched_clock_cpu(smp_processor_id());
+	clock = sched_clock();
 	if (clock < vtime->starttime)
 		return 0;
 
@@ -814,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = sched_clock_cpu(smp_processor_id());
+	vtime->starttime = sched_clock();
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -826,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_save(flags);
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = sched_clock_cpu(cpu);
+	vtime->starttime = sched_clock();
 	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
 }
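For context, a minimal sketch of the underlying rule (hypothetical helper, not from the patch): smp_processor_id() is only meaningful while the task cannot migrate, so a per-CPU clock read must either pin the CPU or use a CPU-independent clock:

static u64 snapshot_clock(void)
{
	u64 clock;

	/* Pin the CPU so the id stays valid across the read ... */
	preempt_disable();
	clock = sched_clock_cpu(smp_processor_id());
	preempt_enable();

	return clock;	/* ... or simply call sched_clock(), as the patch does */
}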
[tip:sched/urgent] sched/cputime: Accumulate vtime on top of nsec clocksource
Commit-ID:  2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Gitweb:     http://git.kernel.org/tip/2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Author:     Wanpeng Li
AuthorDate: Thu, 29 Jun 2017 19:15:11 +0200
Committer:  Ingo Molnar
CommitDate: Wed, 5 Jul 2017 09:54:15 +0200

sched/cputime: Accumulate vtime on top of nsec clocksource

Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.

This system works ok if the ticks are not aligned across CPUs. If they
instead are aligned (ie: all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:

    CPU 0 (timekeeper)                  CPU 1
    -------------------              -------------
          jiffies = N
    ...                              run in userspace for a jiffy
    tick entry                       tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                        tick exit (sees jiffies = N + 1)
                                     account 1 jiffy as stime

Fix this with using a nanosec clock source instead of jiffies. The
cputime is then accumulated and flushed everytime the pending delta
reaches a jiffy in order to mitigate the accounting overhead.

[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
  on cputime readers, keep idle vtime as-is (low overhead accounting),
  harmonize clock sources. ]

Suggested-by: Thomas Gleixner
Reported-by: Luiz Capitulino
Tested-by: Luiz Capitulino
Signed-off-by: Wanpeng Li
Signed-off-by: Frederic Weisbecker
Reviewed-by: Thomas Gleixner
Acked-by: Rik van Riel
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Wanpeng Li
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweis...@gmail.com
Signed-off-by: Ingo Molnar
---
 include/linux/sched.h  |  3 +++
 kernel/sched/cputime.c | 64 +-
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eeff8a0..4818126 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -236,6 +236,9 @@ struct vtime {
 	seqcount_t		seqcount;
 	unsigned long long	starttime;
 	enum vtime_state	state;
+	u64			utime;
+	u64			stime;
+	u64			gtime;
 };
 
 struct sched_info {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9ee725e..6e3ea4a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -681,18 +681,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long long clock;
 
-	if (time_before(now, (unsigned long)vtime->starttime))
+	clock = sched_clock_cpu(smp_processor_id());
+	if (clock < vtime->starttime)
 		return 0;
 
-	return jiffies_to_nsecs(now - vtime->starttime);
+	return clock - vtime->starttime;
 }
 
 static u64 get_vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
-	u64 delta, other;
+	u64 delta = vtime_delta(vtime);
+	u64 other;
 
 	/*
 	 * Unlike tick based timing, vtime based timing never has lost
@@ -701,17 +702,31 @@ static u64 get_vtime_delta(struct vtime *vtime)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - vtime->starttime);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
-	vtime->starttime = now;
+	vtime->starttime += delta;
 
 	return delta - other;
 }
 
-static void __vtime_account_system(struct task_struct *tsk)
+static void __vtime_account_system(struct task_struct *tsk,
+				   struct vtime *vtime)
 {
-	account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
+	vtime->stime += get_vtime_delta(vtime);
+	if (vtime->stime >= TICK_NSEC) {
+		account_system_time(tsk, irq_count(), vtime->stime);
+		vtime->stime = 0;
+	}
+}
+
+static void vtime_account_guest(struct task_struct *tsk,
+				struct vtime *vtime)
+{
+	vtime->gtime += get_vtime_delta(vtime);
+	if (vtime->gtime >= TICK_NSEC) {
+		account_guest_time(tsk, vtime->gtime);
+		vtime->gtime = 0;
+	}
 }
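The accumulate-and-flush idea can be sketched in isolation (illustrative userspace C with an assumed 1000 Hz tick; none of these names are kernel code): deltas are gathered in nanoseconds and only handed to the accounting core once a full tick's worth has built up, which keeps the per-context-switch overhead low:

#include <stdint.h>

#define EXAMPLE_TICK_NSEC 1000000ULL	/* assumed: 1000 Hz tick */

struct acct {
	uint64_t pending_ns;	/* not yet accounted */
	uint64_t accounted_ns;	/* flushed to the accounting core */
};

static void acct_add(struct acct *a, uint64_t delta_ns)
{
	a->pending_ns += delta_ns;
	if (a->pending_ns >= EXAMPLE_TICK_NSEC) {	/* flush at tick granularity */
		a->accounted_ns += a->pending_ns;
		a->pending_ns = 0;
	}
}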
[tip:sched/core] sched/core: Fix rq lock pinning warning after call balance callbacks
Commit-ID:  d7921a5ddab8d30e06e321f37eec629f23797486
Gitweb:     http://git.kernel.org/tip/d7921a5ddab8d30e06e321f37eec629f23797486
Author:     Wanpeng Li
AuthorDate: Thu, 16 Mar 2017 19:45:19 -0700
Committer:  Ingo Molnar
CommitDate: Thu, 23 Mar 2017 07:44:51 +0100

sched/core: Fix rq lock pinning warning after call balance callbacks

This can be reproduced by running rt-migrate-test:

 WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock()
 unpinning an unpinned lock
 ...
 Call Trace:
  dump_stack()
  __warn()
  warn_slowpath_fmt()
  lock_unpin_lock()
  __balance_callback()
  __schedule()
  schedule()
  futex_wait_queue_me()
  futex_wait()
  do_futex()
  SyS_futex()
  do_syscall_64()
  entry_SYSCALL64_slow_path()

Revert the rq_lock_irqsave() usage here, the whole point of the
balance_callback() was to allow dropping rq->lock.

Reported-by: Fengguang Wu
Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: 8a8c69c32778 ("sched/core: Add rq->lock wrappers")
Link: http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c762f62..ab9f6ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2776,9 +2776,9 @@ static void __balance_callback(struct rq *rq)
 {
 	struct callback_head *head, *next;
 	void (*func)(struct rq *rq);
-	struct rq_flags rf;
+	unsigned long flags;
 
-	rq_lock_irqsave(rq, &rf);
+	raw_spin_lock_irqsave(&rq->lock, flags);
 	head = rq->balance_callback;
 	rq->balance_callback = NULL;
 	while (head) {
@@ -2789,7 +2789,7 @@ static void __balance_callback(struct rq *rq)
 		func(rq);
 	}
 
-	rq_unlock_irqrestore(rq, &rf);
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 static inline void balance_callback(struct rq *rq)
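To see why the pinning wrappers are wrong here, consider a sketch of a balance callback (hypothetical body, not kernel source): such callbacks are explicitly allowed to drop and retake rq->lock, which is exactly what a lockdep pin forbids:

static void example_balance_callback(struct rq *rq)
{
	/* Balance callbacks may legitimately drop rq->lock ... */
	raw_spin_unlock(&rq->lock);

	/* ... take other runqueue locks, migrate tasks, etc. ... */

	/*
	 * ... and retake it before returning. With the lockdep pin taken
	 * by rq_lock_irqsave() still held, the unlock above would splat.
	 */
	raw_spin_lock(&rq->lock);
}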
[tip:sched/core] sched/deadline: Add missing update_rq_clock() in dl_task_timer()
Commit-ID:  dcc3b5ffe1b32771c9a22e2c916fb94c4fcf5b79
Gitweb:     http://git.kernel.org/tip/dcc3b5ffe1b32771c9a22e2c916fb94c4fcf5b79
Author:     Wanpeng Li
AuthorDate: Mon, 6 Mar 2017 21:51:28 -0800
Committer:  Ingo Molnar
CommitDate: Thu, 16 Mar 2017 09:20:59 +0100

sched/deadline: Add missing update_rq_clock() in dl_task_timer()

The following warning can be triggered by hot-unplugging the CPU on which
an active SCHED_DEADLINE task is running:

 [ cut here ]
 WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x85/0xc4
  __warn+0x172/0x1b0
  warn_slowpath_fmt+0xb4/0xf0
  ? __warn+0x1b0/0x1b0
  ? debug_check_no_locks_freed+0x2c0/0x2c0
  ? cpudl_set+0x3d/0x2b0
  replenish_dl_entity+0x71e/0xc40
  enqueue_task_dl+0x2ea/0x12e0
  ? dl_task_timer+0x777/0x990
  ? __hrtimer_run_queues+0x270/0xa50
  dl_task_timer+0x316/0x990
  ? enqueue_task_dl+0x12e0/0x12e0
  ? enqueue_task_dl+0x12e0/0x12e0
  __hrtimer_run_queues+0x270/0xa50
  ? hrtimer_cancel+0x20/0x20
  ? hrtimer_interrupt+0x119/0x600
  hrtimer_interrupt+0x19c/0x600
  ? trace_hardirqs_off+0xd/0x10
  local_apic_timer_interrupt+0x74/0xe0
  smp_apic_timer_interrupt+0x76/0xa0
  apic_timer_interrupt+0x93/0xa0

The DL task will be migrated to a suitable later-deadline rq once the DL
timer fires and the current rq is offline. The rq clock of the new rq
should be updated. This patch fixes it by updating the rq clock after
taking the new rq's lock.

Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Matt Fleming
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/deadline.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 99b2c33..c6db3fd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -638,6 +638,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		lockdep_unpin_lock(&rq->lock, rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
 		rf.cookie = lockdep_pin_lock(&rq->lock);
+		update_rq_clock(rq);
 
 		/*
 		 * Now that the task has been migrated to the new RQ and we
[tip:sched/urgent] sched/fair: Update rq clock before changing a task's CPU affinity
Commit-ID:  a499c3ead88ccf147fc50689e85a530ad923ce36
Gitweb:     http://git.kernel.org/tip/a499c3ead88ccf147fc50689e85a530ad923ce36
Author:     Wanpeng Li
AuthorDate: Tue, 21 Feb 2017 23:52:55 -0800
Committer:  Ingo Molnar
CommitDate: Fri, 24 Feb 2017 08:58:33 +0100

sched/fair: Update rq clock before changing a task's CPU affinity

This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:

 [ cut here ]
 WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  set_next_entity+0x11d/0x380
  set_curr_task_fair+0x2b/0x60
  do_set_cpus_allowed+0x139/0x180
  __set_cpus_allowed_ptr+0x113/0x260
  set_cpus_allowed_ptr+0x10/0x20
  torture_shuffle+0xfd/0x180
  kthread+0x10f/0x150
  ? torture_shutdown_init+0x60/0x60
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40
 ---[ end trace dd94d92344cea9c6 ]---

The task is running && !queued, so there is no rq clock update before
calling set_curr_task().

This patch fixes it by updating the rq clock after holding
rq->lock/pi_lock, just as the other dequeue + put_prev + enqueue +
set_curr stories do.

Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Matt Fleming
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d6e828..cc1e3e0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1087,6 +1087,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	int ret = 0;
 
 	rq = task_rq_lock(p, &rf);
+	update_rq_clock(rq);
 
 	if (p->flags & PF_KTHREAD) {
 		/*
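Both this fix and the deadline one above follow the same idiom, sketched here with a hypothetical caller (not from the patches): whenever rq->lock is taken and rq->clock will be consumed before the next tick, refresh the clock right after locking:

static void example_touch_rq(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;

	rq = task_rq_lock(p, &rf);
	update_rq_clock(rq);	/* keeps the RQCF_ACT_SKIP assertion happy */

	/* ... dequeue/put_prev/enqueue/set_curr style work ... */

	task_rq_unlock(rq, p, &rf);
}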
[tip:x86/apic] x86/apic: Prevent tracing on apic_msr_write_eoi()
Commit-ID:  8ca225520e278e41396dab0524989f4848626f83
Gitweb:     http://git.kernel.org/tip/8ca225520e278e41396dab0524989f4848626f83
Author:     Wanpeng Li
AuthorDate: Mon, 7 Nov 2016 11:13:40 +0800
Committer:  Thomas Gleixner
CommitDate: Wed, 9 Nov 2016 22:03:14 +0100

x86/apic: Prevent tracing on apic_msr_write_eoi()

The following RCU lockdep warning led to adding irq_enter()/irq_exit()
into smp_reschedule_interrupt():

 RCU used illegally from idle CPU!
 rcu_scheduler_active = 1, debug_locks = 0
 RCU used illegally from extended quiescent state!
 no locks held by swapper/1/0.
 
  do_trace_write_msr
  native_write_msr
  native_apic_msr_eoi_write
  smp_reschedule_interrupt
  reschedule_interrupt

As Peterz pointed out:

| So now we're making a very frequent interrupt slower because of debug
| code.
|
| The thing is, many many smp_reschedule_interrupt() invocations don't
| actually execute anything much at all and are only sent to tickle the
| return to user path (which does the actual preemption).
|
| Having to do the whole irq_enter/irq_exit dance just for this unlikely
| debug case totally blows.

Use the wrmsr_notrace() variant in native_apic_msr_write_eoi, annotate
the kvm variant with notrace and add a native_eoi_write() callback to the
apic structure so KVM guests are covered as well.

This allows reverting the irq_enter()/irq_exit() dance in
smp_reschedule_interrupt().

Suggested-by: Peter Zijlstra
Suggested-by: Paolo Bonzini
Signed-off-by: Wanpeng Li
Acked-by: Paolo Bonzini
Cc: k...@vger.kernel.org
Cc: Mike Galbraith
Cc: Borislav Petkov
Link: http://lkml.kernel.org/r/1478488420-5982-3-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner
---
 arch/x86/include/asm/apic.h | 3 ++-
 arch/x86/kernel/apic/apic.c | 1 +
 arch/x86/kernel/kvm.c       | 4 ++--
 arch/x86/kernel/smp.c       | 2 --
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index f5aaf6c..a5a0bcf 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -196,7 +196,7 @@ static inline void native_apic_msr_write(u32 reg, u32 v)
 
 static inline void native_apic_msr_eoi_write(u32 reg, u32 v)
 {
-	wrmsr(APIC_BASE_MSR + (APIC_EOI >> 4), APIC_EOI_ACK, 0);
+	wrmsr_notrace(APIC_BASE_MSR + (APIC_EOI >> 4), APIC_EOI_ACK, 0);
 }
 
 static inline u32 native_apic_msr_read(u32 reg)
@@ -332,6 +332,7 @@ struct apic {
 	 * on write for EOI.
 	 */
 	void	(*eoi_write)(u32 reg, u32 v);
+	void	(*native_eoi_write)(u32 reg, u32 v);
 	u64	(*icr_read)(void);
 	void	(*icr_write)(u32 low, u32 high);
 	void	(*wait_icr_idle)(void);
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 88c657b..2686894 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2263,6 +2263,7 @@ void __init apic_set_eoi_write(void (*eoi_write)(u32 reg, u32 v))
 	for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
 		/* Should happen once for each apic */
 		WARN_ON((*drv)->eoi_write == eoi_write);
+		(*drv)->native_eoi_write = (*drv)->eoi_write;
 		(*drv)->eoi_write = eoi_write;
 	}
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..aad52f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -308,7 +308,7 @@ static void kvm_register_steal_time(void)
 
 static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
 
-static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
+static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val)
 {
 	/**
 	 * This relies on __test_and_clear_bit to modify the memory
@@ -319,7 +319,7 @@ static void kvm_guest_apic_eoi_write(u32 reg, u32 val)
 	 */
 	if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
 		return;
-	apic_write(APIC_EOI, APIC_EOI_ACK);
+	apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK);
 }
 
 static void kvm_guest_cpu_init(void)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index c00cb64..68f8cc2 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -261,10 +261,8 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
-	irq_enter();
 	ack_APIC_irq();
 	__smp_reschedule_interrupt();
-	irq_exit();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
[tip:x86/apic] x86/msr: Add wrmsr_notrace()
Commit-ID:  b2c5ea4f759190ee9f75687a00035c1a66d0d743
Gitweb:     http://git.kernel.org/tip/b2c5ea4f759190ee9f75687a00035c1a66d0d743
Author:     Wanpeng Li
AuthorDate: Mon, 7 Nov 2016 11:13:39 +0800
Committer:  Thomas Gleixner
CommitDate: Wed, 9 Nov 2016 22:03:14 +0100

x86/msr: Add wrmsr_notrace()

Required to remove the extra irq_enter()/irq_exit() in
smp_reschedule_interrupt().

Suggested-by: Peter Zijlstra
Suggested-by: Paolo Bonzini
Signed-off-by: Wanpeng Li
Acked-by: Paolo Bonzini
Reviewed-by: Borislav Petkov
Cc: k...@vger.kernel.org
Cc: Mike Galbraith
Link: http://lkml.kernel.org/r/1478488420-5982-2-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner
---
 arch/x86/include/asm/msr.h | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index b5fee97..9b0a232 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -115,17 +115,29 @@ static inline unsigned long long native_read_msr_safe(unsigned int msr,
 }
 
 /* Can be uninlined because referenced by paravirt */
-notrace static inline void native_write_msr(unsigned int msr,
+static notrace inline void __native_write_msr_notrace(unsigned int msr,
 					    unsigned low, unsigned high)
 {
 	asm volatile("1: wrmsr\n"
 		     "2:\n"
 		     _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
 		     : : "c" (msr), "a"(low), "d" (high) : "memory");
+}
+
+/* Can be uninlined because referenced by paravirt */
+static notrace inline void native_write_msr(unsigned int msr,
+					    unsigned low, unsigned high)
+{
+	__native_write_msr_notrace(msr, low, high);
 	if (msr_tracepoint_active(__tracepoint_write_msr))
 		do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
 }
 
+static inline void wrmsr_notrace(unsigned msr, unsigned low, unsigned high)
+{
+	__native_write_msr_notrace(msr, low, high);
+}
+
 /* Can be uninlined because referenced by paravirt */
 notrace static inline int native_write_msr_safe(unsigned int msr,
 						unsigned low, unsigned high)
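As a usage sketch (hedged; this mirrors the apic_msr_write_eoi() patch above rather than introducing anything new): any MSR write on a path where the write_msr tracepoint must not fire, for instance while RCU is not watching, goes through the new helper:

static inline void example_eoi_untraced(void)
{
	/* Same payload as native_apic_msr_eoi_write(), minus the tracepoint */
	wrmsr_notrace(APIC_BASE_MSR + (APIC_EOI >> 4), APIC_EOI_ACK, 0);
}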
[tip:x86/urgent] x86/smp: Add irq_enter/exit() in smp_reschedule_interrupt()
Commit-ID:  1ec6ec14a2943f6f611fc1d5fb2d4eaa85bd9d72
Gitweb:     http://git.kernel.org/tip/1ec6ec14a2943f6f611fc1d5fb2d4eaa85bd9d72
Author:     Wanpeng Li
AuthorDate: Fri, 14 Oct 2016 09:48:53 +0800
Committer:  Thomas Gleixner
CommitDate: Fri, 14 Oct 2016 14:14:20 +0200

x86/smp: Add irq_enter/exit() in smp_reschedule_interrupt()

 ===
 [ INFO: suspicious RCU usage. ]
 4.8.0+ #24 Not tainted
 ---
 ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
 
 other info that might help us debug this:
 
 RCU used illegally from idle CPU!
 rcu_scheduler_active = 1, debug_locks = 0
 RCU used illegally from extended quiescent state!
 no locks held by swapper/1/0.
 
  [] do_trace_write_msr+0x135/0x140
  [] native_write_msr+0x20/0x30
  [] native_apic_msr_eoi_write+0x1d/0x30
  [] smp_reschedule_interrupt+0x1d/0x30
  [] reschedule_interrupt+0x96/0xa0

The reschedule interrupt may arrive while the CPU is idle, which triggers
the lockdep warning above.

Add irq_enter()/irq_exit() to smp_reschedule_interrupt(): irq_enter()
tells the RCU subsystem to end the extended quiescent state, so the
following trace call in ack_APIC_irq() works correctly.

Signed-off-by: Wanpeng Li
Cc: Peter Zijlstra
Cc: Mike Galbraith
Link: http://lkml.kernel.org/r/1476409733-5133-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/smp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 658777c..ac2ee87 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -259,8 +259,10 @@ static inline void __smp_reschedule_interrupt(void)
 
 __visible void smp_reschedule_interrupt(struct pt_regs *regs)
 {
+	irq_enter();
 	ack_APIC_irq();
 	__smp_reschedule_interrupt();
+	irq_exit();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
[tip:sched/urgent] sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
Commit-ID:  9cfb38a7ba5a9c27c1af8093fb1af4b699c0a441
Gitweb:     http://git.kernel.org/tip/9cfb38a7ba5a9c27c1af8093fb1af4b699c0a441
Author:     Wanpeng Li
AuthorDate: Sun, 9 Oct 2016 08:04:03 +0800
Committer:  Ingo Molnar
CommitDate: Tue, 11 Oct 2016 10:40:06 +0200

sched/fair: Fix sched domains NULL dereference in select_idle_sibling()

Commit:

  10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")

... improved select_idle_sibling(), but also triggered a regression
(crash) during CPU-hotplug:

 BUG: unable to handle kernel NULL pointer dereference at 0078
 IP: [] select_idle_sibling+0x1c2/0x4f0
 Call Trace:
  select_task_rq_fair+0x749/0x930
  ? select_task_rq_fair+0xb4/0x930
  ? __lock_is_held+0x54/0x70
  try_to_wake_up+0x19a/0x5b0
  default_wake_function+0x12/0x20
  autoremove_wake_function+0x12/0x40
  __wake_up_common+0x55/0x90
  __wake_up+0x39/0x50
  wake_up_klogd_work_func+0x40/0x60
  irq_work_run_list+0x57/0x80
  irq_work_run+0x2c/0x30
  smp_irq_work_interrupt+0x2e/0x40
  irq_work_interrupt+0x96/0xa0
  ? _raw_spin_unlock_irqrestore+0x45/0x80
  try_to_wake_up+0x4a/0x5b0
  wake_up_state+0x10/0x20
  __kthread_unpark+0x67/0x70
  kthread_unpark+0x22/0x30
  cpuhp_online_idle+0x3e/0x70
  cpu_startup_entry+0x6a/0x450
  start_secondary+0x154/0x180

This can be reproduced by running the ftrace test case of kselftest:
the test case hot-unplugs the CPU, and the CPU attaches to the NULL
sched domain during scheduler teardown.

Step 2 of the select_idle_siblings() rewrite says:

| Step 2) tracks the average cost of the scan and compares this to the
| average idle time guestimate for the CPU doing the wakeup.

If the CPU doing the wakeup is the CPU being hot-unplugged, a NULL sched
domain will be dereferenced when reading the average cost of the scan.

This patch fixes it by failing the search for an idle CPU in the LLC
domain if that sched domain is NULL.

Tested-by: Catalin Marinas
Signed-off-by: Wanpeng Li
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 502e95a..8b03fb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5471,13 +5471,18 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-	u64 avg_idle = this_rq()->avg_idle;
-	u64 avg_cost = this_sd->avg_scan_cost;
+	struct sched_domain *this_sd;
+	u64 avg_cost, avg_idle = this_rq()->avg_idle;
 	u64 time, cost;
 	s64 delta;
 	int cpu, wrap;
 
+	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	if (!this_sd)
+		return -1;
+
+	avg_cost = this_sd->avg_scan_cost;
+
 	/*
 	 * Due to large variance we need a large fuzz factor; hackbench in
 	 * particularly is sensitive here.
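The shape of the fix is a general idiom, shown here as an illustrative wrapper (not the patch itself): anything published via RCU and torn down on CPU hotplug must be NULL-checked after the dereference, because readers can race with the teardown:

static s64 example_llc_scan_cost(void)
{
	struct sched_domain *sd;

	sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
	if (!sd)
		return -1;	/* domain detached by concurrent hot-unplug */

	return sd->avg_scan_cost;
}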
[tip:x86/urgent] x86/entry/64: Fix context tracking state warning when load_gs_index fails
Commit-ID:  2fa5f04f85730d0c4f49f984b7efeb4f8d5bd1fc
Gitweb:     http://git.kernel.org/tip/2fa5f04f85730d0c4f49f984b7efeb4f8d5bd1fc
Author:     Wanpeng Li
AuthorDate: Fri, 30 Sep 2016 09:01:06 +0800
Committer:  Ingo Molnar
CommitDate: Fri, 30 Sep 2016 13:53:12 +0200

x86/entry/64: Fix context tracking state warning when load_gs_index fails

This warning:

 WARNING: CPU: 0 PID: 3331 at arch/x86/entry/common.c:45 enter_from_user_mode+0x32/0x50
 CPU: 0 PID: 3331 Comm: ldt_gdt_64 Not tainted 4.8.0-rc7+ #13
 Call Trace:
  dump_stack+0x99/0xd0
  __warn+0xd1/0xf0
  warn_slowpath_null+0x1d/0x20
  enter_from_user_mode+0x32/0x50
  error_entry+0x6d/0xc0
  ? general_protection+0x12/0x30
  ? native_load_gs_index+0xd/0x20
  ? do_set_thread_area+0x19c/0x1f0
  SyS_set_thread_area+0x24/0x30
  do_int80_syscall_32+0x7c/0x220
  entry_INT80_compat+0x38/0x50

... can be reproduced by running the GS testcase of the ldt_gdt test unit
in the x86 selftests.

do_int80_syscall_32() will call enter_from_user_mode() to convert the
context tracking state from user state to kernel state. The
load_gs_index() call can fail with user gsbase; if that happens, gsbase
is fixed up and execution proceeds. However, enter_from_user_mode() will
be called again on the fixed-up path even though the context tracking
state is already the kernel state.

This patch fixes it by just fixing up gsbase and telling lockdep that
IRQs are off once load_gs_index() failed with user gsbase.

Signed-off-by: Wanpeng Li
Acked-by: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Josh Poimboeuf
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1475197266-3440-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 arch/x86/entry/entry_64.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d172c61..02fff3e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1002,7 +1002,6 @@ ENTRY(error_entry)
 	testb	$3, CS+8(%rsp)
 	jz	.Lerror_kernelspace
 
-.Lerror_entry_from_usermode_swapgs:
 	/*
 	 * We entered from user mode or we're pretending to have entered
 	 * from user mode due to an IRET fault.
@@ -1045,7 +1044,8 @@ ENTRY(error_entry)
 	 * gsbase and proceed. We'll fix up the exception and land in
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
-	jmp	.Lerror_entry_from_usermode_swapgs
+	SWAPGS
+	jmp	.Lerror_entry_done
 
 .Lbstep_iret:
 	/* Fix truncated RIP */
[tip:x86/apic] x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()
Commit-ID:  b0f48706a176b71a6e54f399d7404bbeeaa7cfab
Gitweb:     http://git.kernel.org/tip/b0f48706a176b71a6e54f399d7404bbeeaa7cfab
Author:     Wanpeng Li
AuthorDate: Sun, 18 Sep 2016 19:34:51 +0800
Committer:  Thomas Gleixner
CommitDate: Tue, 20 Sep 2016 00:31:19 +0200

x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()

 ===
 [ INFO: suspicious RCU usage. ]
 4.8.0-rc6+ #5 Not tainted
 ---
 ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
 
 other info that might help us debug this:
 
 RCU used illegally from idle CPU!
 rcu_scheduler_active = 1, debug_locks = 0
 RCU used illegally from extended quiescent state!
 no locks held by swapper/2/0.
 
 stack backtrace:
 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0-rc6+ #5
 Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
  8d1bd6003f10 94446949 8d1bd4a68000 0001
  8d1bd6003f40 940e9247 8d1bbdfcf3d0 080b
  8d1bd6003f70
 Call Trace:
  [] dump_stack+0x99/0xd0
  [] lockdep_rcu_suspicious+0xe7/0x120
  [] do_trace_write_msr+0x135/0x140
  [] native_write_msr+0x20/0x30
  [] native_apic_msr_eoi_write+0x1d/0x30
  [] smp_trace_call_function_interrupt+0x1e/0x270
  [] trace_call_function_interrupt+0x96/0xa0
  [] ? cpuidle_enter_state+0xe4/0x360
  [] ? cpuidle_enter_state+0xcf/0x360
  [] cpuidle_enter+0x17/0x20
  [] cpu_startup_entry+0x338/0x4d0
  [] start_secondary+0x154/0x180

This can be reproduced readily by running the ftrace test case of
kselftest.

Move the irq_enter() call before ack_APIC_irq(), because irq_enter()
tells the RCU subsystem to end the extended quiescent state, so that the
following trace call in ack_APIC_irq() works correctly. The same applies
to exiting_ack_irq(), which calls ack_APIC_irq() after irq_exit().

[ tglx: Massaged changelog ]

Signed-off-by: Wanpeng Li
Cc: Peter Zijlstra
Cc: Wanpeng Li
Link: http://lkml.kernel.org/r/1474198491-3738-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner
---
 arch/x86/include/asm/apic.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 1243577..f5aaf6c 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -650,8 +650,8 @@ static inline void entering_ack_irq(void)
 
 static inline void ipi_entering_ack_irq(void)
 {
-	ack_APIC_irq();
 	irq_enter();
+	ack_APIC_irq();
 }
 
 static inline void exiting_irq(void)
@@ -661,9 +661,8 @@ static inline void exiting_irq(void)
 
 static inline void exiting_ack_irq(void)
 {
-	irq_exit();
-	/* Ack only at the end to avoid potential reentry */
 	ack_APIC_irq();
+	irq_exit();
 }
 
 extern void ioapic_zap_locks(void);
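The underlying rule generalizes; a hedged sketch as hypothetical handlers (not kernel source): tracepoints require RCU to be watching, so the quiescent state must end before the first traced operation and resume only after the last one:

static inline void example_entering_ack(void)
{
	irq_enter();	/* RCU is watching from here on */
	ack_APIC_irq();	/* the EOI write may hit the traced MSR path */
}

static inline void example_exiting_ack(void)
{
	ack_APIC_irq();	/* still traced, so ack first ... */
	irq_exit();	/* ... then let RCU stop watching */
}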
[tip:timers/core] tick/nohz: Prevent stopping the tick on an offline CPU
Commit-ID:  57ccdf449f962ab5fc8cbf26479402f13bdb8be7
Gitweb:     http://git.kernel.org/tip/57ccdf449f962ab5fc8cbf26479402f13bdb8be7
Author:     Wanpeng Li
AuthorDate: Wed, 7 Sep 2016 18:51:13 +0800
Committer:  Thomas Gleixner
CommitDate: Tue, 13 Sep 2016 17:53:52 +0200

tick/nohz: Prevent stopping the tick on an offline CPU

can_stop_full_tick() has no check for offline CPUs, so it allows stopping
the tick on an offline CPU from the interrupt return path. This is wrong
and subsequently makes irq_work_needs_cpu() warn about being called for
an offline CPU.

Commit:

  f7ea0fd639c2c4 ("tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline")

added prevention for can_stop_idle_tick(), but forgot to do the same in
can_stop_full_tick(). Add it.

[ tglx: Massaged changelog ]

Signed-off-by: Wanpeng Li
Cc: Peter Zijlstra
Cc: Frederic Weisbecker
Link: http://lkml.kernel.org/r/1473245473-4463-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner
---
 kernel/time/tick-sched.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2ec7c00..3bcb61b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -186,10 +186,13 @@ static bool check_tick_dependency(atomic_t *dep)
 	return false;
 }
 
-static bool can_stop_full_tick(struct tick_sched *ts)
+static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
 {
 	WARN_ON_ONCE(!irqs_disabled());
 
+	if (unlikely(!cpu_online(cpu)))
+		return false;
+
 	if (check_tick_dependency(&tick_dep_mask))
 		return false;
 
@@ -843,7 +846,7 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (can_stop_full_tick(ts))
+	if (can_stop_full_tick(cpu, ts))
 		tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
 	else if (ts->tick_stopped)
 		tick_nohz_restart_sched_tick(ts, ktime_get());
[tip:sched/core] sched/deadline: Fix the intention to re-evaluate tick dependency for offline CPU
Commit-ID:  61c7aca695b6fabe85d0fc424fe8ae2f66f267dd
Gitweb:     http://git.kernel.org/tip/61c7aca695b6fabe85d0fc424fe8ae2f66f267dd
Author:     Wanpeng Li
AuthorDate: Wed, 31 Aug 2016 18:27:44 +0800
Committer:  Ingo Molnar
CommitDate: Mon, 5 Sep 2016 13:29:45 +0200

sched/deadline: Fix the intention to re-evaluate tick dependency for offline CPU

The dl task will be replenished after the dl task timer fires and a new
period starts. It will then be enqueued and re-evaluate its dependency on
the tick in order to restart it. However, if the CPU is hot-unplugged,
irq_work_queue() will splat since the target CPU is offline. As a result
we get:

 WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0
 Call Trace:
  dump_stack+0x99/0xd0
  __warn+0xd1/0xf0
  warn_slowpath_null+0x1d/0x20
  irq_work_queue_on+0xad/0xe0
  tick_nohz_full_kick_cpu+0x44/0x50
  tick_nohz_dep_set_cpu+0x74/0xb0
  enqueue_task_dl+0x226/0x480
  activate_task+0x5c/0xa0
  dl_task_timer+0x19b/0x2c0
  ? push_dl_task.part.31+0x190/0x190

This can be triggered by hot-unplugging the full dynticks CPU which the
dl task is running on.

We enqueue the dl task on the offline CPU because we need to do the
replenish for start_dl_timer(). So, as Juri pointed out, what we would
need to do is call replenish_dl_entity() directly, instead of
enqueue_task_dl(). pi_se shouldn't be a problem, as the task shouldn't
be boosted if it was throttled.

This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story,
by first migrating (set_task_cpu()) and then doing one enqueue.

Suggested-by: Peter Zijlstra
Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Cc: Frederic Weisbecker
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Luca Abeni
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/deadline.c | 46 ++
 1 file changed, 18 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 18fb0b8..0c75bc6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -243,10 +243,8 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq);
 static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p)
 {
 	struct rq *later_rq = NULL;
-	bool fallback = false;
 
 	later_rq = find_lock_later_rq(p, rq);
-
 	if (!later_rq) {
 		int cpu;
 
@@ -254,7 +252,6 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 		 * If we cannot preempt any rq, fall back to pick any
 		 * online cpu.
 		 */
-		fallback = true;
 		cpu = cpumask_any_and(cpu_active_mask, tsk_cpus_allowed(p));
 		if (cpu >= nr_cpu_ids) {
 			/*
@@ -274,16 +271,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 		double_lock_balance(rq, later_rq);
 	}
 
-	/*
-	 * By now the task is replenished and enqueued; migrate it.
-	 */
-	deactivate_task(rq, p, 0);
 	set_task_cpu(p, later_rq->cpu);
-	activate_task(later_rq, p, 0);
-
-	if (!fallback)
-		resched_curr(later_rq);
-
 	double_unlock_balance(later_rq, rq);
 
 	return later_rq;
@@ -641,29 +629,31 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		goto unlock;
 	}
 
-	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
-	if (dl_task(rq->curr))
-		check_preempt_curr_dl(rq, p, 0);
-	else
-		resched_curr(rq);
-
 #ifdef CONFIG_SMP
-	/*
-	 * Perform balancing operations here; after the replenishments.  We
-	 * cannot drop rq->lock before this, otherwise the assertion in
-	 * start_dl_timer() about not missing updates is not true.
-	 *
-	 * If we find that the rq the task was on is no longer available, we
-	 * need to select a new rq.
-	 *
-	 * XXX figure out if select_task_rq_dl() deals with offline cpus.
-	 */
 	if (unlikely(!rq->online)) {
+		/*
+		 * If the runqueue is no longer available, migrate the
+		 * task elsewhere. This necessarily changes rq.
+		 */
 		lockdep_unpin_lock(&rq->lock, rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
 		rf.cookie = lockdep_pin_lock(&rq->lock);
+
+		/*
+		 * Now that the task has been migrated to the new RQ and we
+		 * have that locked, proceed as normal and enqueue the task
+		 * there.
+		 */
 	}
+#endif
+
+	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+	if (dl_task(rq->curr))
+		check_preempt_curr_dl(rq, p, 0);
+	else
+		resched_curr(rq);
[tip:timers/urgent] tick/nohz: Fix softlockup on scheduler stalls in kvm guest
Commit-ID:  08d072599234c959b0b82b63fa252c129225a899
Gitweb:     http://git.kernel.org/tip/08d072599234c959b0b82b63fa252c129225a899
Author:     Wanpeng Li
AuthorDate: Fri, 2 Sep 2016 14:38:23 +0800
Committer:  Thomas Gleixner
CommitDate: Fri, 2 Sep 2016 10:25:40 +0200

tick/nohz: Fix softlockup on scheduler stalls in kvm guest

tick_nohz_start_idle() is prevented from being called if the idle tick
can't be stopped since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz
idle enter"). As a result, after suspend/resume of the host machine, a
full dynticks kvm guest will softlockup:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
 Call Trace:
  default_idle+0x31/0x1a0
  arch_cpu_idle+0xf/0x20
  default_idle_call+0x2a/0x50
  cpu_startup_entry+0x39b/0x4d0
  rest_init+0x138/0x140
  ? rest_init+0x5/0x140
  start_kernel+0x4c1/0x4ce
  ? set_init_arg+0x55/0x55
  ? early_idt_handler_array+0x120/0x120
  x86_64_start_reservations+0x24/0x26
  x86_64_start_kernel+0x142/0x14f

In addition, cat /proc/stat | grep cpu in the guest or host shows:

 cpu  398 16 5049 15754 5490 0 1 46 0 0
 cpu0 206 5 450 0 0 0 1 14 0 0
 cpu1 81 0 3937 3149 1514 0 0 9 0 0
 cpu2 45 6 332 6052 2243 0 0 11 0 0
 cpu3 65 2 328 6552 1732 0 0 11 0 0

The idle and iowait states are a weird 0 for cpu0 (housekeeping). The
bug is present in both guest and host kernels, and they both have cpu0's
idle and iowait states issue; however, the host kernel's suspend/resume
path etc. will touch the watchdog and so avoids the softlockup.

- The watchdog will not be touched in the tick_nohz_stop_idle() path (it
  needs to be touched since the scheduler stall is expected) if the
  idle_active flag is not set.

- The idle and iowait states will not be accounted when exiting the idle
  loop (resched or interrupt) if the idle start time and idle_active flag
  are not set.

This patch fixes it by reverting commit 1f3b0f8243cb934, since the fact
that the idle tick can't be stopped doesn't mean the CPU can't be idle.

Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
Signed-off-by: Wanpeng Li
Cc: Sanjeev Yadav
Cc: Gaurav Jindal
Cc: sta...@vger.kernel.org
Cc: k...@vger.kernel.org
Cc: Radim Krčmář
Cc: Peter Zijlstra
Cc: Paolo Bonzini
Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Thomas Gleixner
---
 kernel/time/tick-sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 204fdc8..2ec7c00 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -908,10 +908,11 @@ static void __tick_nohz_idle_enter(struct tick_sched *ts)
 	ktime_t now, expires;
 	int cpu = smp_processor_id();
 
+	now = tick_nohz_start_idle(ts);
+
 	if (can_stop_idle_tick(cpu, ts)) {
 		int was_stopped = ts->tick_stopped;
 
-		now = tick_nohz_start_idle(ts);
 		ts->idle_calls++;
 
 		expires = tick_nohz_stop_sched_tick(ts, now, cpu);
[tip:x86/urgent] x86/apic: Do not init irq remapping if ioapic is disabled
Commit-ID:  2e63ad4bd5dd583871e6602f9d398b9322d358d9
Gitweb:     http://git.kernel.org/tip/2e63ad4bd5dd583871e6602f9d398b9322d358d9
Author:     Wanpeng Li
AuthorDate: Tue, 23 Aug 2016 20:07:19 +0800
Committer:  Thomas Gleixner
CommitDate: Wed, 24 Aug 2016 09:45:40 +0200

x86/apic: Do not init irq remapping if ioapic is disabled

 native_smp_prepare_cpus
   -> default_setup_apic_routing
     -> enable_IR_x2apic
       -> irq_remapping_prepare
         -> intel_prepare_irq_remapping
           -> intel_setup_irq_remapping

So the IR table is set up even if the "noapic" boot parameter is added.
As a result we crash later when the interrupt affinity is set, due to a
half-initialized remapping infrastructure.

Prevent remap initialization when IOAPIC is disabled.

Signed-off-by: Wanpeng Li
Cc: Peter Zijlstra
Cc: Joerg Roedel
Link: http://lkml.kernel.org/r/1471954039-3942-1-git-send-email-wanpeng...@hotmail.com
Cc: sta...@vger.kernel.org
Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/apic/apic.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index cea4fc1..50c95af 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1623,6 +1623,9 @@ void __init enable_IR_x2apic(void)
 	unsigned long flags;
 	int ret, ir_stat;
 
+	if (skip_ioapic_setup)
+		return;
+
 	ir_stat = irq_remapping_prepare();
 	if (ir_stat < 0 && !x2apic_supported())
 		return;
[tip:sched/core] sched/cputime: Resync steal time when guest & host lose sync
Commit-ID:  03cbc732639ddcad15218c4b2046d255851ff1e3
Gitweb:     http://git.kernel.org/tip/03cbc732639ddcad15218c4b2046d255851ff1e3
Author:     Wanpeng Li
AuthorDate: Wed, 17 Aug 2016 10:05:46 +0800
Committer:  Ingo Molnar
CommitDate: Thu, 18 Aug 2016 11:19:48 +0200

sched/cputime: Resync steal time when guest & host lose sync

Commit:

  57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

... fixed a bug but also triggered a regression:

On an i5 laptop, 4 pCPUs, 4 vCPUs for one full dynticks guest, there are
four CPU hog processes (for loops) running in the guest. I hot-unplug the
pCPUs on the host one by one until there is only one left, then observe
CPU utilization via 'top' in the guest; it shows:

  100% st for cpu0 (housekeeping)
   75% st for other CPUs (nohz full mode)

However, w/o this commit it shows the correct 75% for all four CPUs.

When a guest is interrupted for a longer amount of time, missed clock
ticks are not redelivered later. Because of that, we should not limit
the amount of steal time accounted to the amount of time that the
calling functions think has passed.

However, the interval returned by account_other_time() is NOT rounded
down to the nearest jiffy, while the base interval in get_vtime_delta()
it is subtracted from is, so the max cputime limit is required to avoid
underflow.

This patch fixes the regression by limiting the account_other_time()
from get_vtime_delta() to avoid underflow, and lets the other three call
sites (in account_other_time() and steal_account_process_time()) account
however much steal time the host told us elapsed.

Suggested-by: Rik van Riel
Suggested-by: Paolo Bonzini
Signed-off-by: Wanpeng Li
Reviewed-by: Rik van Riel
Cc: Frederic Weisbecker
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Radim Krcmar
Cc: Thomas Gleixner
Cc: k...@vger.kernel.org
Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng...@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar
---
 kernel/sched/cputime.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2ee83b2..a846cf8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -263,6 +263,11 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
+/*
+ * When a guest is interrupted for a longer amount of time, missed clock
+ * ticks are not redelivered later. Due to that, this function may on
+ * occasion account more time than the calling functions think elapsed.
+ */
 static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
@@ -371,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	 * idle, or potentially user or system time. Due to rounding,
 	 * other time can exceed ticks occasionally.
 	 */
-	other = account_other_time(cputime);
+	other = account_other_time(ULONG_MAX);
 	if (other >= cputime)
 		return;
 	cputime -= other;
@@ -486,7 +491,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	}
 
 	cputime = cputime_one_jiffy;
-	steal = steal_account_process_time(cputime);
+	steal = steal_account_process_time(ULONG_MAX);
 
 	if (steal >= cputime)
 		return;
@@ -516,7 +521,7 @@ void account_idle_ticks(unsigned long ticks)
 	}
 
 	cputime = jiffies_to_cputime(ticks);
-	steal = steal_account_process_time(cputime);
+	steal = steal_account_process_time(ULONG_MAX);
 
 	if (steal >= cputime)
 		return;
@@ -699,6 +704,13 @@ static cputime_t get_vtime_delta(struct task_struct *tsk)
 	unsigned long now = READ_ONCE(jiffies);
 	cputime_t delta, other;
 
+	/*
+	 * Unlike tick based timing, vtime based timing never has lost
+	 * ticks, and no need for steal time accounting to make up for
+	 * lost ticks. Vtime accounts a rounded version of actual
+	 * elapsed time. Limit account_other_time to prevent rounding
+	 * errors from causing elapsed vtime to go negative.
+	 */
 	delta = jiffies_to_cputime(now - tsk->vtime_snap);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
[tip:sched/urgent] sched/cputime: Fix steal time accounting
Commit-ID:  f9bcf1e0e0145323ba2cf72ecad5264ff3883eb1
Gitweb:     http://git.kernel.org/tip/f9bcf1e0e0145323ba2cf72ecad5264ff3883eb1
Author:     Wanpeng Li
AuthorDate: Thu, 11 Aug 2016 13:36:35 +0800
Committer:  Ingo Molnar
CommitDate: Thu, 11 Aug 2016 11:02:14 +0200

sched/cputime: Fix steal time accounting

Commit:

  57430218317 ("sched/cputime: Count actually elapsed irq & softirq time")

... didn't take steal time into consideration when the noirqtime kernel
parameter is passed.

As Paolo pointed out before:

| Why not? If idle=poll, for example, any time the guest is suspended (and
| thus cannot poll) does count as stolen time.

This patch fixes it by subtracting steal time from idle time accounting
when the noirqtime parameter is true. The average idle time drops from
56.8% to 54.75% for a nohz idle kvm guest (noirqtime, idle=poll, four
vCPUs running on one pCPU).

Signed-off-by: Wanpeng Li
Cc: Frederic Weisbecker
Cc: Linus Torvalds
Cc: Paolo Bonzini
Cc: Peter Zijlstra (Intel)
Cc: Peter Zijlstra
Cc: Radim
Cc: Rik van Riel
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1470893795-3527-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/cputime.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 1934f65..8b9bcc5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -508,13 +508,20 @@ void account_process_tick(struct task_struct *p, int user_tick)
  */
 void account_idle_ticks(unsigned long ticks)
 {
-
+	cputime_t cputime, steal;
 
 	if (sched_clock_irqtime) {
 		irqtime_account_idle_ticks(ticks);
 		return;
 	}
 
-	account_idle_time(jiffies_to_cputime(ticks));
+	cputime = cputime_one_jiffy;
+	steal = steal_account_process_time(cputime);
+
+	if (steal >= cputime)
+		return;
+
+	cputime -= steal;
+	account_idle_time(cputime);
 }
 
 /*
[tip:locking/core] locking/pvqspinlock: Fix double hash race
Commit-ID:  229ce631574761870a2ac938845fadbd07f35caa
Gitweb:     http://git.kernel.org/tip/229ce631574761870a2ac938845fadbd07f35caa
Author:     Wanpeng Li
AuthorDate: Thu, 14 Jul 2016 16:15:56 +0800
Committer:  Ingo Molnar
CommitDate: Wed, 10 Aug 2016 14:13:28 +0200

locking/pvqspinlock: Fix double hash race

When the lock holder vCPU is racing with the queue head:

   CPU 0 (lock holder)              CPU 1 (queue head)
   ===================              ==================
   spin_lock();                     spin_lock();
    pv_kick_node():                  pv_wait_head_or_lock():
                                      if (!lp) {
                                       lp = pv_hash(lock, pn);
                                       xchg(&l->locked, _Q_SLOW_VAL);
                                      }
                                      WRITE_ONCE(pn->state, vcpu_halted);
     cmpxchg(&pn->state,
             vcpu_halted, vcpu_hashed);
     WRITE_ONCE(l->locked, _Q_SLOW_VAL);
     (void)pv_hash(lock, pn);

In this case, the lock holder inserts the queue head's pv_node into the
hash table and sets _Q_SLOW_VAL unnecessarily.

This patch avoids that by restoring/setting the vcpu_hashed state after
failing adaptive lock spinning.

Signed-off-by: Wanpeng Li
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Pan Xinhui
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman Long
Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/locking/qspinlock_paravirt.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 37649e6..8a99abf 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -450,7 +450,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
 				goto gotlock;
 			}
 		}
-		WRITE_ONCE(pn->state, vcpu_halted);
+		WRITE_ONCE(pn->state, vcpu_hashed);
 		qstat_inc(qstat_pv_wait_head, true);
 		qstat_inc(qstat_pv_wait_again, waitcnt);
 		pv_wait(&l->locked, _Q_SLOW_VAL);
[tip:sched/core] sched/deadline: Fix lock pinning warning during CPU hotplug
Commit-ID:  c0c8c9fa210c9a042060435f17e40ba4a76d6d6f
Gitweb:     http://git.kernel.org/tip/c0c8c9fa210c9a042060435f17e40ba4a76d6d6f
Author:     Wanpeng Li
AuthorDate: Thu, 4 Aug 2016 09:42:20 +0800
Committer:  Ingo Molnar
CommitDate: Wed, 10 Aug 2016 14:02:55 +0200

sched/deadline: Fix lock pinning warning during CPU hotplug

The following warning can be triggered by hot-unplugging the CPU on which
an active SCHED_DEADLINE task is running:

 WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
 releasing a pinned lock
 Call Trace:
  dump_stack+0x99/0xd0
  __warn+0xd1/0xf0
  ? dl_task_timer+0x1a1/0x2b0
  warn_slowpath_fmt+0x4f/0x60
  ? sched_clock+0x13/0x20
  lock_release+0x690/0x6a0
  ? enqueue_pushable_dl_task+0x9b/0xa0
  ? enqueue_task_dl+0x1ca/0x480
  _raw_spin_unlock+0x1f/0x40
  dl_task_timer+0x1a1/0x2b0
  ? push_dl_task.part.31+0x190/0x190

 WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
 unpinning an unpinned lock
 Call Trace:
  dump_stack+0x99/0xd0
  __warn+0xd1/0xf0
  warn_slowpath_fmt+0x4f/0x60
  lock_unpin_lock+0x181/0x1a0
  dl_task_timer+0x127/0x2b0
  ? push_dl_task.part.31+0x190/0x190

As per the comment before this code, it is safe to drop the RQ lock here,
and since we (potentially) change rq, unpin and repin to avoid the splat.

Signed-off-by: Wanpeng Li
[ Rewrote changelog. ]
Signed-off-by: Peter Zijlstra (Intel)
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Luca Abeni
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/deadline.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fcb7f02..1ce8867 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -658,8 +658,11 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	 *
 	 * XXX figure out if select_task_rq_dl() deals with offline cpus.
 	 */
-	if (unlikely(!rq->online))
+	if (unlikely(!rq->online)) {
+		lockdep_unpin_lock(&rq->lock, rf.cookie);
 		rq = dl_task_offline_migration(rq, p);
+		rf.cookie = lockdep_pin_lock(&rq->lock);
+	}
 
 	/*
 	 * Queueing this task back might have overloaded rq, check if we need
[tip:x86/timers] x86/tsc_msr: Fix rdmsr(MSR_PLATFORM_INFO) unsafe warning in KVM guest
Commit-ID:  37c528ee1af7f24eb31f4195b8b7d4f23e6c716d
Gitweb:     http://git.kernel.org/tip/37c528ee1af7f24eb31f4195b8b7d4f23e6c716d
Author:     Wanpeng Li
AuthorDate: Wed, 22 Jun 2016 09:28:28 +0800
Committer:  Ingo Molnar
CommitDate: Mon, 11 Jul 2016 09:20:36 +0200

x86/tsc_msr: Fix rdmsr(MSR_PLATFORM_INFO) unsafe warning in KVM guest

After this commit:

  fc273eeef314 ("x86/tsc_msr: Extend to include Intel Core Architecture")

The following unsafe MSR reading warning triggers:

 WARNING: CPU: 0 PID: 0 at arch/x86/mm/extable.c:50 ex_handler_rdmsr_unsafe+0x6a/0x70
 unchecked MSR access error: RDMSR from 0xce
 Call Trace:
  dump_stack+0x67/0x99
  __warn+0xd1/0xf0
  warn_slowpath_fmt+0x4f/0x60
  ex_handler_rdmsr_unsafe+0x6a/0x70
  fixup_exception+0x39/0x50
  do_general_protection+0x93/0x1b0
  general_protection+0x22/0x30
  ? cpu_khz_from_msr+0xd8/0x1c0
  native_calibrate_cpu+0x30/0x5b0
  tsc_init+0x2b/0x297
  x86_late_time_init+0xf/0x11
  start_kernel+0x398/0x451
  ? set_init_arg+0x55/0x55
  x86_64_start_reservations+0x2f/0x31
  x86_64_start_kernel+0xea/0xed

As Radim pointed out before:

| MSR_PLATFORM_INFO: Intel changes it from family to family and there is
| no obvious overlap or default. If we picked 0 (any other fixed value),
| then the guest would have to know that 0 doesn't mean that
| MSR_PLATFORM_INFO returned 0, but that KVM doesn't emulate this MSR and
| the value cannot be used. This is very similar to handling a #GP in the
| guest, but also has a disadvantage, because KVM cannot say that
| MSR_PLATFORM_INFO is 0. Simple emulation is not possible.

Fix it by using rdmsr_safe(MSR_PLATFORM_INFO) in a KVM guest to not
trigger a #GP; the TSC will then be calibrated by a fallback method:
PIT, HPET etc.

Reported-by: kernel test robot
Signed-off-by: Wanpeng Li
Acked-by: Paolo Bonzini
Cc: Chen Yu
Cc: H. Peter Anvin
Cc: Len Brown
Cc: Linus Torvalds
Cc: Linux PM list
Cc: Peter Zijlstra
Cc: Radim Krčmář
Cc: Rafael J. Wysocki
Cc: Thomas Gleixner
Cc: Zhang Rui
Cc: jacob.jun@intel.com
Cc: k...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: l...@01.org
Link: http://lkml.kernel.org/r/1466558908-3524-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/tsc_msr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index e0c2b30..e6e465e 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -70,7 +70,7 @@ static int match_cpu(u8 family, u8 model)
  */
 unsigned long cpu_khz_from_msr(void)
 {
-	u32 lo, hi, ratio, freq_id, freq;
+	u32 lo, hi, freq_id, freq, ratio = 0;
 	unsigned long res;
 	int cpu_index;
 
@@ -123,8 +123,8 @@ unsigned long cpu_khz_from_msr(void)
 	}
 
 get_ratio:
-	rdmsr(MSR_PLATFORM_INFO, lo, hi);
-	ratio = (lo >> 8) & 0xff;
+	if (!rdmsr_safe(MSR_PLATFORM_INFO, &lo, &hi))
+		ratio = (lo >> 8) & 0xff;
 
 done:
 	/* TSC frequency = maximum resolved freq * maximum resolved bus ratio */
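The probe-then-fallback pattern generalizes; a hedged sketch (hypothetical helper, not the patch): rdmsr_safe() returns non-zero when the RDMSR faulted, so callers can degrade gracefully instead of taking an unchecked #GP:

static unsigned long example_bus_ratio(void)
{
	u32 lo, hi;

	if (rdmsr_safe(MSR_PLATFORM_INFO, &lo, &hi))
		return 0;	/* MSR absent or not emulated: caller falls back */

	return (lo >> 8) & 0xff;
}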
[tip:sched/core] sched/cputime: Add steal time support to full dynticks CPU time accounting
Commit-ID: 807e5b80687c06715d62df51a5473b231e3e8b15 Gitweb: http://git.kernel.org/tip/807e5b80687c06715d62df51a5473b231e3e8b15 Author: Wanpeng Li AuthorDate: Mon, 13 Jun 2016 18:32:46 +0800 Committer: Ingo Molnar CommitDate: Tue, 14 Jun 2016 11:13:16 +0200 sched/cputime: Add steal time support to full dynticks CPU time accounting This patch adds guest steal-time support to full dynticks CPU time accounting. After the following commit: ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity") ... time sampling became jiffy based, even if we do the sampling from the context tracking code, so steal_account_process_tick() can be reused to account how many 'ticks' of stolen time have elapsed since the last accumulation. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paolo Bonzini Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1465813966-3116-4-git-send-email-wanpeng...@hotmail.com Signed-off-by: Ingo Molnar --- kernel/sched/cputime.c | 16 +++++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 75f98c5..3d60e5d 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime) cpustat[CPUTIME_IDLE] += (__force u64) cputime; } -static __always_inline bool steal_account_process_tick(void) +static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies) { #ifdef CONFIG_PARAVIRT if (static_key_false(&paravirt_steal_enabled)) { @@ -272,14 +272,14 @@ static __always_inline bool steal_account_process_tick(void) * time in jiffies. Lets cast the result to jiffies * granularity and account the rest on the next rounds. */ - steal_jiffies = nsecs_to_jiffies(steal); + steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies); this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies); account_steal_time(jiffies_to_cputime(steal_jiffies)); return steal_jiffies; } #endif - return false; + return 0; } /* @@ -346,7 +346,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick, u64 cputime = (__force u64) cputime_one_jiffy; u64 *cpustat = kcpustat_this_cpu->cpustat; - if (steal_account_process_tick()) + if (steal_account_process_tick(ULONG_MAX)) return; cputime *= ticks; @@ -477,7 +477,7 @@ void account_process_tick(struct task_struct *p, int user_tick) return; } - if (steal_account_process_tick()) + if (steal_account_process_tick(ULONG_MAX)) return; if (user_tick) @@ -681,12 +681,14 @@ static cputime_t vtime_delta(struct task_struct *tsk) static cputime_t get_vtime_delta(struct task_struct *tsk) { unsigned long now = READ_ONCE(jiffies); - unsigned long delta = now - tsk->vtime_snap; + unsigned long delta_jiffies, steal_jiffies; + delta_jiffies = now - tsk->vtime_snap; + steal_jiffies = steal_account_process_tick(delta_jiffies); WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); tsk->vtime_snap = now; - return jiffies_to_cputime(delta); + return jiffies_to_cputime(delta_jiffies - steal_jiffies); } static void __vtime_account_system(struct task_struct *tsk)
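The clamp to max_jiffies matters because in the vtime path above the stolen ticks are subtracted from a bounded delta, so steal must never exceed that delta. A user-space model of the arithmetic with made-up constants (HZ=250 is assumed purely for illustration):

#include <stdio.h>

#define NSEC_PER_JIFFY 4000000UL   /* assuming HZ=250, illustrative only */

static unsigned long nsecs_to_jiffies(unsigned long ns)
{
	return ns / NSEC_PER_JIFFY;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned long delta_jiffies = 3;      /* jiffies since vtime_snap */
	unsigned long steal_ns = 20000000;    /* 20ms of steal = 5 jiffies */

	/* without the clamp, steal > delta would make the result wrap */
	unsigned long steal = min_ul(nsecs_to_jiffies(steal_ns), delta_jiffies);

	printf("account %lu jiffies of task time, %lu of steal\n",
	       delta_jiffies - steal, steal);
	return 0;
}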
[tip:sched/core] sched/cputime: Fix prev steal time accounting during CPU hotplug
Commit-ID: 3d89e5478bf550a50c99e93adf659369798263b0 Gitweb: http://git.kernel.org/tip/3d89e5478bf550a50c99e93adf659369798263b0 Author: Wanpeng Li AuthorDate: Mon, 13 Jun 2016 18:32:45 +0800 Committer: Ingo Molnar CommitDate: Tue, 14 Jun 2016 11:13:15 +0200 sched/cputime: Fix prev steal time accounting during CPU hotplug Commit: e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug") ... set rq->prev_* to 0 after a CPU hotplug comes back, in order to fix the case where (after CPU hotplug) steal time is smaller than rq->prev_steal_time. However, this should never happen. Steal time was only smaller because of the KVM-specific bug fixed by the previous patch. Worse, commit e9532e69b8d1 itself triggers a bug on CPU hot-unplug/plug operations: because rq->prev_steal_time is cleared, all of the CPU's past steal time will be accounted again on hot-plug. Since the root cause has been fixed, we can just revert commit e9532e69b8d1. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paolo Bonzini Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: Thomas Gleixner Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")' Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng...@hotmail.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 1 - kernel/sched/sched.h | 13 ------------- 2 files changed, 14 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 13d0896..c1b537b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7227,7 +7227,6 @@ static void sched_rq_cpu_starting(unsigned int cpu) struct rq *rq = cpu_rq(cpu); rq->calc_load_update = calc_load_update; - account_reset_rq(rq); update_max_interval(); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 72f1f30..de607e4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1809,16 +1809,3 @@ static inline void cpufreq_trigger_update(u64 time) {} #else /* arch_scale_freq_capacity */ #define arch_scale_freq_invariant()(false) #endif - -static inline void account_reset_rq(struct rq *rq) -{ -#ifdef CONFIG_IRQ_TIME_ACCOUNTING - rq->prev_irq_time = 0; -#endif -#ifdef CONFIG_PARAVIRT - rq->prev_steal_time = 0; -#endif -#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING - rq->prev_steal_time_rq = 0; -#endif -}
[tip:sched/core] KVM: Fix steal clock warp during guest CPU hotplug
Commit-ID: 2348140d58f4f4245e9635ea8f1a77e940a4d877 Gitweb: http://git.kernel.org/tip/2348140d58f4f4245e9635ea8f1a77e940a4d877 Author: Wanpeng Li AuthorDate: Mon, 13 Jun 2016 18:32:44 +0800 Committer: Ingo Molnar CommitDate: Tue, 14 Jun 2016 11:13:14 +0200 KVM: Fix steal clock warp during guest CPU hotplug Sometimes, after CPU hotplug you can observe a spike in stolen time (100%) followed by the CPU being marked as 100% idle when it's actually busy with a CPU hog task. The trace looks like the following: cpuhp/1-12 [001] d.h1 167.461657: account_process_tick: steal = 1291385514, prev_steal_time = 0 cpuhp/1-12 [001] d.h1 167.461659: account_process_tick: steal_jiffies = 1291 <idle>-0 [001] d.h1 167.462663: account_process_tick: steal = 18732255, prev_steal_time = 129100 <idle>-0 [001] d.h1 167.462664: account_process_tick: steal_jiffies = 18446744072437 The sudden decrease of "steal" causes steal_jiffies to underflow. The root cause is kvm_steal_time being reset to 0 after hot-plugging back in a CPU. Instead, the preexisting value can be used, which is what the core scheduler code expects. John Stultz also reported a similar issue after guest S3. Suggested-by: Paolo Bonzini Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Acked-by: Paolo Bonzini Cc: Frederic Weisbecker Cc: John Stultz Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Radim Krčmář Cc: Rik van Riel Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1465813966-3116-2-git-send-email-wanpeng...@hotmail.com Signed-off-by: Ingo Molnar --- arch/x86/kernel/kvm.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index eea2a6f..1ef5e48 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -301,8 +301,6 @@ static void kvm_register_steal_time(void) if (!has_steal_clock) return; - memset(st, 0, sizeof(*st)); - wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED)); pr_info("kvm-stealtime: cpu %d, msr %llx\n", cpu, (unsigned long long) slow_virt_to_phys(st));
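The bogus 18446744072437 value in the trace is ordinary unsigned wrap-around: once the hypervisor's steal counter is reset below rq->prev_steal_time, the subtraction underflows. It can be reproduced in a few lines of user-space C with the numbers from the trace:

#include <stdio.h>

int main(void)
{
	/* values taken from the trace above */
	unsigned long long prev_steal_time = 1291385514;
	unsigned long long steal = 18732255;   /* reset to ~0 after hotplug */

	/* steal - prev wraps modulo 2^64 when steal < prev */
	unsigned long long delta = steal - prev_steal_time;

	printf("delta = %llu\n", delta);       /* ~1.8e19, the huge value */
	return 0;
}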
[tip:sched/core] sched/nohz: Fix affine unpinned timers mess
Commit-ID: 444969223c81c7d0a95136b7b4cfdcfbc96ac5bd Gitweb: http://git.kernel.org/tip/444969223c81c7d0a95136b7b4cfdcfbc96ac5bd Author: Wanpeng Li AuthorDate: Wed, 4 May 2016 14:45:34 +0800 Committer: Ingo Molnar CommitDate: Thu, 12 May 2016 09:55:32 +0200 sched/nohz: Fix affine unpinned timers mess The following commit: 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers") intended to affine unpinned timers to housekeepers: unpinned timers (full dynticks, idle) => nearest busy housekeepers (otherwise, fallback to any housekeepers) unpinned timers (full dynticks, busy) => nearest busy housekeepers (otherwise, fallback to any housekeepers) unpinned timers (housekeepers, idle) => nearest busy housekeepers (otherwise, fallback to itself) However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the intention to: unpinned timers (full dynticks, idle) => any housekeepers (no matter the cpu topology) unpinned timers (full dynticks, busy) => any housekeepers (no matter the cpu topology) unpinned timers (housekeepers, idle) => any busy cpus (otherwise, fallback to any housekeepers) This patch fixes it by checking if there are busy housekeepers nearby; otherwise it falls back to any housekeepers/itself. After the patch: unpinned timers (full dynticks, idle) => nearest busy housekeepers (otherwise, fallback to any housekeepers) unpinned timers (full dynticks, busy) => nearest busy housekeepers (otherwise, fallback to any housekeepers) unpinned timers (housekeepers, idle) => nearest busy housekeepers (otherwise, fallback to itself) Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) [ Fixed the changelog. ] Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Fixes: 'commit 9642d18eee2c ("nohz: Affine unpinned timers to housekeepers")' Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng...@hotmail.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 636c4b9..6f6962a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -533,7 +533,10 @@ int get_nohz_timer_target(void) rcu_read_lock(); for_each_domain(cpu, sd) { for_each_cpu(i, sched_domain_span(sd)) { - if (!idle_cpu(i) && is_housekeeping_cpu(cpu)) { + if (cpu == i) + continue; + + if (!idle_cpu(i) && is_housekeeping_cpu(i)) { cpu = i; goto unlock; }
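The fixed selection logic can be modeled with nested loops over progressively wider CPU spans: skip the current CPU, take the first busy housekeeping CPU in the nearest span, and fall back only after all spans are exhausted. A sketch with plain arrays standing in for sched domains (the data layout is invented for illustration):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

static bool idle[NR_CPUS] = { true, false, true, true,
			      false, true, true, true };
static bool housekeeping[NR_CPUS] = { true, true, true, true,
				      false, false, false, false };

/* "domains": widening spans around CPU 0, innermost first, -1 terminated */
static int span0[] = { 0, 1, -1 };
static int span1[] = { 0, 1, 2, 3, -1 };
static int span2[] = { 0, 1, 2, 3, 4, 5, 6, 7, -1 };
static int *domains[] = { span0, span1, span2 };

int main(void)
{
	int cpu = 0;   /* idle housekeeper looking for a timer target */

	for (int d = 0; d < 3; d++) {
		for (int *i = domains[d]; *i >= 0; i++) {
			if (*i == cpu)
				continue;   /* the fix: never pick ourselves */
			/* also the fix: test candidate *i, not 'cpu' */
			if (!idle[*i] && housekeeping[*i]) {
				printf("timer goes to CPU %d\n", *i);
				return 0;
			}
		}
	}
	printf("fallback: stay on CPU %d\n", cpu);
	return 0;
}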
[tip:sched/core] sched/debug: Print out idle balance values even on !CONFIG_SCHEDSTATS kernels
Commit-ID: db6ea2fb094fb3a6afc36d3e4229bc162638ad24 Gitweb: http://git.kernel.org/tip/db6ea2fb094fb3a6afc36d3e4229bc162638ad24 Author: Wanpeng Li AuthorDate: Tue, 3 May 2016 12:38:25 +0800 Committer: Ingo Molnar CommitDate: Thu, 5 May 2016 09:41:09 +0200 sched/debug: Print out idle balance values even on !CONFIG_SCHEDSTATS kernels The max_idle_balance_cost and avg_idle values, which are tracked and used to capture short idle incidents, are not associated with schedstats; however, these two values aren't printed out on !CONFIG_SCHEDSTATS kernels. Fix this by moving the value printout out of the CONFIG_SCHEDSTATS section. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1462250305-4523-1-git-send-email-wanpeng...@hotmail.com Signed-off-by: Ingo Molnar --- kernel/sched/debug.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 4fbc3bd..cf905f6 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -626,15 +626,16 @@ do { \ #undef P #undef PN -#ifdef CONFIG_SCHEDSTATS -#define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n); -#define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n); - #ifdef CONFIG_SMP +#define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n); P64(avg_idle); P64(max_idle_balance_cost); +#undef P64 #endif +#ifdef CONFIG_SCHEDSTATS +#define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n); + if (schedstat_enabled()) { P(yld_count); P(sched_count); P(sched_goidle); } #undef P -#undef P64 #endif spin_lock_irqsave(&sched_debug_lock, flags); print_cfs_stats(m, cpu);
[tip:sched/core] sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times
Commit-ID: 594dd290cf5403a9a5818619dfff42d8e8e0518e Gitweb: http://git.kernel.org/tip/594dd290cf5403a9a5818619dfff42d8e8e0518e Author: Wanpeng Li AuthorDate: Fri, 22 Apr 2016 17:07:24 +0800 Committer: Ingo Molnar CommitDate: Thu, 28 Apr 2016 10:39:54 +0200 sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times Sometimes delta_exec is 0 because update_curr() is called multiple times; this is captured by: u64 delta_exec = rq_clock_task(rq) - curr->se.exec_start; This patch optimizes the cpufreq update kicker by bailing out when nothing changed; it will benefit the upcoming schedutil, since otherwise it will (over)react to the special util/max combination. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1461316044-9520-1-git-send-email-wanpeng...@hotmail.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 8 ++++---- kernel/sched/rt.c | 8 ++++---- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index affd97e..8f9b5af 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -717,10 +717,6 @@ static void update_curr_dl(struct rq *rq) if (!dl_task(curr) || !on_dl_rq(dl_se)) return; - /* Kick cpufreq (see the comment in linux/cpufreq.h). */ - if (cpu_of(rq) == smp_processor_id()) - cpufreq_trigger_update(rq_clock(rq)); - /* * Consumed budget is computed considering the time as * observed by schedulable tasks (excluding time spent @@ -736,6 +732,10 @@ static void update_curr_dl(struct rq *rq) return; } + /* kick cpufreq (see the comment in linux/cpufreq.h). */ + if (cpu_of(rq) == smp_processor_id()) + cpufreq_trigger_update(rq_clock(rq)); + schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec)); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index c41ea7a..19e1306 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -953,14 +953,14 @@ static void update_curr_rt(struct rq *rq) if (curr->sched_class != &rt_sched_class) return; - /* Kick cpufreq (see the comment in linux/cpufreq.h). */ - if (cpu_of(rq) == smp_processor_id()) - cpufreq_trigger_update(rq_clock(rq)); - delta_exec = rq_clock_task(rq) - curr->se.exec_start; if (unlikely((s64)delta_exec <= 0)) return; + /* Kick cpufreq (see the comment in linux/cpufreq.h). */ + if (cpu_of(rq) == smp_processor_id()) + cpufreq_trigger_update(rq_clock(rq)); + schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec));
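The whole optimization is an ordering change: compute delta_exec first, bail out when no time was actually consumed, and only then kick cpufreq. A toy model of the effect when update_curr() runs several times for the same clock value (names are illustrative):

#include <stdint.h>
#include <stdio.h>

static int kicks;

static void cpufreq_trigger_update_model(void) { kicks++; }

static void update_curr_model(int64_t delta_exec)
{
	if (delta_exec <= 0)
		return;                         /* nothing ran: no kick */
	cpufreq_trigger_update_model();         /* kick only on progress */
}

int main(void)
{
	/* update_curr() invoked three times for one tick's worth of time */
	update_curr_model(1000000);
	update_curr_model(0);
	update_curr_model(0);
	printf("cpufreq kicked %d time(s)\n", kicks);   /* 1, not 3 */
	return 0;
}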
[tip:timers/urgent] tick/nohz: Set the correct expiry when switching to nohz/lowres mode
Commit-ID: 1ca8ec532fc2d986f1f4a319857bb18e0c9739b4 Gitweb: http://git.kernel.org/tip/1ca8ec532fc2d986f1f4a319857bb18e0c9739b4 Author: Wanpeng Li AuthorDate: Wed, 27 Jan 2016 19:26:07 +0800 Committer: Thomas Gleixner CommitDate: Wed, 27 Jan 2016 12:45:57 +0100 tick/nohz: Set the correct expiry when switching to nohz/lowres mode commit 0ff53d096422 sets the next tick interrupt to the last jiffies update, i.e. in the past, because the forward operation is invoked before the set operation. There is no resulting damage (yet), but we get an extra pointless tick interrupt. Revert the order so we get the next tick interrupt in the future. Fixes: commit 0ff53d096422 "tick: sched: Force tick interrupt and get rid of softirq magic" Signed-off-by: Wanpeng Li Cc: Peter Zijlstra Cc: Frederic Weisbecker Cc: sta...@vger.kernel.org Link: http://lkml.kernel.org/r/1453893967-3458-1-git-send-email-wanpeng...@hotmail.com Signed-off-by: Thomas Gleixner --- kernel/time/tick-sched.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index cbe5d8d..de2d9fe 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -995,9 +995,9 @@ static void tick_nohz_switch_to_nohz(void) /* Get the next period */ next = tick_init_jiffy_update(); - hrtimer_forward_now(&ts->sched_timer, tick_period); hrtimer_set_expires(&ts->sched_timer, next); - tick_program_event(next, 1); + hrtimer_forward_now(&ts->sched_timer, tick_period); + tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1); tick_nohz_activate(ts, NOHZ_MODE_LOWRES); }
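The ordering bug is easiest to see with concrete numbers: forwarding first and setting the expiry afterwards leaves the timer programmed at the last jiffies update, which is already in the past. A small model of both orders (tick period and times are arbitrary):

#include <stdio.h>

#define TICK_PERIOD 4   /* "ms", illustrative */

/* advance the expiry in TICK_PERIOD steps until it is in the future */
static int forward_now(int expires, int now)
{
	while (expires <= now)
		expires += TICK_PERIOD;
	return expires;
}

int main(void)
{
	int now = 10, last_jiffies_update = 8;
	int expires;

	/* broken order: forward first, then the set overwrites the result */
	expires = forward_now(last_jiffies_update, now);   /* 12, fine */
	expires = last_jiffies_update;                     /* 8: in the past */
	printf("broken: program event at %d (now=%d)\n", expires, now);

	/* fixed order: set to the last update, then forward past 'now' */
	expires = last_jiffies_update;
	expires = forward_now(expires, now);
	printf("fixed:  program event at %d\n", expires);  /* 12 */
	return 0;
}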
[tip:sched/core] sched/deadline: Fix the earliest_dl.next logic
Commit-ID: 7d92de3a8285ab3dfd68aa3a99823acd5b190444 Gitweb: http://git.kernel.org/tip/7d92de3a8285ab3dfd68aa3a99823acd5b190444 Author: Wanpeng Li AuthorDate: Thu, 3 Dec 2015 17:42:10 +0800 Committer: Ingo Molnar CommitDate: Wed, 6 Jan 2016 11:05:56 +0100 sched/deadline: Fix the earliest_dl.next logic earliest_dl.next should cache the deadline of the earliest ready task that is also enqueued in the pushable rbtree, as the pull algorithm uses this information to find candidates for migration: if the earliest_dl.next deadline of the source rq is earlier than the earliest_dl.curr deadline of the destination rq, the task from the source rq can be pulled. However, the current implementation only guarantees that earliest_dl.next is the deadline of the next ready task instead of the next pushable task, which can result in holding both rqs' locks and finding nothing to migrate because of affinity constraints. In addition, the current logic doesn't update the next candidate for pushing in pick_next_task_dl(), even if the running task is never eligible. This patch fixes both problems by updating earliest_dl.next when a pushable dl task is enqueued/dequeued, similar to what we already do for RT. Tested-by: Luca Abeni Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Acked-by: Juri Lelli Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1449135730-27202-1-git-send-email-wanpeng...@hotmail.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 59 +++++++---------------------------------------------------- 1 file changed, 7 insertions(+), 52 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 8b0a15e..cd64c97 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -176,8 +176,10 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p) } } - if (leftmost) + if (leftmost) { dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks; + dl_rq->earliest_dl.next = p->dl.deadline; + } rb_link_node(&p->pushable_dl_tasks, parent, link); rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root); @@ -195,6 +197,10 @@ static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p) next_node = rb_next(&p->pushable_dl_tasks); dl_rq->pushable_dl_tasks_leftmost = next_node; + if (next_node) { + dl_rq->earliest_dl.next = rb_entry(next_node, + struct task_struct, pushable_dl_tasks)->dl.deadline; + } } rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root); @@ -782,42 +788,14 @@ static void update_curr_dl(struct rq *rq) #ifdef CONFIG_SMP -static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu); - -static inline u64 next_deadline(struct rq *rq) -{ - struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu); - - if (next && dl_prio(next->prio)) - return next->dl.deadline; - else - return 0; -} - static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) { struct rq *rq = rq_of_dl_rq(dl_rq); if (dl_rq->earliest_dl.curr == 0 || dl_time_before(deadline, dl_rq->earliest_dl.curr)) { - /* -* If the dl_rq had no -deadline tasks, or if the new task -* has shorter deadline than the current one on dl_rq, we -* know that the previous earliest becomes our next earliest, -* as the new task becomes the earliest itself.
-*/ - dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr; dl_rq->earliest_dl.curr = deadline; cpudl_set(&rq->rd->cpudl, rq->cpu, deadline, 1); - } else if (dl_rq->earliest_dl.next == 0 || - dl_time_before(deadline, dl_rq->earliest_dl.next)) { - /* -* On the other hand, if the new -deadline task has a -* a later deadline than the earliest one on dl_rq, but -* it is earlier than the next (if any), we must -* recompute the next-earliest. -*/ - dl_rq->earliest_dl.next = next_deadline(rq); } } @@ -839,7 +817,6 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) entry = rb_entry(leftmost, struct sched_dl_entity, rb_node); dl_rq->earliest_dl.curr = entry->deadline; - dl_rq->earliest_dl.next = next_deadline(rq); cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline, 1); } } @@ -1274,28 +1251,6 @@ static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu) return 0; } -/* Returns the second earliest -deadline task, NULL otherwise */ -static struct task_st
[tip:sched/core] sched: 'Annotate' migrate_tasks()
Commit-ID: 5473e0cc37c03c576adbda7591a6cc8e37c1bb7f Gitweb: http://git.kernel.org/tip/5473e0cc37c03c576adbda7591a6cc8e37c1bb7f Author: Wanpeng Li AuthorDate: Fri, 28 Aug 2015 14:55:56 +0800 Committer: Ingo Molnar CommitDate: Fri, 11 Sep 2015 07:57:50 +0200 sched: 'Annotate' migrate_tasks() Kernel testing triggered this warning: | WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80() | Modules linked in: | CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2 | Call Trace: | dump_stack+0x4b/0x75 | warn_slowpath_common+0x8b/0xc0 | warn_slowpath_null+0x22/0x30 | do_set_cpus_allowed+0x7e/0x80 | cpuset_cpus_allowed_fallback+0x7c/0x170 | select_fallback_rq+0x221/0x280 | migration_call+0xe3/0x250 | notifier_call_chain+0x53/0x70 | __raw_notifier_call_chain+0x1e/0x30 | cpu_notify+0x28/0x50 | take_cpu_down+0x22/0x40 | multi_cpu_stop+0xd5/0x140 | cpu_stopper_thread+0xbc/0x170 | smpboot_thread_fn+0x174/0x2f0 | kthread+0xc4/0xe0 | ret_from_kernel_thread+0x21/0x30 As Peterz pointed out: | So the normal rules for changing task_struct::cpus_allowed are holding | both pi_lock and rq->lock, such that holding either stabilizes the mask. | | This is so that wakeup can happen without rq->lock and load-balance | without pi_lock. | | From this we already get the relaxation that we can omit acquiring | rq->lock if the task is not on the rq, because in that case | load-balancing will not apply to it. | | ** these are the rules currently tested in do_set_cpus_allowed() ** | | Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which | unconditionally acquires both locks, we could get away with holding just | rq->lock when on_rq for modification because that'd still exclude | __set_cpus_allowed_ptr(), it would also work against | __kthread_bind_mask() because that assumes !on_rq. | | That said, this is all somewhat fragile. | | Now, I don't think dropping rq->lock is quite as disastrous as it | usually is because !cpu_active at this point, which means load-balance | will not interfere, but that too is somewhat fragile. | | So we end up with a choice of two fragile.. This patch fixes it by following the rules for changing task_struct::cpus_allowed with both pi_lock and rq->lock held. Reported-by: kernel test robot Reported-by: Sasha Levin Signed-off-by: Wanpeng Li [ Modified changelog and patch. ] Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/blu436-smtp1660820490de202e3934ed380...@phx.gbl Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0902e4d..9b78670 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5183,24 +5183,47 @@ static void migrate_tasks(struct rq *dead_rq) break; /* -* Ensure rq->lock covers the entire task selection -* until the migration. +* pick_next_task assumes pinned rq->lock. */ lockdep_pin_lock(&rq->lock); next = pick_next_task(rq, &fake_task); BUG_ON(!next); next->sched_class->put_prev_task(rq, next); + /* +* Rules for changing task_struct::cpus_allowed are holding +* both pi_lock and rq->lock, such that holding either +* stabilizes the mask. +* +* Drop rq->lock is not quite as disastrous as it usually is +* because !cpu_active at this point, which means load-balance +* will not interfere. Also, stop-machine. 
+*/ + lockdep_unpin_lock(&rq->lock); + raw_spin_unlock(&rq->lock); + raw_spin_lock(&next->pi_lock); + raw_spin_lock(&rq->lock); + + /* +* Since we're inside stop-machine, _nothing_ should have +* changed the task, WARN if weird stuff happened, because in +* that case the above rq->lock drop is a fail too. +*/ + if (WARN_ON(task_rq(next) != rq || !task_on_rq_queued(next))) { + raw_spin_unlock(&next->pi_lock); + continue; + } + /* Find suitable destination for @next, with force if needed. */ dest_cpu = select_fallback_rq(dead_rq->cpu, next); - lockdep_unpin_lock(&rq->lock); rq = __migrate_task(rq, next, dest_cpu); if (rq != dead_rq) { raw_spin_unlock(&rq->lock); rq = dead_rq; raw_spin_lock(&rq->lock); } + raw_spin_unlock(&next->pi_lock); } rq->stop = st
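The locking dance in the diff follows a standard kernel pattern: to take a second lock that must be acquired first, drop the lock you hold, take both in the correct order, then revalidate the state that may have changed in the window. A user-space sketch of the pattern with pthread mutexes (a model of the idea, not the scheduler code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;

struct task { int on_rq; };

/* caller holds rq_lock; we also need pi_lock, whose order is pi -> rq */
static int lock_both_and_revalidate(struct task *t)
{
	pthread_mutex_unlock(&rq_lock);   /* drop to respect lock order */
	pthread_mutex_lock(&pi_lock);
	pthread_mutex_lock(&rq_lock);

	/* the unlocked window above is why the state must be re-checked */
	if (!t->on_rq) {
		pthread_mutex_unlock(&pi_lock);
		return 0;                 /* weird stuff happened: skip it */
	}
	return 1;                         /* both locks held, state valid */
}

int main(void)
{
	struct task t = { .on_rq = 1 };

	pthread_mutex_lock(&rq_lock);
	if (lock_both_and_revalidate(&t)) {
		printf("holding pi_lock+rq_lock: safe to migrate the task\n");
		pthread_mutex_unlock(&pi_lock);
	}
	pthread_mutex_unlock(&rq_lock);
	return 0;
}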
[tip:sched/core] sched: Remove superfluous resetting of the p->dl_throttled flag
Commit-ID: 6713c3aa7f63626c0cecf9c509fb48d885b2dd12 Gitweb: http://git.kernel.org/tip/6713c3aa7f63626c0cecf9c509fb48d885b2dd12 Author: Wanpeng Li AuthorDate: Wed, 13 May 2015 14:01:06 +0800 Committer: Ingo Molnar CommitDate: Fri, 19 Jun 2015 10:06:47 +0200 sched: Remove superfluous resetting of the p->dl_throttled flag Resetting the p->dl_throttled flag in rt_mutex_setprio() (for a task that is going to be boosted) is superfluous, as the natural place to do so is in replenish_dl_entity(). If the task was on the runqueue and it is boosted by a DL task, it will be enqueued back with ENQUEUE_REPLENISH flag set, which can guarantee that dl_throttled is reset in replenish_dl_entity(). This patch drops the resetting of throttled status in function rt_mutex_setprio(). Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Borislav Petkov Cc: H. Peter Anvin Cc: Juri Lelli Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1431496867-4194-6-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1428c7c..10338ce 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3099,7 +3099,6 @@ void rt_mutex_setprio(struct task_struct *p, int prio) if (!dl_prio(p->normal_prio) || (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) { p->dl.dl_boosted = 1; - p->dl.dl_throttled = 0; enqueue_flag = ENQUEUE_REPLENISH; } else p->dl.dl_boosted = 0;
[tip:sched/core] sched/deadline: Drop duplicate init_sched_dl_class() declaration
Commit-ID: 178a4d23e4e6a0a90b086dad86697676b49db60a Gitweb: http://git.kernel.org/tip/178a4d23e4e6a0a90b086dad86697676b49db60a Author: Wanpeng Li AuthorDate: Wed, 13 May 2015 14:01:05 +0800 Committer: Ingo Molnar CommitDate: Fri, 19 Jun 2015 10:06:47 +0200 sched/deadline: Drop duplicate init_sched_dl_class() declaration There are two init_sched_dl_class() declarations; this patch drops the duplicate. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Borislav Petkov Cc: H. Peter Anvin Cc: Juri Lelli Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1431496867-4194-5-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/sched.h | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d854555..d62b288 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1290,7 +1290,6 @@ extern void update_max_interval(void); extern void init_sched_dl_class(void); extern void init_sched_rt_class(void); extern void init_sched_fair_class(void); -extern void init_sched_dl_class(void); extern void resched_curr(struct rq *rq); extern void resched_cpu(int cpu);
[tip:sched/core] sched/deadline: Make init_sched_dl_class() __init
Commit-ID: a6c0e746fb8f4ea6508f274314378325a6e1ec9b Gitweb: http://git.kernel.org/tip/a6c0e746fb8f4ea6508f274314378325a6e1ec9b Author: Wanpeng Li AuthorDate: Wed, 13 May 2015 14:01:02 +0800 Committer: Ingo Molnar CommitDate: Fri, 19 Jun 2015 10:06:46 +0200 sched/deadline: Make init_sched_dl_class() __init It's a bootstrap function; make init_sched_dl_class() __init. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Borislav Petkov Cc: H. Peter Anvin Cc: Juri Lelli Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1431496867-4194-2-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 9cbe1c7..1c4bc31 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1685,7 +1685,7 @@ static void rq_offline_dl(struct rq *rq) cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu); } -void init_sched_dl_class(void) +void __init init_sched_dl_class(void) { unsigned int i;
[tip:sched/core] sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
Commit-ID: 9d514262425691dddf942edea8bc9919e66fe140 Gitweb: http://git.kernel.org/tip/9d514262425691dddf942edea8bc9919e66fe140 Author: Wanpeng Li AuthorDate: Wed, 13 May 2015 14:01:03 +0800 Committer: Ingo Molnar CommitDate: Fri, 19 Jun 2015 10:06:46 +0200 sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target This patch adds a check that prevents futile attempts to move DL tasks to a CPU with active tasks of equal or earlier deadline. This mirrors the behavior of commit 80e3d87b2c55 ("sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target") for the rt class. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Borislav Petkov Cc: H. Peter Anvin Cc: Juri Lelli Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1431496867-4194-3-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 1c4bc31..98f7871 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1012,7 +1012,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) (p->nr_cpus_allowed > 1)) { int target = find_later_rq(p); - if (target != -1) + if (target != -1 && + dl_time_before(p->dl.deadline, + cpu_rq(target)->dl.earliest_dl.curr)) cpu = target; } rcu_read_unlock(); @@ -1359,6 +1361,17 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq) later_rq = cpu_rq(cpu); + if (!dl_time_before(task->dl.deadline, + later_rq->dl.earliest_dl.curr)) { + /* +* Target rq has tasks of equal or earlier deadline, +* retrying does not release any lock and is unlikely +* to yield a different result. +*/ + later_rq = NULL; + break; + } + /* Retry if something changed. */ if (double_lock_balance(rq, later_rq)) { if (unlikely(task_rq(task) != rq ||
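Both hunks rely on dl_time_before() to decide feasibility; the kernel defines it as a sign test on the unsigned difference, which stays correct across u64 wrap-around. A self-contained user-space model of that comparison:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* same idea as the kernel's dl_time_before(): wrap-around-safe compare */
static bool dl_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

int main(void)
{
	uint64_t task_dl = 1000, target_earliest = 500;

	if (dl_time_before(task_dl, target_earliest))
		printf("push: task preempts the target rq\n");
	else
		printf("skip: target already runs an earlier deadline\n");

	/* still correct when the deadline clock has wrapped */
	printf("%d\n", dl_time_before(UINT64_MAX - 5, 10));   /* 1: before */
	return 0;
}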
[tip:sched/core] sched/deadline: Optimize pull_dl_task()
Commit-ID: 8b5e770ed7c05a65ffd2d33a83c14572696236dc Gitweb: http://git.kernel.org/tip/8b5e770ed7c05a65ffd2d33a83c14572696236dc Author: Wanpeng Li AuthorDate: Wed, 13 May 2015 14:01:01 +0800 Committer: Ingo Molnar CommitDate: Fri, 19 Jun 2015 10:06:45 +0200 sched/deadline: Optimize pull_dl_task() pull_dl_task() uses pick_next_earliest_dl_task() to select a migration candidate; this is sub-optimal since the next earliest task -- as per the regular runqueue -- might not be migratable at all. This could result in iterating the entire runqueue looking for a task. Instead iterate the pushable queue -- this queue only contains tasks that have at least 2 cpus set in their cpus_allowed mask. Signed-off-by: Wanpeng Li [ Improved the changelog. ] Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Borislav Petkov Cc: H. Peter Anvin Cc: Juri Lelli Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1431496867-4194-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 28 +++++++++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 890ce95..9cbe1c7 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1230,6 +1230,32 @@ next_node: return NULL; } +/* + * Return the earliest pushable rq's task, which is suitable to be executed + * on the CPU, NULL otherwise: + */ +static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu) +{ + struct rb_node *next_node = rq->dl.pushable_dl_tasks_leftmost; + struct task_struct *p = NULL; + + if (!has_pushable_dl_tasks(rq)) + return NULL; + +next_node: + if (next_node) { + p = rb_entry(next_node, struct task_struct, pushable_dl_tasks); + + if (pick_dl_task(rq, p, cpu)) + return p; + + next_node = rb_next(next_node); + goto next_node; + } + + return NULL; +} + static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl); static int find_later_rq(struct task_struct *task) @@ -1514,7 +1540,7 @@ static int pull_dl_task(struct rq *this_rq) if (src_rq->dl.dl_nr_running <= 1) goto skip; - p = pick_next_earliest_dl_task(src_rq, this_cpu); + p = pick_earliest_pushable_dl_task(src_rq, this_cpu); /* * We found a task to be pulled if:
[tip:sched/core] sched/deadline: Support DL task migration during CPU hotplug
Commit-ID: fa9c9d10e97e38d9903fad1829535175ad261f45 Gitweb: http://git.kernel.org/tip/fa9c9d10e97e38d9903fad1829535175ad261f45 Author: Wanpeng Li AuthorDate: Fri, 27 Mar 2015 07:08:35 +0800 Committer: Ingo Molnar CommitDate: Thu, 2 Apr 2015 17:42:57 +0200 sched/deadline: Support DL task migration during CPU hotplug I observed that DL tasks can't be migrated to other CPUs during CPU hotplug; in addition, the task may or may not be running again if the CPU is added back. The root cause I found is that DL tasks will be throttled and removed from the DL rq after consuming all of their budget, which leads to the situation that the stop task can't pick them up from the DL rq and migrate them to other CPUs during hotplug. The method to reproduce: schedtool -E -t 5:10 -e ./test Actually './test' is just a simple for loop. Then observe which CPU the test task is on and offline it: echo 0 > /sys/devices/system/cpu/cpuN/online This patch adds DL task migration during CPU hotplug by finding the most suitable later deadline rq after the DL timer fires, if the current rq is offline. If it fails to find a suitable later deadline rq, it falls back to any eligible online CPU, so that the deadline task will come back to us and the push/pull mechanism should then move it around properly. Suggested-and-Acked-by: Juri Lelli Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Link: http://lkml.kernel.org/r/1427411315-4298-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 9d3ad64..5e95145 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -218,6 +218,52 @@ static inline void set_post_schedule(struct rq *rq) rq->post_schedule = has_pushable_dl_tasks(rq); } +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq); + +static void dl_task_offline_migration(struct rq *rq, struct task_struct *p) +{ + struct rq *later_rq = NULL; + bool fallback = false; + + later_rq = find_lock_later_rq(p, rq); + + if (!later_rq) { + int cpu; + + /* +* If we cannot preempt any rq, fall back to pick any +* online cpu. +*/ + fallback = true; + cpu = cpumask_any_and(cpu_active_mask, tsk_cpus_allowed(p)); + if (cpu >= nr_cpu_ids) { + /* +* Fail to find any suitable cpu. +* The task will never come back! +*/ + BUG_ON(dl_bandwidth_enabled()); + + /* +* If admission control is disabled we +* try a little harder to let the task +* run. +*/ + cpu = cpumask_any(cpu_active_mask); + } + later_rq = cpu_rq(cpu); + double_lock_balance(rq, later_rq); + } + + deactivate_task(rq, p, 0); + set_task_cpu(p, later_rq->cpu); + activate_task(later_rq, p, ENQUEUE_REPLENISH); + + if (!fallback) + resched_curr(later_rq); + + double_unlock_balance(rq, later_rq); +} + #else static inline @@ -536,6 +582,17 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) sched_clock_tick(); update_rq_clock(rq); +#ifdef CONFIG_SMP + /* +* If we find that the rq the task was on is no longer +* available, we need to select a new rq. +*/ + if (unlikely(!rq->online)) { + dl_task_offline_migration(rq, p); + goto unlock; + } +#endif + /* * If the throttle happened during sched-out; like: *
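The fallback chain in dl_task_offline_migration() -- active-and-allowed CPUs first, then any active CPU when admission control is off -- can be modeled with plain bitmasks. A sketch using a 64-bit word as the cpumask and a GCC/Clang builtin for the search (illustrative, not the kernel's cpumask API):

#include <stdint.h>
#include <stdio.h>

#define NR_CPU_IDS 64

/* first set bit, or NR_CPU_IDS if empty (a cpumask_any_and-like helper) */
static int mask_any(uint64_t mask)
{
	return mask ? __builtin_ffsll(mask) - 1 : NR_CPU_IDS;
}

int main(void)
{
	uint64_t cpu_active = 0xf0;   /* CPUs 4-7 active */
	uint64_t allowed    = 0x0f;   /* task affine to CPUs 0-3 */

	int cpu = mask_any(cpu_active & allowed);
	if (cpu >= NR_CPU_IDS) {
		/* no suitable cpu; admission control off: try any active */
		cpu = mask_any(cpu_active);
	}
	printf("fallback rq: CPU %d\n", cpu);   /* 4 */
	return 0;
}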
[tip:sched/core] sched/deadline: Fix rt runtime corruption when dl fails its global constraints
Commit-ID: a1963b81deec54c113e770b0020e5f1c3188a087 Gitweb: http://git.kernel.org/tip/a1963b81deec54c113e770b0020e5f1c3188a087 Author: Wanpeng Li AuthorDate: Tue, 17 Mar 2015 19:15:31 +0800 Committer: Ingo Molnar CommitDate: Fri, 27 Mar 2015 09:36:15 +0100 sched/deadline: Fix rt runtime corruption when dl fails its global constraints One version of sched_rt_global_constraints() (the !rt-cgroup one) changes state; therefore, if we fail the later sched_dl_global_constraints() call, we are left in an inconsistent state. Fix this by changing the order of the calls. Signed-off-by: Wanpeng Li [ Improved the changelog. ] Signed-off-by: Peter Zijlstra (Intel) Acked-by: Juri Lelli Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 043e2a1..4b3b688 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7804,7 +7804,7 @@ static int sched_rt_global_constraints(void) } #endif /* CONFIG_RT_GROUP_SCHED */ -static int sched_dl_global_constraints(void) +static int sched_dl_global_validate(void) { u64 runtime = global_rt_runtime(); u64 period = global_rt_period(); @@ -7905,11 +7905,11 @@ int sched_rt_handler(struct ctl_table *table, int write, if (ret) goto undo; - ret = sched_rt_global_constraints(); + ret = sched_dl_global_validate(); if (ret) goto undo; - ret = sched_dl_global_constraints(); + ret = sched_rt_global_constraints(); if (ret) goto undo;
[tip:sched/core] sched/deadline: Avoid a superfluous check
Commit-ID: bd4bde14b93cce8fa77765ff709e0be55abdba2c Gitweb: http://git.kernel.org/tip/bd4bde14b93cce8fa77765ff709e0be55abdba2c Author: Wanpeng Li AuthorDate: Tue, 17 Mar 2015 19:15:30 +0800 Committer: Ingo Molnar CommitDate: Fri, 27 Mar 2015 09:36:12 +0100 sched/deadline: Avoid a superfluous check Since commit 40767b0dc768 ("sched/deadline: Fix deadline parameter modification handling") we clear the throttled state when switching from a dl task, therefore we should never find it set when switching to a dl task. Signed-off-by: Wanpeng Li [ Improved the changelog. ] Signed-off-by: Peter Zijlstra (Intel) Acked-by: Juri Lelli Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 0a81a95..24c18dc 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1665,14 +1665,6 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p) { int check_resched = 1; - /* -* If p is throttled, don't consider the possibility -* of preempting rq->curr, the check will be done right -* after its runtime will get replenished. -*/ - if (unlikely(p->dl.dl_throttled)) - return; - if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP if (p->nr_cpus_allowed > 1 && rq->dl.overloaded &&
[tip:sched/core] sched/deadline: Add rq->clock update skip for dl task yield
Commit-ID: 44fb085bfa17628c6d2aaa6af6b292a8499e9cbd Gitweb: http://git.kernel.org/tip/44fb085bfa17628c6d2aaa6af6b292a8499e9cbd Author: Wanpeng Li AuthorDate: Tue, 10 Mar 2015 12:20:00 +0800 Committer: Ingo Molnar CommitDate: Tue, 10 Mar 2015 05:46:50 +0100 sched/deadline: Add rq->clock update skip for dl task yield This patch adds rq->clock update skip for SCHED_DEADLINE task yield, to tell update_rq_clock() that we've just updated the clock, so that we don't do a microscopic update in schedule() and double the fastpath cost. Signed-off-by: Wanpeng Li Cc: Juri Lelli Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/1425961200-3809-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 3fa8fa6..0a81a95 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -914,6 +914,12 @@ static void yield_task_dl(struct rq *rq) } update_rq_clock(rq); update_curr_dl(rq); + /* +* Tell update_rq_clock() that we've just updated, +* so we don't do microscopic update in schedule() +* and double the fastpath cost. +*/ + rq_clock_skip_update(rq, true);
[tip:sched/core] sched: Fix hrtick_start() on UP
Commit-ID: 868933359a3bdda25b562e9d41bce7071edc1b08 Gitweb: http://git.kernel.org/tip/868933359a3bdda25b562e9d41bce7071edc1b08 Author: Wanpeng Li AuthorDate: Wed, 26 Nov 2014 08:44:06 +0800 Committer: Ingo Molnar CommitDate: Wed, 4 Feb 2015 07:52:28 +0100 sched: Fix hrtick_start() on UP Commit 177ef2a6315e ("sched/deadline: Fix a precision problem in the microseconds range") forgot to change the UP version of hrtick_start(); do so now. Signed-off-by: Wanpeng Li Fixes: 177ef2a6315e ("sched/deadline: Fix a precision problem in the microseconds range") [ Fixed the changelog. ] Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d59652d..5091fd4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -492,6 +492,11 @@ static __init void init_hrtick(void) */ void hrtick_start(struct rq *rq, u64 delay) { + /* +* Don't schedule slices shorter than 10000ns, that just +* doesn't make sense. Rely on vruntime for fairness. +*/ + delay = max_t(u64, delay, 10000LL); __hrtimer_start_range_ns(&rq->hrtick_timer, ns_to_ktime(delay), 0, HRTIMER_MODE_REL_PINNED, 0); }
[tip:sched/core] sched/deadline: Avoid pointless __setscheduler()
Commit-ID: 75381608e8410a72ae8b4080849dc86b472c01fb Gitweb: http://git.kernel.org/tip/75381608e8410a72ae8b4080849dc86b472c01fb Author: Wanpeng Li AuthorDate: Wed, 26 Nov 2014 08:44:04 +0800 Committer: Ingo Molnar CommitDate: Wed, 4 Feb 2015 07:52:27 +0100 sched/deadline: Avoid pointless __setscheduler() There is no need to dequeue/enqueue and push/pull if there are no scheduling parameters changed for the DL class. Both fair and RT classes already check if parameters changed for them to avoid unnecessary overhead. This patch adds the parameters changed test for the DL class in order to reduce overhead. Signed-off-by: Wanpeng Li [ Fixed up the changelog. ] Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1416962647-76792-5-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 50a5352..d59652d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3417,6 +3417,20 @@ static bool check_same_owner(struct task_struct *p) return match; } +static bool dl_param_changed(struct task_struct *p, + const struct sched_attr *attr) +{ + struct sched_dl_entity *dl_se = &p->dl; + + if (dl_se->dl_runtime != attr->sched_runtime || + dl_se->dl_deadline != attr->sched_deadline || + dl_se->dl_period != attr->sched_period || + dl_se->flags != attr->sched_flags) + return true; + + return false; +} + static int __sched_setscheduler(struct task_struct *p, const struct sched_attr *attr, bool user) @@ -3545,7 +3559,7 @@ recheck: goto change; if (rt_policy(policy) && attr->sched_priority != p->rt_priority) goto change; - if (dl_policy(policy)) + if (dl_policy(policy) && dl_param_changed(p, attr)) goto change; p->sched_reset_on_fork = reset_on_fork;
[tip:sched/core] sched/deadline: Fix hrtick for a non-leftmost task
Commit-ID: a7bebf488791aa1036f3e6629daf01d01f705dcb Gitweb: http://git.kernel.org/tip/a7bebf488791aa1036f3e6629daf01d01f705dcb Author: Wanpeng Li AuthorDate: Wed, 26 Nov 2014 08:44:01 +0800 Committer: Ingo Molnar CommitDate: Wed, 4 Feb 2015 07:52:25 +0100 sched/deadline: Fix hrtick for a non-leftmost task After update_curr_dl() the current task might not be the leftmost task anymore. In that case do not start a new hrtick for it. In this case NEED_RESCHED will be set and the next schedule will start the hrtick for the new task if and when appropriate. Signed-off-by: Wanpeng Li Acked-by: Juri Lelli [ Rewrote the changelog and comment. ] Signed-off-by: Peter Zijlstra (Intel) Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1416962647-76792-2-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index e0e9c29..7b684f9 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1073,7 +1073,13 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued) { update_curr_dl(rq); - if (hrtick_enabled(rq) && queued && p->dl.runtime > 0) + /* +* Even when we have runtime, update_curr_dl() might have resulted in us +* not being the leftmost task anymore. In that case NEED_RESCHED will +* be set and schedule() will start a new hrtick for the next task. +*/ + if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 && + is_leftmost(p, &rq->dl)) start_hrtick_dl(rq, p); }
[tip:sched/core] sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
Commit-ID: 36ce98818a4df66c8134c31fd6e768b4119c7a90 Gitweb: http://git.kernel.org/tip/36ce98818a4df66c8134c31fd6e768b4119c7a90 Author: Wanpeng Li AuthorDate: Tue, 11 Nov 2014 09:52:26 +0800 Committer: Ingo Molnar CommitDate: Sun, 16 Nov 2014 10:59:03 +0100 sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK to align with the fair class. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 9594c12..e5db8c6 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1013,6 +1013,10 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p) { hrtick_start(rq, p->dl.runtime); } +#else /* !CONFIG_SCHED_HRTICK */ +static void start_hrtick_dl(struct rq *rq, struct task_struct *p) +{ +} #endif static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq, @@ -1066,10 +1070,8 @@ struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev) /* Running task will never be pushed. */ dequeue_pushable_dl_task(rq, p); -#ifdef CONFIG_SCHED_HRTICK if (hrtick_enabled(rq)) start_hrtick_dl(rq, p); -#endif set_post_schedule(rq); @@ -1088,10 +1090,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued) { update_curr_dl(rq); -#ifdef CONFIG_SCHED_HRTICK if (hrtick_enabled(rq) && queued && p->dl.runtime > 0) start_hrtick_dl(rq, p); -#endif } static void task_fork_dl(struct task_struct *p)
[tip:sched/core] sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
Commit-ID: c51b8ab5ad972df26fd9c0ffad34870e98273c4c Gitweb: http://git.kernel.org/tip/c51b8ab5ad972df26fd9c0ffad34870e98273c4c Author: Wanpeng Li AuthorDate: Thu, 6 Nov 2014 15:22:44 +0800 Committer: Ingo Molnar CommitDate: Sun, 16 Nov 2014 10:58:57 +0100 sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task() Do not call dequeue_pushable_dl_task() when failing to push an eligible task, as it remains pushable, merely not at this particular moment. This matches the behavior of commit 311e800e16f6 ("sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task()") on the -rt side. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index bb1464b..9594c12 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1328,6 +1328,7 @@ static int push_dl_task(struct rq *rq) { struct task_struct *next_task; struct rq *later_rq; + int ret = 0; if (!rq->dl.overloaded) return 0; @@ -1373,7 +1374,6 @@ retry: * The task is still there. We don't try * again, some other cpu will pull it when ready. */ - dequeue_pushable_dl_task(rq, next_task); goto out; } @@ -1389,6 +1389,7 @@ retry: deactivate_task(rq, next_task, 0); set_task_cpu(next_task, later_rq->cpu); activate_task(later_rq, next_task, 0); + ret = 1; resched_curr(later_rq); @@ -1397,7 +1398,7 @@ retry: out: put_task_struct(next_task); - return 1; + return ret; } static void push_dl_tasks(struct rq *rq)
[tip:sched/core] sched/fair: Fix stale overloaded status in the busiest group finding logic
Commit-ID: cb0b9f2445cdf9893352e4548582a2892af7137c Gitweb: http://git.kernel.org/tip/cb0b9f2445cdf9893352e4548582a2892af7137c Author: Wanpeng Li AuthorDate: Wed, 5 Nov 2014 07:44:50 +0800 Committer: Ingo Molnar CommitDate: Sun, 16 Nov 2014 10:58:56 +0100 sched/fair: Fix stale overloaded status in the busiest group finding logic Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd") changes groups to be ranked in the order of overloaded > imbalance > other, and the busiest group is picked according to this order. sgs->group_capacity_factor is used to check if the group is overloaded. When the child domain prefers tasks to go to siblings first, the sgs->group_capacity_factor will be set lower than one in order to move all the excess tasks away. However, the group overloaded status is not updated when sgs->group_capacity_factor is set to lower than one, which means we can fail to find the busiest group. This patch fixes it by updating the group overloaded status when the sg capacity factor is set to one, in order to find the busiest group accurately. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Rik van Riel Cc: Vincent Guittot Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng...@linux.intel.com [ Fixed the changelog. ] Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8bca292..df2cdf7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6352,8 +6352,10 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local && - sds->local_stat.group_has_free_capacity) + sds->local_stat.group_has_free_capacity) { sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U); + sgs->group_type = group_classify(sg, sgs); + } if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg;
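The fix boils down to re-running the classification after the capacity factor is clamped, so that a group squeezed down to one slot is ranked as overloaded. A reduced model of that ordering (the real group_classify() considers more states; this keeps only the overloaded check):

#include <stdio.h>

enum group_type { group_other, group_imbalanced, group_overloaded };

struct sg_stats {
	unsigned int nr_running;
	unsigned int capacity_factor;
	enum group_type type;
};

/* simplified classify: more tasks than capacity => overloaded */
static enum group_type classify(const struct sg_stats *s)
{
	if (s->nr_running > s->capacity_factor)
		return group_overloaded;
	return group_other;
}

int main(void)
{
	struct sg_stats sgs = { .nr_running = 2, .capacity_factor = 2 };

	sgs.type = classify(&sgs);     /* group_other so far */

	/* prefer_sibling: squeeze the group down to one task... */
	sgs.capacity_factor = 1;
	sgs.type = classify(&sgs);     /* ...and re-classify (the fix) */

	printf("type = %d (2 == overloaded)\n", sgs.type);
	return 0;
}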
[tip:sched/core] sched: Move p->nr_cpus_allowed check to select_task_rq()
Commit-ID: 6c1d9410f007a26d13173cf17204cfd965f49b83 Gitweb: http://git.kernel.org/tip/6c1d9410f007a26d13173cf17204cfd965f49b83 Author: Wanpeng Li AuthorDate: Wed, 5 Nov 2014 09:14:37 +0800 Committer: Ingo Molnar CommitDate: Sun, 16 Nov 2014 10:58:55 +0100 sched: Move p->nr_cpus_allowed check to select_task_rq() Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq(). This change will make fair.c, rt.c, and deadline.c all start with the same logic. Suggested-and-Acked-by: Steven Rostedt Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: "pang.xunlei" Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 3 ++- kernel/sched/deadline.c | 3 --- kernel/sched/fair.c | 3 --- kernel/sched/rt.c | 3 --- 4 files changed, 2 insertions(+), 10 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3ccdce1..d44d0c5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1411,7 +1411,8 @@ out: static inline int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags) { - cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags); + if (p->nr_cpus_allowed > 1) + cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags); /* * In order not to call set_task_cpu() on a blocking task we need diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index b091179..bb1464b 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -928,9 +928,6 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) struct task_struct *curr; struct rq *rq; - if (p->nr_cpus_allowed == 1) - goto out; - if (sd_flag != SD_BALANCE_WAKE) goto out; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d11c57d..8bca292 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4730,9 +4730,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f int want_affine = 0; int sync = wake_flags & WF_SYNC; - if (p->nr_cpus_allowed == 1) - return prev_cpu; - if (sd_flag & SD_BALANCE_WAKE) want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index f1bb92f..ee15f5a 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1301,9 +1301,6 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags) struct task_struct *curr; struct rq *rq; - if (p->nr_cpus_allowed == 1) - goto out; - /* For anything but wake ups, just return the task_cpu */ if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK) goto out;
[tip:sched/core] sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
Commit-ID: cad3bb32e181c286c46ec12b2deb1f26a6f9835d Gitweb: http://git.kernel.org/tip/cad3bb32e181c286c46ec12b2deb1f26a6f9835d Author: Wanpeng Li AuthorDate: Fri, 31 Oct 2014 06:39:36 +0800 Committer: Ingo Molnar CommitDate: Tue, 4 Nov 2014 07:17:56 +0100 sched/deadline: Don't check CONFIG_SMP in switched_from_dl() There are both UP and SMP versions of pull_dl_task(), so there is no need to check CONFIG_SMP in switched_from_dl(). Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 362ab1f..f3d7776 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1637,7 +1637,6 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) __dl_clear_params(p); -#ifdef CONFIG_SMP /* * Since this might be the only -deadline task on the rq, * this is the right place to try to pull some other one @@ -1648,7 +1647,6 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) if (pull_dl_task(rq)) resched_curr(rq); -#endif } /*
[tip:sched/core] sched/deadline: Fix artificial overrun introduced by yield_task_dl()
Commit-ID: 804968809c321066cca028d4cbd533a420f964bc Gitweb: http://git.kernel.org/tip/804968809c321066cca028d4cbd533a420f964bc Author: Wanpeng Li AuthorDate: Fri, 31 Oct 2014 06:39:32 +0800 Committer: Ingo Molnar CommitDate: Tue, 4 Nov 2014 07:17:53 +0100 sched/deadline: Fix artificial overrun introduced by yield_task_dl() The yield semantic of the deadline class is to reduce the remaining runtime to zero, after which update_curr_dl() will stop the task. However, the consumed bandwidth is subtracted from the yielding task's budget again even though the runtime has already been set to zero, which leads to an artificial overrun. This patch fixes it by making sure update_curr_dl() does not steal any more time from a task that has yielded. Suggested-by: Juri Lelli Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 9d483e8..c047a94 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -628,7 +628,7 @@ static void update_curr_dl(struct rq *rq) sched_rt_avg_update(rq, delta_exec); - dl_se->runtime -= delta_exec; + dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec; if (dl_runtime_exceeded(rq, dl_se)) { __dequeue_task_dl(rq, curr, 0); if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
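As a standalone illustration (hypothetical harness, delta in nanoseconds), the added guard keeps the accounting from charging a yielded task whose runtime was already zeroed:

#include <stdio.h>

struct dl_entity { long long runtime; int dl_yielded; };

/* Models the fixed accounting line: a task that yielded gave up its
 * remaining budget, so no further time is subtracted from it. */
static void account(struct dl_entity *dl, long long delta_exec)
{
	dl->runtime -= dl->dl_yielded ? 0 : delta_exec;
}

int main(void)
{
	/* yield_task_dl() has reduced the remaining runtime to zero */
	struct dl_entity dl = { .runtime = 0, .dl_yielded = 1 };

	account(&dl, 50000);	/* the next update_curr_dl() tick */
	printf("runtime = %lld (no artificial overrun)\n", dl.runtime);
	return 0;
}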
[tip:sched/core] sched/deadline: Reschedule from switched_from_dl() after a successful pull
Commit-ID: cd66091162d34f589631a23bbe0ed214798245b4 Gitweb: http://git.kernel.org/tip/cd66091162d34f589631a23bbe0ed214798245b4 Author: Wanpeng Li AuthorDate: Fri, 31 Oct 2014 06:39:35 +0800 Committer: Ingo Molnar CommitDate: Tue, 4 Nov 2014 07:17:55 +0100 sched/deadline: Reschedule from switched_from_dl() after a successful pull In switched_from_dl() we have to issue a resched if we successfully pulled a task from other CPUs. This also aligns the behavior with -rt. Suggested-by: Juri Lelli Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index e7779b3..362ab1f 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1643,8 +1643,11 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) * this is the right place to try to pull some other one * from an overloaded cpu, if any. */ - if (!rq->dl.dl_nr_running) - pull_dl_task(rq); + if (!task_on_rq_queued(p) || rq->dl.dl_nr_running) + return; + + if (pull_dl_task(rq)) + resched_curr(rq); #endif }
[tip:sched/core] sched/deadline: Push task away if the deadline is equal to curr during wakeup
Commit-ID: 6b0a563f3a534827c1b56e53c3fd0fccec3c7895 Gitweb: http://git.kernel.org/tip/6b0a563f3a534827c1b56e53c3fd0fccec3c7895 Author: Wanpeng Li AuthorDate: Fri, 31 Oct 2014 06:39:34 +0800 Committer: Ingo Molnar CommitDate: Tue, 4 Nov 2014 07:17:55 +0100 sched/deadline: Push task away if the deadline is equal to curr during wakeup This patch pushes the task away during wakeup if its deadline is equal to curr's, matching the behavior of the rt class. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 8867a67..e7779b3 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1506,7 +1506,7 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p) p->nr_cpus_allowed > 1 && dl_task(rq->curr) && (rq->curr->nr_cpus_allowed < 2 || - dl_entity_preempt(&rq->curr->dl, &p->dl))) { + !dl_entity_preempt(&p->dl, &rq->curr->dl))) { push_dl_tasks(rq); } }
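The condition flip matters exactly in the boundary case. In a simplified model (ignoring the boost handling of the real dl_entity_preempt()), a preempts b only when a's absolute deadline is strictly earlier, so only the new form pushes the wakee when the deadlines are equal:

#include <stdio.h>
#include <stdbool.h>

/* Simplified dl_entity_preempt(): strictly earlier deadline wins. */
static bool dl_entity_preempt(unsigned long long a, unsigned long long b)
{
	return a < b;
}

int main(void)
{
	unsigned long long curr_dl = 1000, p_dl = 1000;	/* equal deadlines */

	/* old: push p only if curr strictly preempts p -> false here */
	printf("old condition pushes p: %d\n", dl_entity_preempt(curr_dl, p_dl));
	/* new: push p unless p strictly preempts curr -> true here */
	printf("new condition pushes p: %d\n", !dl_entity_preempt(p_dl, curr_dl));
	return 0;
}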
[tip:sched/core] sched/rt: Clean up check_preempt_equal_prio()
Commit-ID: 308a623a40ce168eb234ea82c2bd13ff85a098d9 Gitweb: http://git.kernel.org/tip/308a623a40ce168eb234ea82c2bd13ff85a098d9 Author: Wanpeng Li AuthorDate: Fri, 31 Oct 2014 06:39:31 +0800 Committer: Ingo Molnar CommitDate: Tue, 4 Nov 2014 07:17:52 +0100 sched/rt: Clean up check_preempt_equal_prio() This patch checks up front whether current can be pushed/pulled somewhere else, to make the logic clear; this matches the behavior of the dl class. - If current can't be migrated, rescheduling is useless; let's hope the task can move out. - If the task is migratable, let's not schedule it and instead see if it can be pushed or pulled somewhere else. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Steven Rostedt Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/rt.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index d024e6c..3d14312 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1351,16 +1351,22 @@ out: static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) { - if (rq->curr->nr_cpus_allowed == 1) + /* + * Current can't be migrated, useless to reschedule, + * let's hope p can move out. + */ + if (rq->curr->nr_cpus_allowed == 1 || + !cpupri_find(&rq->rd->cpupri, rq->curr, NULL)) return; + /* + * p is migratable, so let's not schedule it and + * see if it is pushed or pulled somewhere else. + */ if (p->nr_cpus_allowed != 1 && cpupri_find(&rq->rd->cpupri, p, NULL)) return; - if (!cpupri_find(&rq->rd->cpupri, rq->curr, NULL)) - return; - /* * There appears to be other cpus that can accept * current and none to run 'p', so lets reschedule
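Read as a decision table, the reordered function first asks whether current has anywhere to go, then whether p does. A standalone model (the booleans stand in for the nr_cpus_allowed and cpupri_find() results; not kernel code) makes the three outcomes explicit:

#include <stdio.h>
#include <stdbool.h>

/* Shape of check_preempt_equal_prio() after this patch. */
static const char *equal_prio_decision(bool curr_pinned, bool cpu_for_curr,
				       bool p_pinned, bool cpu_for_p)
{
	/* current can't migrate (or nowhere accepts it): hope p moves out */
	if (curr_pinned || !cpu_for_curr)
		return "do nothing";

	/* p is migratable and has somewhere to go: it will be pushed */
	if (!p_pinned && cpu_for_p)
		return "do nothing (p gets pushed)";

	/* other CPUs can take current but none can run p: reschedule */
	return "resched_curr";
}

int main(void)
{
	printf("%s\n", equal_prio_decision(false, true, true, false));
	return 0;
}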
[tip:sched/core] sched/deadline: Add deadline rq status print
Commit-ID: acb32132ec0433c03bed750f3e9508dc29db0328 Gitweb: http://git.kernel.org/tip/acb32132ec0433c03bed750f3e9508dc29db0328 Author: Wanpeng Li AuthorDate: Fri, 31 Oct 2014 06:39:33 +0800 Committer: Ingo Molnar CommitDate: Tue, 4 Nov 2014 07:17:54 +0100 sched/deadline: Add deadline rq status print This patch adds a deadline rq status printout to the scheduler debug output. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Juri Lelli Cc: Kirill Tkhai Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 9 +++++++++ kernel/sched/debug.c | 7 +++++++ kernel/sched/sched.h | 1 + 3 files changed, 17 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index c047a94..8867a67 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1747,3 +1747,12 @@ const struct sched_class dl_sched_class = { .switched_from = switched_from_dl, .switched_to = switched_to_dl, }; + +#ifdef CONFIG_SCHED_DEBUG +extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq); + +void print_dl_stats(struct seq_file *m, int cpu) +{ + print_dl_rq(m, cpu, &cpu_rq(cpu)->dl); +} +#endif /* CONFIG_SCHED_DEBUG */ diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index ce33780..eeb6046 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -261,6 +261,12 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq) #undef P } +void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq) +{ + SEQ_printf(m, "\ndl_rq[%d]:\n", cpu); + SEQ_printf(m, " .%-30s: %ld\n", "dl_nr_running", dl_rq->dl_nr_running); +} + extern __read_mostly int sched_clock_running; static void print_cpu(struct seq_file *m, int cpu) @@ -329,6 +335,7 @@ do { \ spin_lock_irqsave(&sched_debug_lock, flags); print_cfs_stats(m, cpu); print_rt_stats(m, cpu); + print_dl_stats(m, cpu); print_rq(m, rq, cpu); spin_unlock_irqrestore(&sched_debug_lock, flags); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 49b941f..7e5c1ee 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1537,6 +1537,7 @@ extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq); extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq); extern void print_cfs_stats(struct seq_file *m, int cpu); extern void print_rt_stats(struct seq_file *m, int cpu); +extern void print_dl_stats(struct seq_file *m, int cpu); extern void init_cfs_rq(struct cfs_rq *cfs_rq); extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
[tip:sched/core] sched/deadline: Don't balance during wakeup if wakee is pinned
Commit-ID: f4e9d94a5bf60193d45f92b136e3d166be3ec8d5 Gitweb: http://git.kernel.org/tip/f4e9d94a5bf60193d45f92b136e3d166be3ec8d5 Author: Wanpeng Li AuthorDate: Tue, 14 Oct 2014 10:22:40 +0800 Committer: Ingo Molnar CommitDate: Tue, 28 Oct 2014 10:48:02 +0100 sched/deadline: Don't balance during wakeup if wakee is pinned Use nr_cpus_allowed to bail out of select_task_rq() when only one CPU can be used; this saves some cycles for pinned tasks. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index fab3bf8..2e31a30 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -933,6 +933,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) struct task_struct *curr; struct rq *rq; + if (p->nr_cpus_allowed == 1) + goto out; + if (sd_flag != SD_BALANCE_WAKE) goto out;
[tip:sched/core] sched/deadline: Don't check SD_BALANCE_FORK
Commit-ID: 1d7e974cbf2fce2683f34ff33c173fd7ef5478c7 Gitweb: http://git.kernel.org/tip/1d7e974cbf2fce2683f34ff33c173fd7ef5478c7 Author: Wanpeng Li AuthorDate: Tue, 14 Oct 2014 10:22:39 +0800 Committer: Ingo Molnar CommitDate: Tue, 28 Oct 2014 10:48:01 +0100 sched/deadline: Don't check SD_BALANCE_FORK There is no need to balance during fork since SCHED_DEADLINE tasks can't fork. This patch avoids the SD_BALANCE_FORK check. Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng...@linux.intel.com Cc: Linus Torvalds Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 8aaa971..fab3bf8 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -933,7 +933,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags) struct task_struct *curr; struct rq *rq; - if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK) + if (sd_flag != SD_BALANCE_WAKE) goto out; rq = cpu_rq(cpu);
[tip:sched/core] sched/deadline: Do not try to push tasks if pinned task switches to dl
Commit-ID: d9aade7ae1d283097a3f626790e7c325a5c69007 Gitweb: http://git.kernel.org/tip/d9aade7ae1d283097a3f626790e7c325a5c69007 Author: Wanpeng Li AuthorDate: Wed, 22 Oct 2014 08:36:43 +0800 Committer: Ingo Molnar CommitDate: Tue, 28 Oct 2014 10:47:57 +0100 sched/deadline: Do not try to push tasks if pinned task switches to dl As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118): | If rq has already had 2 or more pushable tasks and we try to add a | pinned task then call of push_rt_task will just waste a time. A pinned task that has just switched class cannot be pushed, and if the rq already had several dl tasks before, they have already been considered as candidates to be pushed (or pulled). This patch implements the same behavior as the rt class, introduced by commit 10447917551e ("sched/rt: Do not try to push tasks if pinned task switches to RT"). Suggested-by: Kirill V Tkhai Acked-by: Juri Lelli Signed-off-by: Wanpeng Li Signed-off-by: Peter Zijlstra (Intel) Cc: Steven Rostedt Cc: Linus Torvalds Link: http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- kernel/sched/deadline.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 5285332..9d1e76a 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1622,7 +1622,8 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p) if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP - if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p)) + if (p->nr_cpus_allowed > 1 && rq->dl.overloaded && + push_dl_task(rq) && rq != task_rq(p)) /* Only reschedule if pushing failed */ check_resched = 0; #endif /* CONFIG_SMP */
[tip:sched/urgent] sched: Fix unreleased llc_shared_mask bit during CPU hotplug
Commit-ID: 03bd4e1f7265548832a76e7919a81f3137c44fd1 Gitweb: http://git.kernel.org/tip/03bd4e1f7265548832a76e7919a81f3137c44fd1 Author: Wanpeng Li AuthorDate: Wed, 24 Sep 2014 16:38:05 +0800 Committer: Ingo Molnar CommitDate: Wed, 24 Sep 2014 15:13:20 +0200 sched: Fix unreleased llc_shared_mask bit during CPU hotplug The following bug can be triggered by hot adding and removing a large number of xen domain0's vcpus repeatedly: BUG: unable to handle kernel NULL pointer dereference at 0004 IP: [..] find_busiest_group PGD 5a9d5067 PUD 13067 PMD 0 Oops: [#3] SMP [...] Call Trace: load_balance ? _raw_spin_unlock_irqrestore idle_balance __schedule schedule schedule_timeout ? lock_timer_base schedule_timeout_uninterruptible msleep lock_device_hotplug_sysfs online_store dev_attr_store sysfs_write_file vfs_write SyS_write system_call_fastpath The last level cache shared mask is built during CPU up, and the build_sched_domain() routine takes advantage of it to set up the sched domain CPU topology. However, llc_shared_mask is not released during CPU disable, which leads to an invalid sched domain CPU topology. This patch fixes it by releasing the llc_shared_mask correctly during CPU disable. Yasuaki also reported that this can happen on real hardware: https://lkml.org/lkml/2014/7/22/1018 His case is here: == Here is an example on my system. My system has 4 sockets and each socket has 15 cores and HT is enabled. In this case, each core of the sockets is numbered as follows:

         | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-44, 90-104
Socket#3 | 45-59, 105-119

Then llc_shared_mask of CPU#30 has 0x3fff8001fffc000. It means that the last level cache of Socket#2 is shared with CPU#30-44 and 90-104. When hot-removing socket#2 and #3, each core of the sockets is numbered as follows:

         | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89

But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30 remains having 0x3fff8001fffc000. After that, when hot-adding socket#2 and #3, each core of the sockets is numbered as follows:

         | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-59
Socket#3 | 90-119

Then llc_shared_mask of CPU#30 becomes 0x3fff8000fffc000. It means that the last level cache of Socket#2 is shared with CPU#30-59 and 90-104. So the mask has the wrong value. Signed-off-by: Wanpeng Li Tested-by: Linn Crosetto Reviewed-by: Borislav Petkov Reviewed-by: Toshi Kani Reviewed-by: Yasuaki Ishimatsu Cc: Cc: David Rientjes Cc: Prarit Bhargava Cc: Steven Rostedt Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng...@linux.intel.com Signed-off-by: Ingo Molnar --- arch/x86/kernel/smpboot.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 2d872e0..42a2dca 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1284,6 +1284,9 @@ static void remove_siblinginfo(int cpu) for_each_cpu(sibling, cpu_sibling_mask(cpu)) cpumask_clear_cpu(cpu, cpu_sibling_mask(sibling)); + for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) + cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling)); + cpumask_clear(cpu_llc_shared_mask(cpu)); cpumask_clear(cpu_sibling_mask(cpu)); cpumask_clear(cpu_core_mask(cpu)); c->phys_proc_id = 0;
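A standalone sketch (plain bitmasks instead of struct cpumask, hypothetical harness) of the symmetry the patch restores: an offlined CPU's bit must be cleared from every LLC sibling's mask before its own mask is cleared, exactly as remove_siblinginfo() already does for cpu_sibling_mask:

#include <stdio.h>

#define NR_CPUS 8
static unsigned int llc_shared[NR_CPUS];	/* bit s set: shares LLC with CPU s */

static void remove_llc_info(int cpu)
{
	/* drop the departing CPU's bit from each sibling's mask ... */
	for (int s = 0; s < NR_CPUS; s++)
		if (llc_shared[cpu] & (1u << s))
			llc_shared[s] &= ~(1u << cpu);
	/* ... then clear the departing CPU's own mask */
	llc_shared[cpu] = 0;
}

int main(void)
{
	/* CPUs 0-3 share one last-level cache */
	for (int c = 0; c < 4; c++)
		llc_shared[c] = 0x0f;

	remove_llc_info(2);
	printf("llc_shared[0] = 0x%x (bit 2 cleared)\n", llc_shared[0]);
	return 0;
}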
[tip:sched/core] sched/numa: Fix period_slot recalculation
Commit-ID: e777b63bbd589248eb151a3191ee92331a701385 Gitweb: http://git.kernel.org/tip/e777b63bbd589248eb151a3191ee92331a701385 Author: Wanpeng Li AuthorDate: Thu, 12 Dec 2013 15:23:26 +0800 Committer: Ingo Molnar CommitDate: Tue, 17 Dec 2013 15:24:41 +0100 sched/numa: Fix period_slot recalculation The original code is as intended and was meant to scale the difference between the NUMA_PERIOD_THRESHOLD and local/remote ratio when adjusting the scan period. The period_slot recalculation can be dropped. Signed-off-by: Wanpeng Li Reviewed-by: Naoya Horiguchi Acked-by: Mel Gorman Acked-by: David Rientjes Signed-off-by: Peter Zijlstra Cc: Andrew Morton Cc: Rik van Riel Link: http://lkml.kernel.org/r/1386833006-6600-4-git-send-email-liw...@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 37892d7..4316af2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1342,7 +1342,6 @@ static void update_task_scan_period(struct task_struct *p, * scanning faster if shared accesses dominate as it may * simply bounce migrations uselessly */ - period_slot = DIV_ROUND_UP(diff, NUMA_PERIOD_SLOTS); ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared)); diff = (diff * ratio) / NUMA_PERIOD_SLOTS; }
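The two surviving lines implement the scaling the changelog describes. A standalone arithmetic sketch (hypothetical fault counts and diff value; the constant mirrors the kernel's NUMA_PERIOD_SLOTS) of what they compute:

#include <stdio.h>

#define NUMA_PERIOD_SLOTS 10
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	long private = 3, shared = 7;	/* shared faults dominate */
	long diff = -400;		/* raw scan-period decrease */

	/* fraction of faults that are private, in tenths: here 3/10 */
	long ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, private + shared);

	/* scale the decrease by that fraction, so heavy sharing damps
	 * the scan speed-up and migrations aren't bounced uselessly */
	diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
	printf("scaled adjustment: %ld\n", diff);	/* -120 */
	return 0;
}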
[tip:sched/core] sched/numa: Use wrapper function task_faults_idx to calculate index in group_faults
Commit-ID: 82897b4fd3920ffd2456731d4f2ae1406558ef4c Gitweb: http://git.kernel.org/tip/82897b4fd3920ffd2456731d4f2ae1406558ef4c Author: Wanpeng Li AuthorDate: Thu, 12 Dec 2013 15:23:25 +0800 Committer: Ingo Molnar CommitDate: Tue, 17 Dec 2013 15:24:40 +0100 sched/numa: Use wrapper function task_faults_idx to calculate index in group_faults Use wrapper function task_faults_idx to calculate index in group_faults. Signed-off-by: Wanpeng Li Reviewed-by: Naoya Horiguchi Acked-by: Mel Gorman Acked-by: David Rientjes Signed-off-by: Peter Zijlstra Cc: Andrew Morton Cc: Rik van Riel Link: http://lkml.kernel.org/r/1386833006-6600-3-git-send-email-liw...@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ebdb08b..37892d7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -921,7 +921,8 @@ static inline unsigned long group_faults(struct task_struct *p, int nid) if (!p->numa_group) return 0; - return p->numa_group->faults[2*nid] + p->numa_group->faults[2*nid+1]; + return p->numa_group->faults[task_faults_idx(nid, 0)] + + p->numa_group->faults[task_faults_idx(nid, 1)]; } /*
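The wrapper in question is a simple index encoding: two fault slots per node, so faults[2*nid] and faults[2*nid+1] become the two task_faults_idx() calls in the diff. Its definition in kernels of this era was essentially:

/* Encode a (node, slot) pair into the flat faults[] array. */
static inline int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;	/* priv selects one of the node's two slots */
}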
[tip:sched/core] sched/numa: Drop sysctl_numa_balancing_settle_count sysctl
Commit-ID: 1bd53a7efdc988163ec4c25f656df38dbe500632 Gitweb: http://git.kernel.org/tip/1bd53a7efdc988163ec4c25f656df38dbe500632 Author: Wanpeng Li AuthorDate: Thu, 12 Dec 2013 15:23:23 +0800 Committer: Ingo Molnar CommitDate: Tue, 17 Dec 2013 15:24:38 +0100 sched/numa: Drop sysctl_numa_balancing_settle_count sysctl Commit 887c290e ("sched/numa: Decide whether to favour task or group weights based on swap candidate relationships") dropped the check against sysctl_numa_balancing_settle_count; this patch removes the now-unused sysctl. Signed-off-by: Wanpeng Li Acked-by: Mel Gorman Reviewed-by: Rik van Riel Acked-by: David Rientjes Signed-off-by: Peter Zijlstra Cc: Andrew Morton Cc: Naoya Horiguchi Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liw...@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 5 ----- include/linux/sched/sysctl.h | 1 - kernel/sched/fair.c | 9 --------- kernel/sysctl.c | 7 ------- 4 files changed, 22 deletions(-) diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 26b7ee4..6d48640 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -428,11 +428,6 @@ rate for each task. numa_balancing_scan_size_mb is how many megabytes worth of pages are scanned for a given scan. -numa_balancing_settle_count is how many scan periods must complete before -the schedule balancer stops pushing the task towards a preferred node. This -gives the scheduler a chance to place the task on an alternative node if the -preferred node is overloaded. - numa_balancing_migrate_deferred is how many page migrations get skipped unconditionally, after a page migration is skipped because a page is shared with other tasks. This reduces page migration overhead, and determines diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 41467f8..31e0193 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -48,7 +48,6 @@ extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; -extern unsigned int sysctl_numa_balancing_settle_count; #ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a9185f7..fcb6c17 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -872,15 +872,6 @@ static unsigned int task_scan_max(struct task_struct *p) return max(smin, smax); } -/* - * Once a preferred node is selected the scheduler balancer will prefer moving - * a task to that node for sysctl_numa_balancing_settle_count number of PTE - * scans. This will give the process the chance to accumulate more faults on - * the preferred node but still allow the scheduler to move the task again if - * the nodes CPUs are overloaded. - */ -unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4; - static void account_numa_enqueue(struct rq *rq, struct task_struct *p) { rq->nr_numa_running += (p->numa_preferred_nid != -1); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a6047..c8da99f 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -385,13 +385,6 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, { - .procname = "numa_balancing_settle_count", - .data = &sysctl_numa_balancing_settle_count, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = proc_dointvec, - }, - { .procname = "numa_balancing_migrate_deferred", .data = &sysctl_numa_balancing_migrate_deferred, .maxlen = sizeof(unsigned int), .mode = 0644, .proc_handler = proc_dointvec, },
[tip:sched/core] sched/numa: Use wrapper function task_node to get node which task is on
Commit-ID: de1b301a19754778ddd9f908d266ffe1c010b2cf Gitweb: http://git.kernel.org/tip/de1b301a19754778ddd9f908d266ffe1c010b2cf Author: Wanpeng Li AuthorDate: Thu, 12 Dec 2013 15:23:24 +0800 Committer: Ingo Molnar CommitDate: Tue, 17 Dec 2013 15:24:39 +0100 sched/numa: Use wrapper function task_node to get node which task is on Use the wrapper function task_node() to get the node which a task is on. Signed-off-by: Wanpeng Li Acked-by: Mel Gorman Reviewed-by: Naoya Horiguchi Reviewed-by: Rik van Riel Acked-by: David Rientjes Signed-off-by: Peter Zijlstra Cc: Andrew Morton Link: http://lkml.kernel.org/r/1386833006-6600-2-git-send-email-liw...@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/sched/debug.c | 2 +- kernel/sched/fair.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 5c34d18..374fe04 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -139,7 +139,7 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p) 0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L); #endif #ifdef CONFIG_NUMA_BALANCING - SEQ_printf(m, " %d", cpu_to_node(task_cpu(p))); + SEQ_printf(m, " %d", task_node(p)); #endif #ifdef CONFIG_CGROUP_SCHED SEQ_printf(m, " %s", task_group_path(task_group(p))); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fcb6c17..ebdb08b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1202,7 +1202,7 @@ static int task_numa_migrate(struct task_struct *p) * elsewhere, so there is no point in (re)trying. */ if (unlikely(!sd)) { - p->numa_preferred_nid = cpu_to_node(task_cpu(p)); + p->numa_preferred_nid = task_node(p); return -EINVAL; } @@ -1269,7 +1269,7 @@ static void numa_migrate_preferred(struct task_struct *p) p->numa_migrate_retry = jiffies + HZ; /* Success if task is already running on preferred CPU */ - if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) + if (task_node(p) == p->numa_preferred_nid) return; /* Otherwise, try migrate to a CPU on the preferred node */
[tip:sched/core] sched/numa: Drop idx field of task_numa_env struct
Commit-ID: 40ea2b42d7c44386cf81d5636d574193da2c8df2 Gitweb: http://git.kernel.org/tip/40ea2b42d7c44386cf81d5636d574193da2c8df2 Author: Wanpeng Li AuthorDate: Thu, 5 Dec 2013 19:10:17 +0800 Committer: Ingo Molnar CommitDate: Thu, 5 Dec 2013 13:38:36 +0100 sched/numa: Drop idx field of task_numa_env struct Drop unused idx field of task_numa_env struct. Signed-off-by: Wanpeng Li Reviewed-by: Rik van Riel Cc: Peter Zijlstra Cc: Mel Gorman Cc: Naoya Horiguchi Cc: linux...@kvack.org Link: http://lkml.kernel.org/r/1386241817-5051-2-git-send-email-liw...@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a566c07..49aa01f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1037,7 +1037,7 @@ struct task_numa_env { struct numa_stats src_stats, dst_stats; - int imbalance_pct, idx; + int imbalance_pct; struct task_struct *best_task; long best_imp;