Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag
Hello Chenyu,

On 6/18/2024 1:19 PM, Chen Yu wrote:
[..snip..]
>>>>> Vincent [5] pointed out a case where the idle load kick will fail to
>>>>> run on an idle CPU since the IPI handler launching the ILB will check
>>>>> for need_resched(). In such cases, the idle CPU relies on
>>>>> newidle_balance() to pull tasks towards itself.
>>>>
>>>> Is this the need_resched() in _nohz_idle_balance()? Should we change
>>>> this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
>>>> something along those lines?
>>>
>>> It's not only this but also in do_idle() as well which exits the loop
>>> to look for tasks to schedule
>>
>> I mean, it's fairly trivial to figure out if there really is going to
>> be work there.
>>
>>>>> Using an alternate flag instead of NEED_RESCHED to indicate a pending
>>>>> IPI was suggested as the correct approach to solve this problem on
>>>>> the same thread.
>>>>
>>>> So adding per-arch changes for this seems like something we shouldn't
>>>> unless there really is no other sane option.
>>>>
>>>> That is, I really think we should start with something like the below
>>>> and then fix any fallout from that.
>>>
>>> The main problem is that need_resched becomes somewhat meaningless
>>> because it doesn't only mean "I need to resched a task" and we have to
>>> add more tests around even for those not using polling
>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 0935f9d4bb7b..cfa45338ae97 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -5799,7 +5800,7 @@ static inline struct task_struct *
>>>>  __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>>>  {
>>>>  	const struct sched_class *class;
>>>> -	struct task_struct *p;
>>>> +	struct task_struct *p = NULL;
>>>>
>>>>  	/*
>>>>  	 * Optimization: we know that if all tasks are in the fair class we can
>>>> @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>>>  	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
>>>>  		   rq->nr_running == rq->cfs.h_nr_running)) {
>>>>
>>>> -		p = pick_next_task_fair(rq, prev, rf);
>>>> -		if (unlikely(p == RETRY_TASK))
>>>> -			goto restart;
>>>> +		if (rq->nr_running) {
>>>
>>> How do you make the diff between a spurious need_resched() because of
>>> polling and a cpu becoming idle? Isn't rq->nr_running null in both
>>> cases? In the latter case, we need to call sched_balance_newidle() but
>>> not in the former
>>
>> Not sure if I understand correctly, if the goal of
>> smp_call_function_single() is to kick the idle CPU and not force it to
>> launch schedule()->sched_balance_newidle(), can we set the
>> _TIF_POLLING_NRFLAG rather than _TIF_NEED_RESCHED in
>> set_nr_if_polling()? I think writing any value to the monitor address
>> would wake up the idle CPU. And _TIF_POLLING_NRFLAG will be cleared
>> once that idle CPU exits the idle loop, so we don't introduce an
>> arch-wide flag.

Although this might work for MWAIT, there is no way for the generic
idle path to know if there is a pending interrupt within a
TIF_POLLING_NRFLAG section.
do_idle() sets TIF_POLLING_NRFLAG and relies on a bunch of
need_resched() checks along the way to bail early, until finally doing
a current_clr_polling_and_test() before handing off to the cpuidle
driver in call_cpuidle(). I believe this section will necessarily need
the sender to indicate a pending interrupt via the TIF_NEED_RESCHED
flag to enable the early bail out before going into the cpuidle driver,
since this case cannot be considered the same as a break from MWAIT.

> I see, this is a good point. So you mean with only TIF_POLLING_NRFLAG
> there is a possibility that the 'IPI kick CPU out of idle' is lost
> after the CPU enters do_idle() and before finally entering the idle
> state, while setting _TIF_NEED_RESCHED could help the do_idle() loop
> detect the pending request more easily.

Yup, that is correct.

> BTW, before the commit b2a02fc43a1f ("smp: Optimize
> send_call_function_single_ipi()"), the loss of the IPI after entering
> do_idle() and before entering the driver idle state was also possible,
> right (since the local irq is disabled)?

From what I understand, the IPI remains pending until interrupts are
enabled again. Before the optimization, interrupts would be disabled
all the way until the instruction that puts the CPU to sleep, which is
what __sti_mwait() and native_safe_halt() do. The CPU would have
received the IPI then and broken out of idle before Peter's
optimization went in. There is an elaborate comment on this in the
do_idle() function above the call to local_irq_disable().

In commit edc8fc01f608 ("x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer
reprogram") Peter describes a case of actually missing the break from
an interrupt as the driver enabled interrupts much earlier than
executing the sleep instruction.

Since the CPU was in TIF_POLLING_NRFLAG state, one could simply get
away by setting TIF_NEED_RESCHED and not sending an actual IPI, which
the need_resched() checks in the idle path would catch, and the
flush_smp_call_function_queue() on
Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag
Hello Chenyu,

On 6/14/2024 10:01 PM, Chen Yu wrote:
> On 2024-06-14 at 12:48:37 +0200, Vincent Guittot wrote:
>> On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra wrote:
>>> On Thu, Jun 13, 2024 at 06:15:59PM +, K Prateek Nayak wrote:
>>>> Effects of call_function_single_prep_ipi()
>>>> ==========================================
>>>>
>>>> To pull a TIF_POLLING thread out of idle to process an IPI, the
>>>> sender sets the TIF_NEED_RESCHED bit in the idle task's thread info
>>>> in call_function_single_prep_ipi() and avoids sending an actual IPI
>>>> to the target. As a result, the scheduler expects a task to be
>>>> enqueued when exiting the idle path. This is not the case with
>>>> non-polling idle states where the idle CPU exits the non-polling
>>>> idle state to process the interrupt, and since need_resched()
>>>> returns false, soon goes back to idle again.
>>>>
>>>> When the TIF_NEED_RESCHED flag is set, do_idle() will call
>>>> schedule_idle(), a large part of which runs with local IRQ disabled.
>>>> In case of ipistorm, when measuring IPI throughput, this large IRQ
>>>> disabled section delays processing of IPIs. Further auditing revealed
>>>> that in absence of any runnable tasks, pick_next_task_fair(), which
>>>> is called from the pick_next_task() fast path, will always call
>>>> newidle_balance() in this scenario, further increasing the time spent
>>>> in the IRQ disabled section.
>>>>
>>>> Following is the crude visualization of the problem with relevant
>>>> functions expanded:
>>>> ------------------------------------------------------------------
>>>> CPU0                                   CPU1
>>>> ====                                   ====
>>>>                                        do_idle() {
>>>>                                            __current_set_polling();
>>>>                                            ...
>>>>                                            monitor(addr);
>>>>                                            if (!need_resched())
>>>>                                                mwait() {
>>>>                                                /* Waiting */
>>>> smp_call_function_single(CPU1, func, wait = 1) {   ...
>>>>     ...                                            ...
>>>>     set_nr_if_polling(CPU1) {                      ...
>>>>         /* Realizes CPU1 is polling */             ...
>>>>         try_cmpxchg(addr,                          ...
>>>>                     &val,                          ...
>>>>                     val | _TIF_NEED_RESCHED);      ...
>>>>     } /* Does not send an IPI */                   ...
>>>>     ...                          } /* mwait exit due to write at addr */
>>>>     csd_lock_wait() {            }
>>>>     /* Waiting */                preempt_set_need_resched();
>>>>     ...                          __current_clr_polling();
>>>>     ...                          flush_smp_call_function_queue() {
>>>>     ...                              func();
>>>>     } /* End of wait */          }
>>>> }                                schedule_idle() {
>>>>                                      ...
>>>>                                      local_irq_disable();
>>>> smp_call_function_single(CPU1, func, wait = 1) {   ...
>>>>     ...                                            ...
>>>>     arch_send_call_function_single_ipi(CPU1);      ...
>>>>     ...                     \                      ...
>>>>     ...                      \                 newidle_balance() {
>>>>     ...                       \                    ... /* Delay */
>>>>     ...                        \               }
>>>>     ...                         \--> local_irq_enable();
>>>>                                      /* Processes the IPI */
>>>> ------------------------------------------------------------------
>>>>
>>>> Skipping newidle_balance()
>>>> ==========================
>>>>
>>>> In an earlier attempt to solve the challenge of the long IRQ
>>>> disabled section, newidle_balance() was skipped when a CPU waking up
>>>> from idle was found to have no runnable tasks, and was transitioning
>>>> back to idle [2]. Tim [3] and David [4] had pointed out that
>>>> newidle_balance() may be viable for CPUs that are idling with tick
>>>> enabled, where the newidle_balance() has the opportunity to pull
>>>> tasks onto the idle CPU.

I don't think we should be rely
Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag
Hello Vincent, Peter,

On 6/16/2024 8:27 PM, Vincent Guittot wrote:
> On Sat, 15 Jun 2024 at 03:28, Peter Zijlstra wrote:
>> On Fri, Jun 14, 2024 at 12:48:37PM +0200, Vincent Guittot wrote:
>>> On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra wrote:
>>>>> Vincent [5] pointed out a case where the idle load kick will fail
>>>>> to run on an idle CPU since the IPI handler launching the ILB will
>>>>> check for need_resched(). In such cases, the idle CPU relies on
>>>>> newidle_balance() to pull tasks towards itself.
>>>>
>>>> Is this the need_resched() in _nohz_idle_balance()? Should we change
>>>> this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
>>>> something along those lines?
>>>
>>> It's not only this but also in do_idle() as well which exits the loop
>>> to look for tasks to schedule
>>
>> Is that really a problem? Reading the initial email the problem seems
>> to be newidle balance, not hitting schedule. Schedule should be fairly
>> quick if there's nothing to do, no?
>
> There are 2 problems:
>
> - Because of NEED_RESCHED being set, we go through the full schedule
>   path for no reason and we finally do a sched_balance_newidle()

Peter's patch up in the thread seems to improve the above case by
speeding up the schedule() loop, similar to the very first solution I
tried with
https://lore.kernel.org/lkml/20240119084548.2788-1-kprateek.na...@amd.com/

I see the same level of improvement (if not better) with Peter's
SM_IDLE solution:

  ==================================================================
  Test            : ipistorm (modified)
  Units           : Normalized runtime
  Interpretation  : Lower is better
  Statistic       : AMean
  ==================================================================
  kernel:                              time [pct imp]
  tip:sched/core                       1.00 [baseline]
  tip:sched/core + revert              0.40 [60.26%]
  tip:sched/core + TIF_NOTIFY_IPI      0.46 [54.88%]
  tip:sched/core + SM_IDLE             0.38 [72.64%]

> - Because of need_resched being set to wake up the cpu, we will not
>   kick the softirq to run the nohz idle load balance and get a chance
>   to pull a task on an idle CPU

However, this issue with need_resched() still remains.
Any need_resched() check within an interrupt context will return true
if the target CPU is perceived to be in a polling idle state by the
sender, as a result of the optimization in commit b2a02fc43a1f ("smp:
Optimize send_call_function_single_ipi()").

If TIF_POLLING_NRFLAG is defined by an arch, do_idle() will set the
flag until the path hits call_cpuidle(), where the flag is cleared just
before handing off the idle state entry to the cpuidle driver. An
incoming interrupt in this window will allow the idle path to bail
early and return before calling the driver specific routine, since it
will be indicated by TIF_NEED_RESCHED being set in the idle task's
thread info. Beyond that point, the cpuidle driver handles the idle
entry. I think an arch may define TIF_POLLING_NRFLAG just to utilize
this optimization in the generic idle path, to answer Vincent's
observation on ARM32 having TIF_POLLING_NRFLAG.

>> I mean, it's fairly trivial to figure out if there really is going to
>> be work there.
>>
>>>> Using an alternate flag instead of NEED_RESCHED to indicate a
>>>> pending IPI was suggested as the correct approach to solve this
>>>> problem on the same thread.
>>
>> So adding per-arch changes for this seems like something we shouldn't
>> unless there really is no other sane option. That is, I really think
>> we should start with something like the below and then fix any fallout
>> from that.
>
> The main problem is that need_resched becomes somewhat meaningless
> because it doesn't only mean "I need to resched a task" and we have to
> add more tests around even for those not using polling

>> True, however we already had some of that by having the wakeup list,
>> that made nr_running less 'reliable'.
>>
>> The thing is, most architectures seem to have the TIF_POLLING_NRFLAG
>> bit, even if their main idle routine isn't actually using it, much of
>
> Yes, I'm surprised that Arm arch has the TIF_POLLING_NRFLAG whereas it
> has never been supported by the arch
>
>> the idle loop until it hits the arch idle will be having it set and
>> will thus tickle these cases *sometimes*.

[..snip..]

--
Thanks and Regards,
Prateek
Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag
Hello Russell,

On 6/15/2024 7:56 PM, Russell King (Oracle) wrote:
> On Thu, Jun 13, 2024 at 06:15:59PM +, K Prateek Nayak wrote:
>> o Dropping the ARM results since I never got my hands on the ARM64
>>   system I used in my last testing. If I do manage to get my hands on
>>   it again, I'll rerun the experiments and share the results on the
>>   thread. To test the case where TIF_NOTIFY_IPI is not enabled for a
>>   particular architecture, I applied the series only until Patch 3 and
>>   tested the same on my x86 machine with a WARN_ON_ONCE() in do_idle()
>>   to check if tif_notify_ipi() ever returns true, and then repeated
>>   the same with Patch 4 applied.
>
> Confused. ARM (32-bit) or ARM64? You patch 32-bit ARM, but you don't
> touch 64-bit Arm. "ARM" on its own in the context above to me suggests
> 32-bit, since you refer to ARM64 later.

In my first RFC posting, I had shared the results for ipistorm on an
ARM64 server [1]. Vincent and Linus Walleij brought to my attention
that ARM32 and ARM64 do not share the thread info flags and I probably
saw a one-off behavior during my testing. Since then, it has been
slightly challenging to get my hands on that machine again in a stable
condition to see if there was any scenario that I might have missed,
but I tried a bunch of things on my x86 machine to confirm that an arch
that does not define TIF_NOTIFY_IPI would not hit these changes.

Rest assured, Patch 5 is for ARM32 machines that currently define
TIF_POLLING_NRFLAG.

[1] https://lore.kernel.org/lkml/20240220171457.703-6-kprateek.na...@amd.com/

--
Thanks and Regards,
Prateek
[PATCH v2 03/14] sched/core: Use TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING mode of pending IPI
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: x...@kernel.org
Signed-off-by: Gautham R. Shenoy
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
---
v1..v2:

o Updated benchmark numbers.
---
 include/linux/sched/idle.h |  8
 kernel/sched/core.c        | 41 ++
 kernel/sched/idle.c        | 16 +++
 3 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 497518b84e8d..4757a6ab5c2c 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -58,8 +58,8 @@ static __always_inline bool __must_check current_set_polling_and_test(void)
 	__current_set_polling();
 
 	/*
-	 * Polling state must be visible before we test NEED_RESCHED,
-	 * paired by resched_curr()
+	 * Polling state must be visible before we test NEED_RESCHED or
+	 * NOTIFY_IPI paired by resched_curr() or notify_ipi_if_polling()
 	 */
 	smp_mb__after_atomic();
 
@@ -71,8 +71,8 @@ static __always_inline bool __must_check current_clr_polling_and_test(void)
 	__current_clr_polling();
 
 	/*
-	 * Polling state must be visible before we test NEED_RESCHED,
-	 * paired by resched_curr()
+	 * Polling state must be visible before we test NEED_RESCHED or
+	 * NOTIFY_IPI paired by resched_curr() or notify_ipi_if_polling()
 	 */
 	smp_mb__after_atomic();
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..bb01b063320b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -911,12 +911,30 @@ static inline bool set_nr_and_not_polling(struct task_struct *p)
 }
 
 /*
- * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
+ * Certain architectures that support TIF_POLLING_NRFLAG may not support
+ * TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING mode of a pending
+ * IPI. On such architectures, set TIF_NEED_RESCHED instead to wake the
+ * idle CPU and process the pending IPI.
+ */
+#ifdef _TIF_NOTIFY_IPI
+#define _TIF_WAKE_FLAG _TIF_NOTIFY_IPI
+#else
+#define _TIF_WAKE_FLAG _TIF_NEED_RESCHED
+#endif
+
+/*
+ * Atomically set TIF_WAKE_FLAG when TIF_POLLING_NRFLAG is set.
+ *
+ * On architectures that define TIF_NOTIFY_IPI, the same is set in the
+ * idle task's thread_info to pull the CPU out of idle and process
+ * the pending interrupt. On architectures that don't support
+ * TIF_NOTIFY_IPI, TIF_NEED_RESCHED is set instead to notify the
+ * pending IPI.
  *
- * If this returns true, then the idle task promises to call
- * sched_ttwu_pending() and reschedule soon.
+ * If this returns true, then the idle task promises to process the
+ * call function soon.
  */
-static bool set_nr_if_polling(struct task_struct *p)
+static bool notify_ipi_if_polling(struct task_struct *p)
 {
 	struct thread_info *ti = task_thread_info(p);
 	typeof(ti->flags) val = READ_ONCE(ti->flags);
@@ -924,9 +942,16 @@ static bool set_nr_if_polling(struct task_struct *p)
 	do {
 		if (!(val & _TIF_POLLING_NRFLAG))
 			return false;
-		if (val & _TIF_NEED_RESCHED)
+		/*
+		 * If TIF_NEED_RESCHED flag is set in addition to
+		 * TIF_POLLING_NRFLAG, the CPU will soon fall out of
+		 * idle. Since flush_smp_call_function_queue() is called
+		 * soon after the idle exit, setting TIF_WAKE_FLAG is
+		 * not necessary.
+		 */
+		if (val & (_TIF_NEED_RESCHED | _TIF_WAKE_FLAG))
 			return true;
-	} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));
+	} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_WAKE_FLAG));
 
 	return true;
 }
@@ -939,7 +964,7 @@ static inline bool set_nr_and_not_polling(struct task_struct *p)
 }
 
 #ifdef CONFIG_SMP
-static inline bool set_nr_if_polling(struct task_struct *p)
+static inline bool notify_ipi_if_polling(struct task_struct *p)
 {
 	return false;
 }
@@ -3710,7 +3735,7 @@ void sched_ttwu_pending(void *arg)
  */
 bool call_function_single_prep_ipi(int cpu)
 {
-	if (set_nr_if_polling(cpu_rq(cpu)->idle)) {
+	if (notify_ipi_if_polling(cpu_rq(cpu)->idle)) {
 		trace_sched_wake_idle_without_ipi(cpu);
 		return false;
 	}
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7de94df5d477..6748735156a7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -329,13 +329,13 @@ static void do_idle(void)
 	}
 
 	/*
-	 * Since we fell out of the loop above, we know TIF_NEED_RESCHED must
-	 * be set, propagate it into PREEMPT_NEED_RESCHED.
+	 * Since we fell out of the loop above, TIF_NEED_RESCHED may be set.
+	 * Propagate it into PREEMPT_NEED_RESCHED.
[PATCH v2 02/14] sched: Define a need_resched_or_ipi() helper and use it treewide
From: "Gautham R. Shenoy"

Currently TIF_NEED_RESCHED is being overloaded to wake up an idle CPU
in TIF_POLLING mode to service an IPI, even if there are no new tasks
being woken up on the said CPU.

In preparation of a proper fix, introduce a new helper
"need_resched_or_ipi()" which is intended to return true if either the
TIF_NEED_RESCHED flag or the TIF_NOTIFY_IPI flag is set. Use this
helper function in place of need_resched() in idle loops where
TIF_POLLING_NRFLAG is set.

To preserve bisectability and avoid unbreakable idle loops, all the
need_resched() checks within TIF_POLLING_NRFLAG sections have been
replaced tree-wide with the need_resched_or_ipi() check.

[ prateek: Replaced some of the missed out occurrences of
  need_resched() within a TIF_POLLING section with
  need_resched_or_ipi() ]

Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: Russell King
Cc: Guo Ren
Cc: Michal Simek
Cc: Dinh Nguyen
Cc: Jonas Bonn
Cc: Stefan Kristiansson
Cc: Stafford Horne
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: "Naveen N. Rao"
Cc: Yoshinori Sato
Cc: Rich Felker
Cc: John Paul Adrian Glaubitz
Cc: "David S. Miller"
Cc: Andreas Larsson
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: "H. Peter Anvin"
Cc: "Rafael J. Wysocki"
Cc: Daniel Lezcano
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Valentin Schneider
Cc: Andrew Donnellan
Cc: Benjamin Gray
Cc: Frederic Weisbecker
Cc: Xin Li
Cc: Kees Cook
Cc: Rick Edgecombe
Cc: Tony Battersby
Cc: Bjorn Helgaas
Cc: Brian Gerst
Cc: Leonardo Bras
Cc: Imran Khan
Cc: "Paul E. McKenney"
Cc: Rik van Riel
Cc: Tim Chen
Cc: David Vernet
Cc: Julia Lawall
Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-openr...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: x...@kernel.org
Signed-off-by: Gautham R. Shenoy
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
---
v1..v2:

o Fixed a conflict with commit edc8fc01f608 ("x86: Fix
  CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram") that touched
  mwait_idle_with_hints() in arch/x86/include/asm/mwait.h
---
 arch/x86/include/asm/mwait.h      | 2 +-
 arch/x86/kernel/process.c         | 2 +-
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 drivers/cpuidle/cpuidle-pseries.c | 2 +-
 drivers/cpuidle/poll_state.c      | 2 +-
 include/linux/sched.h             | 5 +
 include/linux/sched/idle.h        | 4 ++--
 kernel/sched/idle.c               | 7 ---
 8 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 920426d691ce..3fa6f0bbd74f 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -125,7 +125,7 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 
-		if (!need_resched()) {
+		if (!need_resched_or_ipi()) {
 			if (ecx & 1) {
 				__mwait(eax, ecx);
 			} else {
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b8441147eb5e..dd73cd6f735c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -901,7 +901,7 @@ static __cpuidle void mwait_idle(void)
 		}
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		if (!need_resched()) {
+		if (!need_resched_or_ipi()) {
 			__sti_mwait(0, 0);
 			raw_local_irq_disable();
 		}
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 9ebedd972df0..77c3bb371f56 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -79,7 +79,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 	dev->poll_time_limit = false;
 	ppc64_runlatch_off();
 	HMT_very_low();
-	while (!need_resched()) {
+	while (!need_resched_or_ipi()) {
 		if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) {
 			/*
 			 * Task has not woken up but we are exiting the polling
diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 14db9b7d985d..4f2b490f8b73 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -46,7 +46,7 @@ int snooze_loop(struct cpuidle_device *dev, struc
[PATCH v2 01/14] thread_info: Add helpers to test and clear TIF_NOTIFY_IPI
From: "Gautham R. Shenoy"

Introduce the notion of the TIF_NOTIFY_IPI flag. When a processor in
TIF_POLLING mode needs to process an IPI, the sender sets the
NEED_RESCHED bit in the idle task's thread_info to pull the target out
of idle and avoids sending an interrupt to the idle CPU. When
NEED_RESCHED is set, the scheduler assumes that a new task has been
queued on the idle CPU and calls schedule_idle(); however, an IPI on an
idle CPU will not necessarily end up waking a task on the said CPU.

To avoid spurious calls to schedule_idle() from assuming that an IPI on
an idle CPU will always wake a task on the said CPU, TIF_NOTIFY_IPI
will be used to pull a TIF_POLLING CPU out of idle. Since the IPI
handlers are processed before the call to schedule_idle(),
schedule_idle() will be called only if one of the handlers has woken up
a new task on the CPU and has set NEED_RESCHED.

Add tif_notify_ipi() and current_clr_notify_ipi() helpers to test if
TIF_NOTIFY_IPI is set in the current task's thread_info, and to clear
it respectively. These interfaces will be used in subsequent patches as
the TIF_NOTIFY_IPI notion is integrated in the scheduler and in the
idle path.

[ prateek: Split the changes into a separate patch, add commit log ]

Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: Russell King
Cc: Guo Ren
Cc: Michal Simek
Cc: Dinh Nguyen
Cc: Jonas Bonn
Cc: Stefan Kristiansson
Cc: Stafford Horne
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: "Naveen N. Rao"
Cc: Yoshinori Sato
Cc: Rich Felker
Cc: John Paul Adrian Glaubitz
Cc: "David S. Miller"
Cc: Andreas Larsson
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: "H. Peter Anvin"
Cc: "Rafael J. Wysocki"
Cc: Daniel Lezcano
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Valentin Schneider
Cc: Andrew Donnellan
Cc: Benjamin Gray
Cc: Frederic Weisbecker
Cc: Xin Li
Cc: Kees Cook
Cc: Rick Edgecombe
Cc: Tony Battersby
Cc: Bjorn Helgaas
Cc: Brian Gerst
Cc: Leonardo Bras
Cc: Imran Khan
Cc: "Paul E. McKenney"
Cc: Rik van Riel
Cc: Tim Chen
Cc: David Vernet
Cc: Julia Lawall
Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-openr...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: x...@kernel.org
Signed-off-by: Gautham R. Shenoy
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
---
v1..v2:

o No changes.
---
 include/linux/thread_info.h | 43 +
 1 file changed, 43 insertions(+)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f4..1e10dd8c0227 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -195,6 +195,49 @@ static __always_inline bool tif_need_resched(void)
 
 #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
 
+#ifdef TIF_NOTIFY_IPI
+
+#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
+
+static __always_inline bool tif_notify_ipi(void)
+{
+	return arch_test_bit(TIF_NOTIFY_IPI,
+			     (unsigned long *)(&current_thread_info()->flags));
+}
+
+static __always_inline void current_clr_notify_ipi(void)
+{
+	arch_clear_bit(TIF_NOTIFY_IPI,
+		       (unsigned long *)(&current_thread_info()->flags));
+}
+
+#else
+
+static __always_inline bool tif_notify_ipi(void)
+{
+	return test_bit(TIF_NOTIFY_IPI,
+			(unsigned long *)(&current_thread_info()->flags));
+}
+
+static __always_inline void current_clr_notify_ipi(void)
+{
+	clear_bit(TIF_NOTIFY_IPI,
+		  (unsigned long *)(&current_thread_info()->flags));
+}
+
+#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
+
+#else /* !TIF_NOTIFY_IPI */
+
+static __always_inline bool tif_notify_ipi(void)
+{
+	return false;
+}
+
+static __always_inline void current_clr_notify_ipi(void) { }
+
+#endif /* TIF_NOTIFY_IPI */
+
 #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
 static inline int arch_within_stack_frames(const void * const stack,
 					   const void * const stackend,
-- 
2.34.1
[PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag
    ...
    set_nr_if_polling(CPU1) {                      ...
        /* Realizes CPU1 is polling */             ...
        try_cmpxchg(addr,                          ...
                    &val,                          ...
                    val | _TIF_NOTIFY_IPI);        ...
    } /* Does not send an IPI */                   ...
    ...                          } /* mwait exit due to write at addr */
    csd_lock_wait() {            ...
    /* Waiting */                preempt_fold_need_resched(); /* fold if NEED_RESCHED */
    ...                          __current_clr_polling();
    ...                          flush_smp_call_function_queue() {
    ...                              func(); /* Will set NEED_RESCHED if sched_ttwu_pending() */
    } /* End of wait */          }
}                                if (need_resched()) {
                                     schedule_idle();
smp_call_function_single(CPU1, func, wait = 1) {   }
    ...                          ... /* IRQs remain enabled */
    arch_send_call_function_single_ipi(CPU1); ---> /* Processes the IPI */
--------------------------------------------------------------------------

Results
=======

With TIF_NOTIFY_IPI, the time taken to complete a fixed set of IPIs
using ipistorm improves drastically and is much closer to the numbers
seen with the revert. Following are the numbers from the same dual
socket 3rd Generation EPYC system (2 x 64C/128T) (boost on, C2
disabled) running ipistorm between CPU8 and CPU16:

cmdline: insmod ipistorm.ko numipi=10 single=1 offset=8 cpulist=8 wait=1

  ==================================================================
  Test            : ipistorm (modified)
  Units           : Normalized runtime
  Interpretation  : Lower is better
  Statistic       : AMean
  ==================================================================
  kernel:                           time [pct imp]
  tip:sched/core                    1.00 [baseline]
  tip:sched/core + revert           0.40 [60.26]
  tip:sched/core + TIF_NOTIFY_IPI   0.46 [54.88]

netperf and tbench results with the patch match the results on tip on
the dual socket 3rd Generation AMD system (2 x 64C/128T). Additionally,
hackbench, stream, and schbench too were tested, with results from the
patched kernel matching that of the tip.

Additional benefits
===================

In nohz_csd_func(), the need_resched() check returns true when an idle
CPU in TIF_POLLING mode is woken up to do an idle load balance, which
leads to the idle load balance bailing out early today since
send_call_function_single_ipi() ends up setting the TIF_NEED_RESCHED
flag to put the CPU out of idle, and the flag is not cleared until
__schedule() is called much later in the call path.

With TIF_NOTIFY_IPI, this is no longer the case since TIF_NEED_RESCHED
is only set if there is a genuine need to call schedule() and not used
in an overloaded manner to notify a pending IPI.

Links
=====

[1] https://github.com/antonblanchard/ipistorm
[2] https://lore.kernel.org/lkml/20240119084548.2788-1-kprateek.na...@amd.com/
[3] https://lore.kernel.org/lkml/b4f5ac150685456cf45a342e3bb1f28cdd557a53.ca...@linux.intel.com/
[4] https://lore.kernel.org/lkml/20240123211756.GA221793@maniforge/
[5] https://lore.kernel.org/lkml/cakftptc446lo9catpp7pexdklhhqfobuy-jmgc7agohy4hs...@mail.gmail.com/

This series is based on tip:sched/core at commit c793a62823d1
("sched/core: Drop spinlocks on contention iff kernel is preemptible")

--
Gautham R. Shenoy (4):
  thread_info: Add helpers to test and clear TIF_NOTIFY_IPI
  sched: Define a need_resched_or_ipi() helper and use it treewide
  sched/core: Use TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING
    mode of pending IPI
  x86/thread_info: Introduce TIF_NOTIFY_IPI flag

K Prateek Nayak (10):
  arm/thread_info: Introduce TIF_NOTIFY_IPI flag
  alpha/thread_info: Introduce TIF_NOTIFY_IPI flag
  openrisc/thread_info: Introduce TIF_NOTIFY_IPI flag
  powerpc/thread_info: Introduce TIF_NOTIFY_IPI flag
  sh/thread_info: Introduce TIF_NOTIFY_IPI flag
  sparc/thread_info: Introduce TIF_NOTIFY_IPI flag
  csky/thread_info: Introduce TIF_NOTIFY_IPI flag
  parisc/thread_info: Introduce TIF_NOTIFY_IPI flag
  nios2/thread_info: Introduce TIF_NOTIFY_IPI flag
  microblaze/thread_info: Introduce TIF_NOTIFY_IPI flag

--
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: Russell King
Cc: Guo Ren
Cc: Michal Simek
Cc: Dinh Nguyen
Cc: Jonas Bonn
Cc: Stefan Kristiansson
Cc: S
[PATCH v2 08/14] powerpc/thread_info: Introduce TIF_NOTIFY_IPI flag
Add support for TIF_NOTIFY_IPI on PowerPC. With TIF_NOTIFY_IPI, a
sender sending an IPI to an idle CPU in TIF_POLLING mode will set the
TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the
CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously.
This avoids spurious calls to schedule_idle() in cases where an IPI
does not necessarily wake up a task on the idle CPU.

Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Christophe Leroy
Cc: "Naveen N. Rao"
Cc: Benjamin Gray
Cc: Andrew Donnellan
Cc: "Rafael J. Wysocki"
Cc: Daniel Lezcano
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Valentin Schneider
Cc: K Prateek Nayak
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: K Prateek Nayak
---
v1..v2:

o No changes.
---
 arch/powerpc/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 15c5691dd218..9545e164463b 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -103,6 +103,7 @@ void arch_setup_new_exec(void);
 #define TIF_PATCH_PENDING	6	/* pending live patching update */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SINGLESTEP		8	/* singlestepping active */
+#define TIF_NOTIFY_IPI		9	/* Pending IPI on TIF_POLLING idle CPU */
 #define TIF_SECCOMP		10	/* secure computing */
 #define TIF_RESTOREALL		11	/* Restore all regs (implies NOERROR) */
 #define TIF_NOERROR		12	/* Force successful syscall return */
@@ -129,6 +130,7 @@ void arch_setup_new_exec(void);
 #define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
Re: [RFC PATCH 00/14] Introducing TIF_NOTIFY_IPI flag
Hello Vincent, Thank you for taking a look at the series. On 3/6/2024 3:29 PM, Vincent Guittot wrote: > Hi Prateek, > > Adding Julia who could be interested in this patchset. Your patchset > should trigger idle load balance instead of newly idle load balance > now when the polling is used. This was one reason for not migrating > task in idle CPU Thank you. > > On Tue, 20 Feb 2024 at 18:15, K Prateek Nayak wrote: >> >> Hello everyone, >> >> [..snip..] >> >> >> Skipping newidle_balance() >> == >> >> In an earlier attempt to solve the challenge of the long IRQ disabled >> section, newidle_balance() was skipped when a CPU waking up from idle >> was found to have no runnable tasks, and was transitioning back to >> idle [2]. Tim [3] and David [4] had pointed out that newidle_balance() >> may be viable for CPUs that are idling with tick enabled, where the >> newidle_balance() has the opportunity to pull tasks onto the idle CPU. >> >> Vincent [5] pointed out a case where the idle load kick will fail to >> run on an idle CPU since the IPI handler launching the ILB will check >> for need_resched(). In such cases, the idle CPU relies on >> newidle_balance() to pull tasks towards itself. > > Calling newidle_balance() instead of the normal idle load balance > prevents the CPU to pull tasks from other groups Thank you for the correction. > >> >> Using an alternate flag instead of NEED_RESCHED to indicate a pending >> IPI was suggested as the correct approach to solve this problem on the >> same thread. >> >> >> Proposed solution: TIF_NOTIFY_IPI >> = >> >> Instead of reusing TIF_NEED_RESCHED bit to pull an TIF_POLLING CPU out >> of idle, TIF_NOTIFY_IPI is a newly introduced flag that >> call_function_single_prep_ipi() sets on a target TIF_POLLING CPU to >> indicate a pending IPI, which the idle CPU promises to process soon. 
>>
>> On architectures that do not support the TIF_NOTIFY_IPI flag (this
>> series only adds support for x86 and ARM processors for now),
>
> I'm surprised that you are mentioning ARM processors because they
> don't use TIF_POLLING.

Yup, I just realised that after Linus Walleij pointed it out on the thread.

>
>> call_function_single_prep_ipi() will fallback to setting
>> TIF_NEED_RESCHED bit to pull the TIF_POLLING CPU out of idle.
>>
>> Since the pending IPI handlers are processed before the call to
>> schedule_idle() in do_idle(), schedule_idle() will only be called if the
>> IPI handlers have woken / migrated a new task on the idle CPU and have
>> set the TIF_NEED_RESCHED bit to indicate the same. This avoids running
>> into the long IRQ disabled section in schedule_idle() unnecessarily, and
>> any need_resched() check within a call function will accurately notify
>> if a task is waiting for CPU time on the CPU handling the IPI.
>>
>> Following is a crude visualization of how the situation changes with
>> the newly introduced TIF_NOTIFY_IPI flag:
>> --
>> CPU0                                              CPU1
>> ====                                              ====
>>                                                   do_idle() {
>>                                                       __current_set_polling();
>>                                                       ...
>>                                                       monitor(addr);
>>                                                       if (!need_resched_or_ipi())
>>                                                           mwait() {
>>                                                           /* Waiting */
>> smp_call_function_single(CPU1, func, wait = 1) {          ...
>>     ...                                                   ...
>>     set_nr_if_polling(CPU1) {                             ...
>>         /* Realizes CPU1 is polling */                    ...
>>         try_cmpxchg(addr,                                 ...
>>                     &val,                                 ...
>>                     val | _TIF_NOTIFY_IPI);               ...
>>     } /* Does not send an IPI */                          ...
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
(+ Xuewen Yan, Ke Wang)

Hello Tobias,

On 2/28/2024 9:40 PM, Tobias Huschle wrote:
> The previously used CFS scheduler gave tasks that were woken up an
> enhanced chance to see runtime immediately by deducting a certain value
> from its vruntime on runqueue placement during wakeup.
>
> This property was used by some, at least vhost, to ensure that certain
> kworkers are scheduled immediately after being woken up. The EEVDF
> scheduler does not support this so far. Instead, if such a woken up
> entity carries a negative lag from its previous execution, it will have
> to wait for the current time slice to finish, which affects the
> performance of the process expecting the immediate execution negatively.
>
> To address this issue, implement EEVDF strategy #2 for rejoining
> entities, which dismisses the lag from previous execution and allows
> the woken up task to run immediately (if no other entities are deemed
> to be preferred for scheduling by EEVDF).
>
> The vruntime is decremented by an additional value of 1 to make sure
> that the woken up task gets to actually run. This is of course not
> following strategy #2 in an exact manner but guarantees the expected
> behavior for the scenario described above. Without the additional
> decrement, the performance goes south even more. So there are some
> side effects I could not get my head around yet.
>
> Questions:
> 1. The kworker getting its negative lag occurs in the following scenario
>    - kworker and a cgroup are supposed to execute on the same CPU
>    - one task within the cgroup is executing and wakes up the kworker
>    - kworker with 0 lag, gets picked immediately and finishes its
>      execution within ~5000ns
>    - on dequeue, kworker gets assigned a negative lag
>    Is this expected behavior? With this short execution time, I would
>    expect the kworker to be fine.
>    For a more detailed discussion on this symptom, please see:
>    https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/

Does the lag clamping path from Xuewen Yan [1] work for the vhost case
mentioned in the thread? Instead of placing the task just behind the
0-lag point, clamping the lag seems to be a more principled approach
since EEVDF already does it in update_entity_lag().

If the lag is still too large, maybe the above coupled with Peter's
delayed dequeue patch can help [2] (Note: tree is prone to force updates)

[1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen@unisoc.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e62ef63a888c97188a977daddb72b61548da8417

> 2. The proposed code change of course only addresses the symptom. Am I
>    assuming correctly that this is in general the expected behavior and
>    that the task waking up the kworker should rather do an explicit
>    reschedule of itself to grant the kworker time to execute?
>    In the vhost case, this is currently attempted through a cond_resched
>    which is not doing anything because the need_resched flag is not set.
>
> Feedback and opinions would be highly appreciated.
>
> Signed-off-by: Tobias Huschle
> ---
>  kernel/sched/fair.c     | 5 +
>  kernel/sched/features.h | 1 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..c20ae6d62961 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		lag = div_s64(lag, load);
>  	}
>
> +	if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
> +		se->vlag = 0;
> +		lag = 1;
> +	}
> +
>  	se->vruntime = vruntime - lag;
>
>  	/*
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..d3118e7568b4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
>  SCHED_FEAT(PLACE_LAG, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>  SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(NOLAG_WAKEUP, true)
>
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed

--
Thanks and Regards,
Prateek
[RFC PATCH 08/14] powerpc/thread_info: Introduce TIF_NOTIFY_IPI flag
Add support for TIF_NOTIFY_IPI on PowerPC. With TIF_NOTIFY_IPI, a sender
sending an IPI to an idle CPU in TIF_POLLING mode will set the
TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the
CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This
avoids spurious calls to schedule_idle() in cases where an IPI does not
necessarily wake up a task on the idle CPU.

Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: "Naveen N. Rao" Cc: "Rafael J. Wysocki" Cc: Daniel Lezcano Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider Cc: Andrew Donnellan Cc: K Prateek Nayak Cc: Nicholas Miehlbradt Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-ker...@vger.kernel.org Cc: linux...@vger.kernel.org
Signed-off-by: K Prateek Nayak
---
 arch/powerpc/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index bf5dde1a4114..b48db55192e0 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -103,6 +103,7 @@ void arch_setup_new_exec(void);
 #define TIF_PATCH_PENDING	6	/* pending live patching update */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SINGLESTEP		8	/* singlestepping active */
+#define TIF_NOTIFY_IPI		9	/* Pending IPI on TIF_POLLING idle CPU */
 #define TIF_SECCOMP		10	/* secure computing */
 #define TIF_RESTOREALL		11	/* Restore all regs (implies NOERROR) */
 #define TIF_NOERROR		12	/* Force successful syscall return */
@@ -129,6 +130,7 @@ void arch_setup_new_exec(void);
 #define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
[RFC PATCH 03/14] sched/core: Use TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING mode of pending IPI
Cc: "Aneesh Kumar K.V" Cc: "Naveen N. Rao" Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: "Rafael J. Wysocki" Cc: Daniel Lezcano Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider Cc: Al Viro Cc: Linus Walleij Cc: Ard Biesheuvel Cc: Andrew Donnellan Cc: Nicholas Miehlbradt Cc: Andrew Morton Cc: Arnd Bergmann Cc: Josh Poimboeuf Cc: "Kirill A. Shutemov" Cc: Rick Edgecombe Cc: Tony Battersby Cc: Brian Gerst Cc: Tim Chen Cc: David Vernet Cc: x...@kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: linux-openr...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@vger.kernel.org Signed-off-by: Gautham R.
Shenoy
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
---
 include/linux/sched/idle.h |  8
 kernel/sched/core.c        | 41 ++
 kernel/sched/idle.c        | 16 +++
 3 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index d739ab810e00..c22312087c30 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -58,8 +58,8 @@ static __always_inline bool __must_check current_set_polling_and_test(void)
 	__current_set_polling();

 	/*
-	 * Polling state must be visible before we test NEED_RESCHED,
-	 * paired by resched_curr()
+	 * Polling state must be visible before we test NEED_RESCHED or
+	 * NOTIFY_IPI paired by resched_curr() or notify_ipi_if_polling()
 	 */
 	smp_mb__after_atomic();

@@ -71,8 +71,8 @@ static __always_inline bool __must_check current_clr_polling_and_test(void)
 	__current_clr_polling();

 	/*
-	 * Polling state must be visible before we test NEED_RESCHED,
-	 * paired by resched_curr()
+	 * Polling state must be visible before we test NEED_RESCHED or
+	 * NOTIFY_IPI paired by resched_curr() or notify_ipi_if_polling()
 	 */
 	smp_mb__after_atomic();

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..6fb6e5b75724 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -909,12 +909,30 @@ static inline bool set_nr_and_not_polling(struct task_struct *p)
 }

 /*
- * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
+ * Certain architectures that support TIF_POLLING_NRFLAG may not support
+ * TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING mode of a pending
+ * IPI. On such architectures, set TIF_NEED_RESCHED instead to wake the
+ * idle CPU and process the pending IPI.
+ */
+#ifdef _TIF_NOTIFY_IPI
+#define _TIF_WAKE_FLAG _TIF_NOTIFY_IPI
+#else
+#define _TIF_WAKE_FLAG _TIF_NEED_RESCHED
+#endif
+
+/*
+ * Atomically set TIF_WAKE_FLAG when TIF_POLLING_NRFLAG is set.
+ *
+ * On architectures that define TIF_NOTIFY_IPI, the same is set in the
+ * idle task's thread_info to pull the CPU out of idle and process
+ * the pending interrupt. On architectures that don't support
+ * TIF_NOTIFY_IPI, TIF_NEED_RESCHED is set instead to notify the
+ * pending IPI.
  *
- * If this returns true, then the idle task promises to call
- * sched_ttwu_pending() and reschedule soon.
+ * If this returns true, then the idle task promises to process the
+ * call function soon.
  */
-static bool set_nr_if_polling(struct task_struct *p)
+static bool notify_ipi_if_polling(struct task_struct *p)
 {
 	struct thread_info *ti = task_thread_info(p);
 	typeof(ti->flags) val = READ_ONCE(ti->flags);
@@ -922,9 +940,16 @@ static bool set_nr_if_polling(struct task_struct *p)
 	do {
 		if (!(val & _TIF_POLLING_NRFLAG))
 			return false;
-		if (val & _TIF_NEED_RESCHED)
+		/*
+		 * If TIF_NEED_RESCHED flag is set in addition to
+		 * TIF_POLLING_NRFLAG, the CPU will soon fall out of
+		 * idle. Since flush_smp_call_function_queue() is called
+		 * soon after the idle exit, setting TIF_WAKE_FLAG is
+		 * not necessary.
+		 */
+		if (val & (_TIF_NEED_RESCHED | _TIF_WAKE_FLAG))
 			return true;
-	} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED));
+	} while (!try_cmpxchg(&ti->flags, &val, val | _TIF_WAKE_FLAG));

 	return true;
 }
@@ -937,7 +962,7 @@ static inline bool set_nr_and_not_polling(struct task_struct *p)
 }

 #ifdef CONFIG_SMP
-static inline bool
[RFC PATCH 02/14] sched: Define a need_resched_or_ipi() helper and use it treewide
From: "Gautham R. Shenoy"

Currently TIF_NEED_RESCHED is being overloaded to wake up an idle CPU in
TIF_POLLING mode to service an IPI even if there are no new tasks being
woken up on the said CPU. In preparation of a proper fix, introduce a new
helper "need_resched_or_ipi()" which is intended to return true if either
the TIF_NEED_RESCHED flag or the TIF_NOTIFY_IPI flag is set. Use this
helper function in place of need_resched() in idle loops where
TIF_POLLING_NRFLAG is set.

To preserve bisectability and avoid unbreakable idle loops, all the
need_resched() checks within TIF_POLLING_NRFLAG sections have been
replaced tree-wide with the need_resched_or_ipi() check.

[ prateek: Replaced some of the missed out occurrences of need_resched()
  within TIF_POLLING sections with need_resched_or_ipi() ]

Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Russell King Cc: Guo Ren Cc: Michal Simek Cc: Dinh Nguyen Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: "Naveen N. Rao" Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: "Rafael J. Wysocki" Cc: Daniel Lezcano Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider Cc: Al Viro Cc: Linus Walleij Cc: Ard Biesheuvel Cc: Andrew Donnellan Cc: Nicholas Miehlbradt Cc: Andrew Morton Cc: Arnd Bergmann Cc: Josh Poimboeuf Cc: "Kirill A. Shutemov" Cc: Rick Edgecombe Cc: Tony Battersby Cc: Brian Gerst Cc: Tim Chen Cc: David Vernet Cc: x...@kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: linux-openr...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@vger.kernel.org
Signed-off-by: Gautham R. Shenoy
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
---
 arch/x86/include/asm/mwait.h      | 2 +-
 arch/x86/kernel/process.c         | 2 +-
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 drivers/cpuidle/cpuidle-pseries.c | 2 +-
 drivers/cpuidle/poll_state.c      | 2 +-
 include/linux/sched.h             | 5 +
 include/linux/sched/idle.h        | 4 ++--
 kernel/sched/idle.c               | 7 ---
 8 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 778df05f8539..ac1370143407 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -115,7 +115,7 @@ static __always_inline void mwait_idle_with_hints(unsigned long eax, unsigned lo
 		}

 		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		if (!need_resched())
+		if (!need_resched_or_ipi())
 			__mwait(eax, ecx);
 	}
 	current_clr_polling();
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..ca6cb7e28cba 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -925,7 +925,7 @@ static __cpuidle void mwait_idle(void)
 		}

 		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		if (!need_resched()) {
+		if (!need_resched_or_ipi()) {
 			__sti_mwait(0, 0);
 			raw_local_irq_disable();
 		}
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 9ebedd972df0..77c3bb371f56 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -79,7 +79,7 @@ static int snooze_loop(struct cpuidle_device *dev,
 	dev->poll_time_limit = false;
 	ppc64_runlatch_off();
 	HMT_very_low();
-	while (!need_resched()) {
+	while (!need_resched_or_ipi()) {
 		if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) {
 			/*
 			 * Task has not woken up but we are exiting the polling
diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 14db9b7d985d..4f2b490f8b73 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -46,7 +46,7 @@ int snooze_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	snooze_exit_time = get_tb() + snooze_timeout;
 	dev->poll_time_limit = false;

-	while (!need_resched()) {
+	while (!need_resched_or_ipi()) {
 		HMT_low();
 		HMT
[RFC PATCH 01/14] thread_info: Add helpers to test and clear TIF_NOTIFY_IPI
From: "Gautham R. Shenoy"

Introduce the notion of a TIF_NOTIFY_IPI flag. Currently, when a processor
in TIF_POLLING mode needs to process an IPI, the sender sets the
NEED_RESCHED bit in the idle task's thread_info to pull the target out of
idle and avoids sending an interrupt to the idle CPU. When NEED_RESCHED is
set, the scheduler assumes that a new task has been queued on the idle CPU
and calls schedule_idle(); however, an IPI on an idle CPU will not
necessarily end up waking a task on the said CPU.

To avoid spurious calls to schedule_idle() from assuming that an IPI on an
idle CPU will always wake a task on the said CPU, TIF_NOTIFY_IPI will be
used to pull a TIF_POLLING CPU out of idle. Since the IPI handlers are
processed before the call to schedule_idle(), schedule_idle() will be
called only if one of the handlers has woken up a new task on the CPU and
has set NEED_RESCHED.

Add tif_notify_ipi() and current_clr_notify_ipi() helpers to test if
TIF_NOTIFY_IPI is set in the current task's thread_info, and to clear it
respectively. These interfaces will be used in subsequent patches as the
TIF_NOTIFY_IPI notion is integrated in the scheduler and in the idle path.

[ prateek: Split the changes into a separate patch, add commit log ]

Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Russell King Cc: Guo Ren Cc: Michal Simek Cc: Dinh Nguyen Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: "Naveen N. Rao" Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: "Rafael J. Wysocki" Cc: Daniel Lezcano Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider Cc: Al Viro Cc: Linus Walleij Cc: Ard Biesheuvel Cc: Andrew Donnellan Cc: Nicholas Miehlbradt Cc: Andrew Morton Cc: Arnd Bergmann Cc: Josh Poimboeuf Cc: "Kirill A. Shutemov" Cc: Rick Edgecombe Cc: Tony Battersby Cc: Brian Gerst Cc: Tim Chen Cc: David Vernet Cc: x...@kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: linux-openr...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@vger.kernel.org
Signed-off-by: Gautham R. Shenoy
Co-developed-by: K Prateek Nayak
Signed-off-by: K Prateek Nayak
---
 include/linux/thread_info.h | 43 +
 1 file changed, 43 insertions(+)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f4..1e10dd8c0227 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -195,6 +195,49 @@ static __always_inline bool tif_need_resched(void)

 #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */

+#ifdef TIF_NOTIFY_IPI
+
+#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
+
+static __always_inline bool tif_notify_ipi(void)
+{
+	return arch_test_bit(TIF_NOTIFY_IPI,
+			     (unsigned long *)(&current_thread_info()->flags));
+}
+
+static __always_inline void current_clr_notify_ipi(void)
+{
+	arch_clear_bit(TIF_NOTIFY_IPI,
+		       (unsigned long *)(&current_thread_info()->flags));
+}
+
+#else
+
+static __always_inline bool tif_notify_ipi(void)
+{
+	return test_bit(TIF_NOTIFY_IPI,
+			(unsigned long *)(&current_thread_info()->flags));
+}
+
+static __always_inline void current_clr_notify_ipi(void)
+{
+	clear_bit(TIF_NOTIFY_IPI,
+		  (unsigned long *)(&current_thread_info()->flags));
+}
+
+#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
+
+#else /* !TIF_NOTIFY_IPI */
+
+static __always_inline bool tif_notify_ipi(void)
+{
+	return false;
+}
+
+static __always_inline void current_clr_notify_ipi(void) { }
+
+#endif /* TIF_NOTIFY_IPI */
+
 #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
 static inline int arch_within_stack_frames(const void * const stack,
 					   const void * const stackend,
--
2.34.1
[RFC PATCH 00/14] Introducing TIF_NOTIFY_IPI flag
...taken to complete a fixed set of IPIs using ipistorm improves
drastically. Following are the numbers from the same dual socket 3rd
Generation EPYC system (2 x 64C/128T) (boost on, C2 disabled) running
ipistorm between CPU8 and CPU16:

cmdline: insmod ipistorm.ko numipi=10 single=1 offset=8 cpulist=8 wait=1

  ==================================================================
  Test            : ipistorm (modified)
  Units           : Normalized runtime
  Interpretation  : Lower is better
  Statistic       : AMean
  ==================================================================
  kernel:                               time [pct imp]
  tip:sched/core                        1.00 [0.00]
  tip:sched/core + revert               0.81 [19.36]
  tip:sched/core + TIF_NOTIFY_IPI       0.20 [80.99]

The same experiment was repeated on a dual socket ARM server (2 x 64C)
which too saw a significant improvement in ipistorm performance:

  ==================================================================
  Test            : ipistorm (modified)
  Units           : Normalized runtime
  Interpretation  : Lower is better
  Statistic       : AMean
  ==================================================================
  kernel:                               time [pct imp]
  tip:sched/core                        1.00 [0.00]
  tip:sched/core + TIF_NOTIFY_IPI       0.41 [59.29]

netperf and tbench results with the patch match the results on tip on the
dual socket 3rd Generation AMD system (2 x 64C/128T). Additionally,
hackbench, stream, and schbench too were tested, with results from the
patched kernel matching those of tip.

Future Work
===========

Evaluate the impact of newidle_balance() when the scheduler tick hits an
idle CPU. The call to newidle_balance() will be skipped with the
TIF_NOTIFY_IPI solution, similar to [2]. A counter argument for this case
is that if the idle state did not set the TIF_POLLING bit, the idle CPU
would not have called schedule_idle() unless the IPI handler set the
NEED_RESCHED bit.

Links
=====

[1] https://github.com/antonblanchard/ipistorm
[2] https://lore.kernel.org/lkml/20240119084548.2788-1-kprateek.na...@amd.com/
[3] https://lore.kernel.org/lkml/b4f5ac150685456cf45a342e3bb1f28cdd557a53.ca...@linux.intel.com/
[4] https://lore.kernel.org/lkml/20240123211756.GA221793@maniforge/
[5] https://lore.kernel.org/lkml/cakftptc446lo9catpp7pexdklhhqfobuy-jmgc7agohy4hs...@mail.gmail.com/

This series is based on tip:sched/core at tag "sched-core-2024-01-08".
--- Gautham R. Shenoy (4): thread_info: Add helpers to test and clear TIF_NOTIFY_IPI sched: Define a need_resched_or_ipi() helper and use it treewide sched/core: Use TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING mode of pending IPI x86/thread_info: Introduce TIF_NOTIFY_IPI flag K Prateek Nayak (10): arm/thread_info: Introduce TIF_NOTIFY_IPI flag alpha/thread_info: Introduce TIF_NOTIFY_IPI flag openrisc/thread_info: Introduce TIF_NOTIFY_IPI flag powerpc/thread_info: Introduce TIF_NOTIFY_IPI flag sh/thread_info: Introduce TIF_NOTIFY_IPI flag sparc/thread_info: Introduce TIF_NOTIFY_IPI flag csky/thread_info: Introduce TIF_NOTIFY_IPI flag parisc/thread_info: Introduce TIF_NOTIFY_IPI flag nios2/thread_info: Introduce TIF_NOTIFY_IPI flag microblaze/thread_info: Introduce TIF_NOTIFY_IPI flag --- Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Russell King Cc: Guo Ren Cc: Michal Simek Cc: Dinh Nguyen Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: "Naveen N. Rao" Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: "David S. Miller" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: "Rafael J. Wysocki" Cc: Daniel Lezcano Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider Cc: Al Viro Cc: Linus Walleij Cc: Ard Biesheuvel Cc: Andrew Donnellan Cc: Nicholas Miehlbradt Cc: Andrew Morton Cc: Arnd Bergmann Cc: Josh Poimboeuf Cc: "Kirill A. 
Shutemov" Cc: Rick Edgecombe Cc: Tony Battersby Cc: Brian Gerst Cc: Tim Chen Cc: David Vernet Cc: x...@kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-al...@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org Cc: linux-c...@vger.kernel.org Cc: linux-openr...@vger.kernel.org Cc: linux-par...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux...@vger.kernel.org Cc: sparcli...@vger.kernel.org Cc: linux...@vger.kernel.org --- arch/alpha/include/asm/thread_info.h | 2 ++ arch/arm/include/asm/thread_info.h| 3 ++ arch/csky/include/asm/thread_info.h | 2 ++ arch/microblaze/include/asm/thread_info.h | 2 +