Re: [PATCH 4/6] sched/isolation: Offload residual 1Hz scheduler tick

2018-02-20 Thread Frederic Weisbecker
On Sat, Feb 17, 2018 at 11:50:52AM +0100, Thomas Gleixner wrote:
> On Thu, 15 Feb 2018, Frederic Weisbecker wrote:
> 
> > When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> > keep the scheduler stats alive. However this residual tick is a burden
> > for bare metal tasks that can't stand any interruption at all, or want
> > to minimize them.
> > 
> > The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> > outsource these scheduler ticks to the global workqueue so that a
> > housekeeping CPU handles those remotely. The sched_class::task_tick()
> > implementations have been audited and look safe to be called remotely
> > as the target runqueue and its current task are passed as parameters
> > and no CPU-local state appears to be accessed.
> 
> That scares me a bit. Not for the current state of affairs, but we want to
> ensure that this still works two years from now.
> 
> So at least you want to add a comment to task_tick() which explains the
> constraints which come with the remote tick.

Good point, I'm adding that.

> 
> Other than that this looks good!
> 
> Reviewed-by: Thomas Gleixner 

Thanks!


Re: [PATCH 4/6] sched/isolation: Offload residual 1Hz scheduler tick

2018-02-17 Thread Thomas Gleixner
On Thu, 15 Feb 2018, Frederic Weisbecker wrote:

> When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
> keep the scheduler stats alive. However this residual tick is a burden
> for bare metal tasks that can't stand any interruption at all, or want
> to minimize them.
> 
> The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
> outsource these scheduler ticks to the global workqueue so that a
> housekeeping CPU handles those remotely. The sched_class::task_tick()
> implementations have been audited and look safe to be called remotely
> as the target runqueue and its current task are passed as parameters
> and no CPU-local state appears to be accessed.

That scares me a bit. Not for the current state of affairs, but we want to
ensure that this still works two years from now.

So at least you want to add a comment to task_tick() which explains the
constraints which come with the remote tick.

Other than that this looks good!

Reviewed-by: Thomas Gleixner 


[PATCH 4/6] sched/isolation: Offload residual 1Hz scheduler tick

2018-02-14 Thread Frederic Weisbecker
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
keep the scheduler stats alive. However this residual tick is a burden
for bare metal tasks that can't stand any interruption at all, or want
to minimize them.

The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
outsource these scheduler ticks to the global workqueue so that a
housekeeping CPU handles those remotely. The sched_class::task_tick()
implementations have been audited and look safe to be called remotely
as the target runqueue and its current task are passed as parameters
and no CPU-local state appears to be accessed.

Note that when using isolcpus, it's still up to the user to affine the
global workqueues to the housekeeping CPUs through
/sys/devices/virtual/workqueue/cpumask or domain isolation
("isolcpus=nohz,domain").

Signed-off-by: Frederic Weisbecker 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Luiz Capitulino 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Ingo Molnar 
---
 kernel/sched/core.c  | 92 
 kernel/sched/isolation.c |  4 +++
 kernel/sched/sched.h |  2 ++
 3 files changed, 98 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f09ec5c..86eefc4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3120,6 +3120,96 @@ u64 scheduler_tick_max_deferment(void)
 
return jiffies_to_nsecs(next - now);
 }
+
+struct tick_work {
+   int cpu;
+   struct delayed_work work;
+};
+
+static struct tick_work __percpu *tick_work_cpu;
+
+static void sched_tick_remote(struct work_struct *work)
+{
+   struct delayed_work *dwork = to_delayed_work(work);
+   struct tick_work *twork = container_of(dwork, struct tick_work, work);
+   int cpu = twork->cpu;
+   struct rq *rq = cpu_rq(cpu);
+   struct rq_flags rf;
+
+   /*
+* Handle the tick only if it appears the remote CPU is running in full
+* dynticks mode. The check is racy by nature, but missing a tick or
+* having one too much is no big deal because the scheduler tick updates
+* statistics and checks timeslices in a time-independent way,
+* regardless of when exactly it is running.
+*/
+   if (!idle_cpu(cpu) && tick_nohz_tick_stopped_cpu(cpu)) {
+   struct task_struct *curr;
+   u64 delta;
+
+   rq_lock_irq(rq, &rf);
+   update_rq_clock(rq);
+   curr = rq->curr;
+   delta = rq_clock_task(rq) - curr->se.exec_start;
+
+   /*
+* Make sure the next tick runs within a reasonable
+* amount of time.
+*/
+   WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
+   curr->sched_class->task_tick(rq, curr, 0);
+   rq_unlock_irq(rq, &rf);
+   }
+
+   /*
+* Run the remote tick once per second (1Hz). This arbitrary
+* frequency is large enough to avoid overload but short enough
+* to keep scheduler internal stats reasonably up to date.
+*/
+   queue_delayed_work(system_unbound_wq, dwork, HZ);
+}
+
+static void sched_tick_start(int cpu)
+{
+   struct tick_work *twork;
+
+   if (housekeeping_cpu(cpu, HK_FLAG_TICK))
+   return;
+
+   WARN_ON_ONCE(!tick_work_cpu);
+
+   twork = per_cpu_ptr(tick_work_cpu, cpu);
+   twork->cpu = cpu;
+   INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
+   queue_delayed_work(system_unbound_wq, &twork->work, HZ);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void sched_tick_stop(int cpu)
+{
+   struct tick_work *twork;
+
+   if (housekeeping_cpu(cpu, HK_FLAG_TICK))
+   return;
+
+   WARN_ON_ONCE(!tick_work_cpu);
+
+   twork = per_cpu_ptr(tick_work_cpu, cpu);
+   cancel_delayed_work_sync(&twork->work);
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+int __init sched_tick_offload_init(void)
+{
+   tick_work_cpu = alloc_percpu(struct tick_work);
+   BUG_ON(!tick_work_cpu);
+
+   return 0;
+}
+
+#else /* !CONFIG_NO_HZ_FULL */
+static inline void sched_tick_start(int cpu) { }
+static inline void sched_tick_stop(int cpu) { }
 #endif
 
 #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
@@ -5781,6 +5871,7 @@ int sched_cpu_starting(unsigned int cpu)
 {
set_cpu_rq_start_time(cpu);
sched_rq_cpu_starting(cpu);
+   sched_tick_start(cpu);
return 0;
 }
 
@@ -5792,6 +5883,7 @@ int sched_cpu_dying(unsigned int cpu)
 
/* Handle pending wakeups and then migrate everything off */
sched_ttwu_pending();
+   sched_tick_stop(cpu);
 
rq_lock_irqsave(rq, &rf);
if (rq->rd) {
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index a2500c4..39f340d 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include